Friday, November 9, 2012
When is using attributable risk (AR) a good thing and when is something like a hazard better?
AR can be a time-independent quantity: the proportion of total deaths, say, that are due to an infection, or equivalently the proportion of the total population that would have been spared had they not been infected.
A temporal version is also available, built from the c.d.f.s of the event times, which in the limit recovers the time-independent case.
For rare events, i.e. when the probability of survival is approximately 1, the formula can be posed in terms of hazard functions: it involves the total hazard and the hazard conditional on being uninfected.
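To pin that down, here is the standard attributable-fraction algebra (textbook definitions, not taken from a particular source; $F$ is the event-time c.d.f. and $\bar{E}$ means uninfected):
$$ \mathrm{AR}(t) = \frac{F(t) - F_{\bar{E}}(t)}{F(t)} \approx 1 - \frac{H_{\bar{E}}(t)}{H(t)}, $$
where the approximation uses $F(t) = 1 - e^{-H(t)} \approx H(t)$ when survival is close to 1, so the temporal AR reduces to one minus the ratio of the uninfected cumulative hazard to the total cumulative hazard.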
Monday, October 29, 2012
Pre-headlines
I'm now a fully signed-up contributor to the Before the Headlines team. The idea is to support journalists in getting the right science stories out there in the best way. That means nipping unsupported, fanciful claims in the bud and getting the messages across in a clear and accurate way, but without boring-ifying them in the process.
More thoughts on the expected excess LOS
I realised that the expected LOS at a given time $s$, denoted $E[T|s]$, is the same as the life expectancy I'm more used to from life tables. The most common $s$ in this case is life expectancy at birth, $E[T|s=0]$, which would be like stay expectancy at admission in the HCAI model.
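In survival notation (a standard identity, assuming a continuous survival function $S$ for $T$), this is just the mean residual life shifted back to the origin:
$$ E[T \mid T > s] = s + \frac{\int_s^{\infty} S(u)\, du}{S(s)}, $$
so "stay expectancy at admission" is simply $E[T] = \int_0^{\infty} S(u)\, du$.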
Of course, life expectancy can be estimated from any age. Your life expectancy changes as you get older by the simple fact that you've already survived up to that point, so your life expectancy at 60 is probably more than your life expectancy at 0.
What we are interested in for the excess LOS measure is the average difference in LOS over all times in hospital. This is equivalent to the average life expectancy over all ages, i.e. if we don't know someone's age, what would be our best guess of how long they'll live? Since a population is not evenly distributed over ages, it makes sense to weight the more likely ages more heavily and the less likely ages less so. [I think this is called the overall life expectancy and is closely related to all-age-all-cause mortality.]
The excess LOS is similar but slightly different. In this case, we're looking at the difference between two holding times: the infected "life expectancy" and the uninfected "life expectancy". As in the life-table context, we want to weight the times at which there are more individuals in the states of interest: not simply the alive state, but the infected and non-infected states.
So we prefer times with high probability of being in state 0 and high probability of being in state 1. Put another way, we prefer times with low probability of being in state 2, the death/discharge state.
The probability of being in state 0 at time $s$ is simply the probability of not having left it before $s$.
The probability of being in state 1 at $s$ is the probability of having jumped to it at some time before $s$ and not having left.
What might be easier is to think in terms of $1 - P(\text{in state 2 at time } s)$: since state 2 is a sink, the probability of being in it at time $s$ is just the probability of having entered it at any time up to $s$, i.e. a c.d.f.
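For a Markov illness-death model with transition hazards $\alpha_{01}$, $\alpha_{02}$ and $\alpha_{12}$ (standard formulas, written here as a sketch), these occupation probabilities are
$$ P_{00}(0,s) = \exp\left(-\int_0^s \big(\alpha_{01}(u) + \alpha_{02}(u)\big)\, du\right), \qquad P_{01}(0,s) = \int_0^s P_{00}(0,u)\, \alpha_{01}(u)\, P_{11}(u,s)\, du, $$
and the sink-state c.d.f. above is then $P(\text{state 2 at } s) = 1 - P_{00}(0,s) - P_{01}(0,s)$.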
Tuesday, October 23, 2012
Hospital Length of Stay (LOS)
The main equation in this work is
$$ f(s) = E[T \mid X_s=1] - E[T \mid X_s=0] \qquad (*) $$
This is the difference between the expected time from admission to discharge/death, $T$, given infected at time $s$ and given not infected at time $s$.
This means that those individuals who are not infected at $s$ could become so in the future. It is a snapshot of the case-control split in the sample at time $s$, so it doesn't tell us what will happen afterwards. Those in state 0 at time $s$ may become infected at some time in the future, or they may not pass go and jump straight to the sink state.
Intuitively, if we think about this in a latent-time/counterfactual way, then while patients are still in state 0, even if they do jump to state 1 (infected) later, they still would have jumped to the sink at a time after that.
By setting things up like this, we count the holding time in state 0 prior to either infection or death/discharge as non-infected length-of-stay time, and so we avoid biasing the infected LOS times by failing to account for the two-way causality.
A weighting game
Eqn (*) is averaged over all $s$ to give an expected excess length of stay. The question is how to choose the weightings. It is suggested to weight more heavily the days when there are more infections, or more generally when there are more jumps out of state 0, regardless of whether to infection or death/discharge. By weighting in this way, more emphasis is placed on the excess LOS on days when more happens. That is to say, when there are larger changes in the state populations and risk sets, the difference in LOS is more influential on the estimate. Intuitively this makes sense, since otherwise we would count days when there is little or no change in the system.

For example, the times when transitions from 0 to 1 occur are the times before which the jumping individuals and those that remain in state 0 have been in state 0 together. That is, they have the same history (filtration) up to that time, say $s$, and then diverge at that time. So a comparison of the LOS between these two groups is a comparison accounting for the uninfected time too, i.e. time-dependent. Conversely, the times at which transitions from 0 to 2 occur belong to individuals that do not have an associated group transitioning from 0 to 1. So at such a time $s$ we are cleaning up the sample, removing individuals that aren't helpful in the comparison between the infected and non-infected groups.
This rationale is applied probabilistically over the continuous variable $s$, rather than at the discrete time points used above. The discrete approximation could be useful for checking, though.
The excess LOS is a weighted mean of the separate LOS differences for each $s$. If we think about this as a sample-size problem, we would place more weight on the larger samples and less on the smaller ones. In essence, this places emphasis on the points that contain more information. In the LOS context, this corresponds to placing more weight on the times at which there has just been a transition to state 1, the infected state. Obviously, the infected individuals are most likely to be in this state at the beginning of their holding time.
Beyersmann also includes the times at which there has been a transition from 0 to 2, the death/discharge state. This is a removal of individuals that no longer contribute to the case-control comparison. To me this is a less obvious thing to include in the weights.
If we think about it for a countable set of uniformly spaced $s$, then the proportion of the interval $[0,T]$ comprising admission (uninfected) time and the proportion comprising infection time will determine the influence of $T$ on the non-infected and infected LOS respectively.
Now, if we position the times $s$ non-uniformly, so that they are closer together when there are more transitions and further apart when there are fewer, then we will pick up more of the detail and fidelity of the process.
As the days progress, the expected LOS for both infected and not infected will obviously increase. But the probability of having left state 0 will also increase as the population continues to diminish and be absorbed into state 2, since the survival function is monotonically decreasing.
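Here is a toy simulation sketch of that weighted average (my own illustration with made-up exponential rates, not Beyersmann's actual estimator): evaluate $f(s)$ on a grid and weight each grid point by the number of patients leaving state 0 in its interval.
set.seed(1)
n <- 5000
t_inf <- rexp(n, 0.05)    # potential 0 -> 1 (infection) time
t_out0 <- rexp(n, 0.10)   # potential 0 -> 2 (death/discharge) time
infected <- t_inf < t_out0
t_extra <- rexp(n, 0.08)  # holding time in state 1 after infection
los <- ifelse(infected, t_inf + t_extra, t_out0)  # total stay T
leave0 <- pmin(t_inf, t_out0)                     # time of leaving state 0
grid <- seq(0.5, 25, by = 0.5)
f <- sapply(grid, function(s) {
  in1 <- infected & t_inf <= s & los > s  # in state 1 at time s
  in0 <- leave0 > s                       # still in state 0 at time s
  mean(los[in1]) - mean(los[in0])         # E[T|X_s=1] - E[T|X_s=0]
})
# weight each grid point by the number of exits from state 0 in its interval
w <- hist(pmin(leave0, max(grid)), breaks = c(0, grid), plot = FALSE)$counts
ok <- is.finite(f)
sum(w[ok] * f[ok]) / sum(w[ok])  # weighted excess LOS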
Tuesday, October 9, 2012
Instantaneous measures
I've been reading about influence functions,
$$ \frac{d}{d\varepsilon}\, t\big((1-\varepsilon)F + \varepsilon I_{[y,\infty)}\big)\Big|_{\varepsilon=0}. $$
These are used to quantify the influence a given data point has on a statistic $t$.
I was thinking about this as an instantaneous rate in the same sense as a hazard function.
In the limiting notation we can see that they are both types of averages across an increment, with the increment then decreased towards 0 from above to give a one-sided derivative at a point.
So the influence function is an averaged difference in the statistic of interest between two distributions (one being a mixture distribution), while the hazard rate is an average probability of transitioning within the time interval.
The influence function is slightly different from the hazard rate because the functional's argument is a weighted sum of c.d.f.s whose weights sum to 1.
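As a concrete check (a standard textbook example, not from my reading above), take $t$ to be the mean functional $t(F) = \int y\, dF(y)$, with mean $\mu$ under $F$. Then
$$ t\big((1-\varepsilon)F + \varepsilon I_{[y,\infty)}\big) = (1-\varepsilon)\mu + \varepsilon y, $$
so the derivative at $\varepsilon = 0$ is $y - \mu$: the influence of a point on the mean is just its distance from the mean.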
Sunday, September 30, 2012
Degenerate baseline hazard
"Degenerate" in this context means taken to an extreme level so, for example, a straightline is a degenerate version of a triangle. The degenerate baseline hazard seems to me to be one where its obvious given the assumptions and definitions.
$$ h_0(t_l \mid t = s) = \frac{1}{\sum_{t_j \ge t_l} \exp(X_j \beta(s))}. $$
Rearranging this gives
$$ \sum_{t_j \ge t_l} h_0(t_l \mid t = s) \exp(X_j \beta(s)) = 1. $$
This says that the sum of the hazards over the risk set just after time $t_l$ (i.e. as good as $t_l$, really) is one. By definition there is an event at time $t_l$, and since there are no ties there is exactly one. So the summation covers the case of at least one but no more than one event at time $t_l$; we don't need to consider more than one event at that time, or the probability of there being no event, leaving us with a simple addition.
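For comparison (a standard result, not taken from the source above), this is the usual Breslow-type baseline hazard increment with $d_l$ events at $t_l$,
$$ \hat{h}_0(t_l) = \frac{d_l}{\sum_{t_j \ge t_l} \exp(X_j \beta)}, $$
specialised to the no-ties case $d_l = 1$.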
Friday, September 28, 2012
Landmarking
Sold as an easy but less revealing alternative to the multistate model approach, the landmarking approach picks a grid of time points and, using the risk set at each time, fits a Cox regression up to some set time horizon. Some function of the fitted betas in the additive model, e.g. a linear combination, is then used to link the separate fits. I've been going off this paper.
As with the etm, I've just used some sample data from the IMPACT clinical model to try this method out.
So far, I've used R to produce the Cox regressions at each landmark point, but the paper then generated predictions of the survival probabilities from these points by estimating the baseline hazard (and so the baseline cumulative hazard) to go with the regression parameters.
I thought I'd run the multistate model code with the RR/hazard adjustments for interventions from the IMPACT paper and try to recreate the same figures, i.e. the proportions of incident cases that die from each outcome. Then I could repeat this, with my code, at different landmark time points and see how the hazard ratios change.
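For the record, a minimal sketch of the landmark fits themselves (my own toy version, not the paper's code; it assumes a data frame with columns time, status and a single covariate x):
library(survival)
# at each landmark s: keep subjects still at risk, administratively censor
# at s + horizon, and fit a Cox model on that subset
landmark_cox <- function(d, landmarks, horizon) {
  lapply(landmarks, function(s) {
    risk <- subset(d, time > s)  # risk set at s
    risk$status <- ifelse(risk$time > s + horizon, 0, risk$status)
    risk$time <- pmin(risk$time, s + horizon)  # censor at the horizon
    coxph(Surv(time, status) ~ x, data = risk)
  })
}
set.seed(1)
d <- data.frame(time = rexp(500, 0.1), status = rbinom(500, 1, 0.8), x = rnorm(500))
fits <- landmark_cox(d, landmarks = c(0, 2, 4, 6), horizon = 5)
sapply(fits, coef)  # how the log hazard ratio estimate moves with s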
Monday, September 17, 2012
CHD multi-state model transition probabilities
I've been using the etm package in R to produce the empirical transition probabilities using the CHD simulation data from IMPACT. These are some of the output plots

where (because of space):
1="AMI"
2="CA",
3="Early HF",
4="Healthy"
5="MI Recur",
6="MI Surv",
7="SD",
8="Severe HF",
9="UA",
10="CHD Death",
11="Non CHD Death"
Below is the empirical transition matrix for 60->90 year olds, i.e. $\widehat{P}(60,90)$:
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 0 | 0 | 0.0317885 | 0 | 0 | 0 | 0.545821 | 0.4223904 |
| 2 | 0 | 0.0980707 | 0 | 0 | 0 | 0.0317717 | 0 | 0 | 0 | 0.3690474 | 0.5011102 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0.0265604 | 0 | 0 | 0 | 0.0234626 | 0 | 0 | 0 | 0.6691671 | 0.2808099 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.8835166 | 0.1164834 |
| 6 | 0 | 0 | 0 | 0 | 0 | 0.0317885 | 0 | 0 | 0 | 0.545821 | 0.4223904 |
| 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0.0327666 | 0 | 0 | 0 | 0.5336655 | 0.4335679 |
| 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
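For reference, a minimal sketch of the sort of etm call that produces a matrix like this (written from memory of the package's interface, with made-up transitions, so treat it as an assumption and check ?etm before use):
library(etm)
states <- as.character(1:11)
# logical matrix of allowed transitions (TRUE where a jump is possible)
tra <- matrix(FALSE, 11, 11, dimnames = list(states, states))
tra["4", c("1", "2", "9", "11")] <- TRUE   # e.g. Healthy -> AMI, CA, UA, non-CHD death
tra["1", c("6", "7", "10", "11")] <- TRUE  # e.g. AMI -> MI Surv, SD, deaths
# d: one row per observed transition, with columns id, from, to, entry, exit
# fit <- etm(d, state.names = states, tra = tra, cens.name = "cens", s = 60, t = 90)
# fit$est  # the estimated transition probability matrix, i.e. P-hat(60, 90)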
Sunday, September 16, 2012
Product Integrals
I've just come across the term product integral whilst reading a survival analysis paper. I'm surprised that this is the first time I've seen it, but it seems the idea went out of fashion and has only really been promoted in survival analysis circles because of its usefulness in linking cumulative hazards and Kaplan-Meier estimates.
The idea is simple, especially when you know what regular, run-of-the-mill integration is. Where (sum) integration is the asymptotic limit of sums over smaller and smaller intervals beneath a curve, i.e. the continuous analogue of a summation, the product integral is the limit of products of factors over smaller and smaller intervals, each factor approaching 1, i.e. the continuous analogue of taking products instead of sums.
The current notation I've seen a lot of was proposed by Gill and Johansen. I found this article by Gill useful in explaining what's going on.
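The survival-analysis link in one line (a standard identity, stated as a sketch): the survival function is the product integral of the cumulative hazard,
$$ S(t) = \prod_{(0,t]} \big(1 - d\Lambda(s)\big), $$
and plugging in the Nelson-Aalen increments $d\hat{\Lambda}(t_i) = d_i/n_i$ recovers exactly the Kaplan-Meier estimator $\prod_{t_i \le t}(1 - d_i/n_i)$.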
Friday, August 31, 2012
Conditional survival times
Using this paper explaining how to generate conditional survival times, and the survival library in R, I created some examples.
Here's an example for the Weibull distribution and a conditional $T>2$ event time against the non-conditional case.

The survival function is given by
$$ 1-F(t) = S(t) = \exp(-H(t))$$
So, by inverting the formulae we can generate event times using the cumulative hazard function
$$ T = H^{-1}(-\log U)$$
Now, substituting in the conditional survival function (with hazard adjustment $a$)
$$ S(t|t>T_k) = \exp(-a(H(t) - H(T_k)))$$
And rearranging as before gives
$$ T = H^{-1}\left(-\frac{\log U}{a} + H(T_k)\right) - T_k. $$
The code looks like this
library(survival)
n <- 10000
nbreak <- 100  # (not used below)
lamb <- 2      # Weibull rate parameter
nu <- 2        # Weibull shape parameter
r <- 1         # hazard adjustment, the 'a' in the formulae above
tmin <- 1      # conditional minimum event time e.g. intervention time, T_k
# Weibull cumulative hazard and its inverse
H_Wei <- function(t){
  lamb * (t^nu)
}
invH_Wei <- function(t){
  (1/lamb * t)^(1/nu)
}
rv_unif <- runif(n)  # U ~ Uniform(0,1)
# unconditional event times: T = H^{-1}(-log(U)/a)
T <- invH_Wei(-log(rv_unif)/r)
# residual times given T > tmin, then shift back to the time origin
T2 <- invH_Wei(-log(rv_unif)/r + H_Wei(tmin)) - tmin
Tcond <- T2 + tmin
plot(survfit(Surv(Tcond)~1))
lines(survfit(Surv(T)~1))
Sunday, August 26, 2012
Population distribution in Manchester
I was just playing with the Census 2011 data on the ONS site and saw an interesting pattern. It seems that the population in Manchester has a proportionately large number of 20-30 year olds compared with the rest of the country. I imagine this is due to students who decide to stay after their studies, and people being less inclined to head to the country to have kids, at an earlier age at least.
I'm going to call these distributions the CHRISTMAS TREE and the GHERKIN.