- Discussion of issues
in C&H's Chapter 04 (Consecutive Follow-up Intervals)
and JH's
Notes and Assignment on this chapter

Answers to be handed in for: (Supplementary) Exercises 4.1, 4.2, 4.8, 4.9, 4.10, 4.11, 4.12

**Remarks on Notes**:

These notes were developed to supplement the Clayton and Hills chapter, which was aimed at epidemiologists, and which does not give the derivations (the 'wiring' and 'theory') beneath the results (the user's view of the car).

It is important to read C&H first, before JH's notes.

The core topics in this chapter are non-parametric (or more precisely, distribution-free) approaches to estimating survival curves, and the associated functions (e.g., pdf and hazard function) that can be derived from them. Last week, in the orientation to ML estimation, several of our examples involved specific candidate distributions for r.v.'s that take on values on the (0, Inf) scale, such as the exponential, gamma, log-normal, etc. But, probably to your surprise, you will in one of the exercises learn that whereas the Kaplan-Meier estimator is usually described as a non-parametric estimator, it can also be shown to be **THE** survival curve, among **ALL POSSIBLE SURVIVAL CURVES**, that makes the observed data most likely; and so it is sometimes referred to as a **non-parametric MLE** -- almost a contradiction in terms, especially when we emphasize that for ML one needs to specify a distribution with a full (parametric) form.

C&H develop the K-M estimator very 'naturally' by slicing time finer and finer, so that most conditional survival probabilities in the product are unity, and can be omitted, leaving just the (less than unity) conditional survival probabilities for the time-bands that contain >= 1 event.

One could begin even further back, and consider what the empirical cdf(t) and thus its complement, the empirical S(t), would look like if there were no censoring. In this case, when we got to the t where a cumulative total of k subjects have made the transition from the initial state, the empirical S(t) would be

[(n-1)/n] x [(n-2)/(n-1)] x [(n-3)/(n-2)] x ... x [(n-{k-1})/(n-{k-2})] x [(n-k)/(n-{k-1})]

and this simplifies to (n-k)/n, because the k-1 terms in the numerators cancel the same k-1 terms in the denominators.

In the K-M version, it's the same structure, BUT (because of censoring) not all of the 'survivors' of one time-band experience the next time-band. The no. at risk (the risk set) gets progressively smaller, not just because of the transitions, but also because of the 'staggered' entries and the losses to follow-up.
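The product-limit arithmetic above can be sketched in code. Below is a minimal Python sketch (the exercises themselves use R; the function name `km_curve` and the toy data are invented for illustration): at each distinct event time the running product is multiplied by (1 - d_j/n_j), and with no censoring the product telescopes to (n-k)/n.

```python
# Minimal product-limit (Kaplan-Meier) sketch with hypothetical data.
# At each distinct event time t_j:  S <- S * (1 - d_j / n_j),
# where n_j = number still at risk and d_j = number of events at t_j.

def km_curve(times, events):
    """Return [(t_j, S(t_j))] at each distinct event time.
    times: follow-up times; events: 1 = event, 0 = censored."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    s = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for ti, ei in data if ti == t and ei == 1)  # events at t
        c = sum(1 for ti, ei in data if ti == t and ei == 0)  # censored at t
        if d > 0:
            s *= 1 - d / n_at_risk
            curve.append((t, s))
        n_at_risk -= d + c          # the risk set shrinks for BOTH reasons
        while i < len(data) and data[i][0] == t:
            i += 1
    return curve

# No censoring: S at the k-th event time is (n-k)/n, as in the telescoping product.
print(km_curve([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))
# With censoring, the risk set shrinks faster and the steps are unequal.
print(km_curve([5, 7, 10], [1, 0, 1]))
```

Note the usual convention: an observation censored at t is still counted in the risk set for events occurring at t, and is removed only afterwards.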

-----

Since the C&H book was written, the Nelson-Aalen estimator has become more popular, and it is now found in all good survival analysis packages. So it deserves some study, and to be properly understood. As JH notes at the end of his expository article, there is some confusion as to what an N-A curve means, since it is often taken to mean the integral of the estimated hazard function (JH thinks this is the more common meaning). But it is sometimes used to refer to the survival curve, exp[-integral of the estimated hazard function], that one can derive from the integrated hazard function. If we want to think of K-M and N-A curves 'in parallel', then it is this latter version (a downwards-travelling step-function, taking on values from 1 towards 0) that makes the N-A step-function and the K-M step-function very close cousins.
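To see the 'close cousins' point numerically, here is a Python sketch (hypothetical data; the function name `na_and_km` is invented) computing both the exp[-cumulative hazard] version of N-A and the K-M product side by side. Since exp(-x) >= 1 - x, the N-A based survival curve always sits at or above the K-M curve, and never quite reaches 0.

```python
import math

# Nelson-Aalen cumulative hazard H(t) = sum over t_j <= t of d_j / n_j, and the
# derived survival curve exp(-H(t)), alongside the K-M product (hypothetical data).
def na_and_km(times, events):
    data = sorted(zip(times, events))
    n = len(data)
    H, km = 0.0, 1.0
    rows = []
    i = 0
    while i < len(data):
        t = data[i][0]
        d = sum(1 for ti, ei in data if ti == t and ei == 1)
        c = sum(1 for ti, ei in data if ti == t and ei == 0)
        if d > 0:
            H += d / n               # N-A increment
            km *= 1 - d / n          # K-M factor
            rows.append((t, math.exp(-H), km))
        n -= d + c
        while i < len(data) and data[i][0] == t:
            i += 1
    return rows

for t, s_na, s_km in na_and_km([2, 3, 3, 5, 8, 9, 12], [1, 1, 0, 1, 0, 1, 1]):
    print(t, round(s_na, 4), round(s_km, 4))
```

When the risk sets are large and the d_j small, the two columns agree to several decimals; the gap opens up when d_j/n_j is large, e.g., at the last event time, where K-M drops to 0 but exp(-H) does not.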

**Remarks on assigned (and non-assigned*) exercises**.

(* This year, we will not cover the material behind Q4.3-Q4.5 (KM as the non-parametric MLE or NPMLE; redistribution-to-the-right; and a 'self-consistent' estimator). These are additional insights that PhD students in Biostatistics should know, and be prepared to answer in the comprehensive examination. Q4.7 addresses another way to look at part ii of Q4.1, and provides some insight for the N-A estimator in Q4.8 and Q4.9.)

The exercises are also designed to i. get you familiar with the Greenwood formula, and with how to obtain K-M and N-A 'curves' via R, ii. appreciate why, when, and by how much they differ, and iii. see some live examples of survival analysis and infection-rate analysis, and see how interval-censored observations (such as those from HIV testing) are sometimes simplified in actual analyses, especially if, as in the Kenya and Uganda examples, simplifying the data doesn't change the estimates very much.

Before you start, you might wish to look at the Bridge-of-Life animations on JH's website.

http://www.biostat.mcgill.ca/hanley/BridgeOfLife/

**4.1**

Results 1 and 2 are fundamental.

Result 1 is often used to estimate the mean duration, based on censored data that have been converted to an (effectively) empirical CDF or 1-CDF. It was used by one of our seminar speakers last year to calculate the average post-treatment survival for patients with newly-diagnosed bladder cancer.

Incidentally, if you look at the blue curve in the animation (or look at the ages of the tumblers at any time after the process has reached steady state) you will see that it gives the current-age distribution of the living.

The total number of deaths is obtained by multiplying the hazard function at each age by the person-years lived in the corresponding age-bin, and summing these products. How does that fit with what you are asked to prove in part 1? Does it not just confirm the definition of the hazard function? deaths in age-bin a = hazard(a) x S(a) x da, i.e., hazard(a) = density(a) / S(a).

So, for each age-bin to generate the appropriate number of deaths, i.e., to generate the correct distribution of ages at death, the number alive in bin 'a' must be [proportional to] S(a).
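This accounting can be checked numerically. Here is a small Python sketch (the constant-hazard exponential distribution with rate 0.1 is an arbitrary choice for illustration): density(a)/S(a) recovers the hazard in every bin, and summing hazard(a) x S(a) x da over the bins accounts for (essentially) all the deaths.

```python
import math

# Check hazard(a) = density(a) / S(a), and that hazard(a) * S(a) * da,
# summed over age-bins, accounts for all deaths. Exponential(rate = 0.1)
# is used as an arbitrary stand-in distribution (constant hazard).
rate = 0.1
da = 0.01
total_deaths = 0.0
for i in range(int(200 / da)):               # ages 0 to 200, well past the tail
    a = i * da
    S_a = math.exp(-rate * a)                # survival at age a
    density = rate * math.exp(-rate * a)     # pdf at age a
    assert abs(density / S_a - rate) < 1e-9  # hazard recovered as density / S
    total_deaths += rate * S_a * da          # deaths generated in [a, a + da)
print(round(total_deaths, 2))                # close to 1: everyone dies exactly once
```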

Result 2 is also fundamental, especially if you want to understand where the Nelson-Aalen estimator of S(t) comes from. Most times, the result is derived by solving the differential equation implicit in the definition of the hazard function. But see question 4.7 for another way to look at it. This alternative derivation also ties in with the idea of 'generations' that we touched on when dealing with the Poisson distribution (see the 'Poisson generations' graph on the webpage).

**4.2**

One message from this question is how little one loses by discretizing time. Once, when plotting methods were still very crude and inaccessible, JH saw a physician making a K-M curve using a crude program. Because the person had measured survival in days, the number of steps was very large, and it was hard to draw them all manually. The person was very relieved to be told that he would not lose much information by rounding the times to the nearest week or month.

The point of part 3 of this question is to give some intuition for the name 'product limit' that K and M gave their estimator. JH has put the full text of their 1958 'classic' on the webpage. Meier died in 2011; see the NYTimes or Google "Paul Meier obituary".

Before testifying in court in 2013, JH was very glad to read Meier's advice on what (not) to do in such situations. See his article in the American Statistician Feb 2013 Issue on Statistics and the Law.

**4.3**

As we remarked above, this aspect of the K-M estimator is unusual. But why not think of it this way: imagine you can choose ANY distribution you wish (as long as it's a legitimate cdf), and that its cdf is simply called a 'no-name-cdf' (it could have vertical jumps, and not be a smooth function such as we have entertained so far). Then, in this example, what would the Likelihood be? Wouldn't it be (no matter what cdf or S(t) we choose),

prob[1st observation | this cdf or this S(t) function]

x

prob[2nd observation | this cdf or this S(t) function]

x

prob[3rd observation | this cdf or this S(t) function].

Since the 2nd observation is that the transition (event) will occur at some time point after t=7, i.e., it is right-censored at 7, prob[2nd observation | this cdf or this S(t) function] is S(7) or 1-CDF(7). So you read this off from the candidate S(t) function you are 'trying on' for size.

For the likelihood contribution from the 1st observation, we note that this is an uncensored observation, or if you like, 'interval-censored' within a narrow interval that contains the value 5. We need the probability of observing this. Shouldn't we, by analogy with when we are constructing an empirical cdf for n uncensored values, put a probability 'spike' or 'point mass' at t=5? The question is how much to put? If all n observations were uncensored, we would put a mass of 1/n at each value.

Likewise, we would need to put some probability mass at t=10. The question is where else (if anywhere) we should put some mass. How about 1/3 at t=5, 1/3 at t=10, and the other 1/3 spread out uniformly over the interval t=7 to t=9, say? If we did this, the S(t) curve would equal 1 until t=5, take a vertical dive at t=5, then head horizontally (at a height of 2/3) until t=7, then head downwards from t=7 until it reaches S(9)=1/3, then head straight across to S(10)=1/3, then down to S(10+)=0.

We can now calculate the L under this 'candidate' S(t) function:

1/3 x 2/3 x 1/3 = 2/27

How about 1/3 mass at t=5, 1/2 mass at t=10, and the other 1/6 spread out uniformly over the interval t=7 to t=9, say? If we did this, the S(t) curve would equal 1 until t=5, take a vertical dive at t=5, then head horizontally (at a height of 2/3) until t=7, then head downwards until it reaches S(9)=1/2, then straight across to S(10)=1/2, then head straight down to S(10+)=0.

The L under this 'candidate' S(t) function is:

1/3 x 2/3 x 1/2 = 2/18, better than before.

If you keep reducing the mass between 7 and 9, and instead placing it at t=10, until you get to the S(t) function described in the question, you get the L under this 'candidate' S(t) function as:

1/3 x 2/3 x 2/3 = 4/27, better than either of the others.

This suggests that to maximize L, we should only put probability mass at the times of the events (the so-called 'failure' times), and NONE at the CENSORED times.

The question then is how much at each 'failure' time.
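The grid search below (a Python sketch; the names `p5`, `p10`, and `q` are invented, following the candidates above: mass p5 at t=5, p10 at t=10, remainder q spread between t=7 and t=9) confirms the conclusion numerically: since L = p5 x S(7) x p10 with S(7) = 1 - p5, the maximum is at q = 0, p5 = 1/3, p10 = 2/3, giving L = 4/27.

```python
# Grid search over placements of the unit of probability mass:
#   p5 at t=5 (event), p10 at t=10 (event), q = 1 - p5 - p10 spread over (7, 9).
# Likelihood for (event at 5, censored at 7, event at 10):
#   L = p5 * S(7) * p10,  with  S(7) = 1 - p5.
best = None
steps = 200
for i in range(steps + 1):
    for j in range(steps + 1 - i):
        p5, p10 = i / steps, j / steps
        q = 1 - p5 - p10
        L = p5 * (1 - p5) * p10
        if best is None or L > best[0]:
            best = (L, p5, p10, q)

print(best)   # maximum near (4/27, 1/3, 2/3, 0): no mass at the censored time
```

The two candidate curves worked through above correspond to the grid points (p5, p10, q) = (1/3, 1/3, 1/3) and (1/3, 1/2, 1/6), with L = 2/27 and 2/18 respectively.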

**4.4**

JH takes 'self-consistency' to mean that 'it obeys its own laws and definitions'. Thus, in the equation, at each time t, we would estimate the proportion who exceed it by counting all those who did, and by (circularly) using the full S(t) function to make an estimate for each censored observation where we are not sure whether or not it will exceed t. Clearly, the closer it already is, the higher the probability that it will exceed t. The use of the fraction S(t)/S(T_i), rather than a 0 or 1, or an 'I don't know', reflects this thinking.

To verify the self-consistency, take the given S(t) function as correct, and check at sufficient t values to be sure that the equation involving it holds true.

When re-distributing to the right, JH is reminded of the person moving out of an apartment who gives the sofa to the new person coming in, and so on and so on.
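A quick numerical check (Python; using the same three hypothetical observations as in 4.3, i.e., an event at 5, a censored observation at 7, and an event at 10, for which K-M gives S = 2/3 on [5, 10) and 0 thereafter; the function names are invented): the K-M estimate satisfies the self-consistency equation at any t you try.

```python
# Self-consistency: S(t) should equal the average, over the n observations, of
#   1               if T_i > t               (known to exceed t),
#   S(t) / S(T_i)   if T_i <= t and censored (partial credit),
#   0               if T_i <= t and an event.
def S(t):
    if t < 5:
        return 1.0
    if t < 10:
        return 2 / 3
    return 0.0

obs = [(5, 1), (7, 0), (10, 1)]   # (time, event indicator): 0 = censored

def rhs(t):
    total = 0.0
    for T, e in obs:
        if T > t:
            total += 1.0
        elif e == 0 and S(T) > 0:
            total += S(t) / S(T)
    return total / len(obs)

for t in [4, 6, 8, 9.5, 11]:
    print(t, S(t), rhs(t))        # the two columns agree at every t
```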

**4.5** and **4.6**

This question (relating to the conservation of probability, and its optimal placement so as to maximize the likelihood) involves a very useful concept. Its use is not limited to the easy (1-pass) case of right-censored data; it also applies to interval-censored data.

A former 556-557 teacher, Alain Vandal, did his PhD work on such data, and has an R package ('Icens'). (Google "Alain Vandal" and "Constrained Estimation and Likelihood Intervals for Censored Data".) Ming Gao Gu, also formerly at McGill, also worked on this problem.

**4.7**

JH is thinking of restructuring this teaching article so as to focus more on the N-A aspect, which is new for older statisticians and epidemiologists.

In April 2013, the Editor of Epidemiology wrote me:

"Dear Jim: I have a suggestion that might help you convey your ideas effectively: Change the title to "Battle of the survival functions: The challenger, Nelson-Aalen vs. the champion, Kaplan-Meier. Intuition, theory and practice." You can explain what the two estimators do, and show the different underlying theory that you have provided. But you would also need to evaluate conditions under which the estimators give discordant messages, and which estimator is better for various purposes when they are discordant. If you're willing to tackle this, and if the new version appears to be a useful contribution, I will send it for review. If this seems too discouraging, I would certainly understand. In that case, please notify our office so we can close your file. "

What do you think?

**4.8**

Part 1 asks you to write out the ID(t) function, integrate it, and plug it into the general formula linking S(t) and the integral.

In Part 2, you can start at either estimator and show when it is close to the other one.

In Part 3, in practice, it is best to work with the variance of the log of the estimator of S(t), and to form CIs on the log scale, then back-calculate at the end. So, you might want to focus on the variance of the log, and not bother with the variance of the antilog of the log.

Incidentally, textbooks are divided on whether to take d_j as Binomial or Poisson. JH favours the Poisson, as he thinks of it as a Poisson count within a time interval dt, rather than a binomial 'at' time t. You will see both variance forms. Of course, in most applications, with n_j fairly large, and d_j small, the difference is inconsequential.
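For concreteness, here is a small Python sketch of the log-scale calculation (the d_j and n_j values are hypothetical). Greenwood's binomial form uses d_j / [n_j (n_j - d_j)]; the Poisson flavour JH prefers uses d_j / n_j^2; with n_j large and d_j small they barely differ.

```python
import math

# Greenwood-type variance of log S(t), then a 95% CI for S(t) formed on the
# log scale and back-transformed. The d_j, n_j below are hypothetical.
deaths = [2, 1, 3]         # d_j at each event time
at_risk = [50, 45, 40]     # n_j at each event time

logS, var_binom, var_pois = 0.0, 0.0, 0.0
for d, n in zip(deaths, at_risk):
    logS += math.log(1 - d / n)
    var_binom += d / (n * (n - d))   # binomial (Greenwood) term
    var_pois += d / n**2             # Poisson term

S_hat = math.exp(logS)
se = math.sqrt(var_binom)
lo, hi = math.exp(logS - 1.96 * se), math.exp(logS + 1.96 * se)
print(round(S_hat, 4), (round(lo, 4), round(hi, 4)))
print(round(var_binom, 6), round(var_pois, 6))   # the two forms barely differ
```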

**4.9**

JH expects that at first glance, most people who look at the step function in Fig 4 in the teaching article take it to be a survival curve. BUT IT IS NOT. This curve is simply the person-time curve: it drops off at the time each observation is censored OR ends with the event.

It is becoming common in better journals to show K-M or N-A curves (or their complements) with some additional data-rows at the bottom, showing the numbers 'at risk' at various selected time points in the follow-up. You can think of this step function as the ultimate (most informative) version of such data. The area under it is the sum of the products of numbers of persons and time.

The questions re labels are meant to emphasize how important they are. If you want to see how silly or uninformative some labels are, and how any lay person would be confused by them, look at some of the medical journals. For JH, this is part of a biostatistician's responsibility for communication. And just because the graphs are pretty, or came out of Stata, doesn't make them correct or clear.

**4.10**

In previous years, I noticed that several students did the calculations 'by hand'. If using R, please use vectors to automate the calculations.

You might not agree exactly on the 'difference' or on its standard error.

The CI for the Risk ratio was probably calculated by working in the log-of-the-ratio scale, and using a result we derived earlier for the log of a r.v. Here, we do not have two clean binomials, as the numbers of persons being tested are fewer further out in the follow-up time. You should use the Greenwood variance for each of the two 'Risk' estimates (it takes care of the diminishing numbers at risk) then calculate the CI for the log of their ratio. i.e., use the square of each SE as the variance of the estimated risk, and apply the delta method.
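The last step can be sketched as follows (Python; the risks and Greenwood SEs below are hypothetical numbers, not taken from the exercise data): by the delta method, Var[log R] is approximately Var[R]/R^2, so the two variances add on the log-ratio scale.

```python
import math

# Delta-method CI for a risk ratio: Var[log(R1/R2)] ~ (se1/R1)^2 + (se2/R2)^2,
# where se1, se2 are the Greenwood SEs of the two risk estimates.
R1, se1 = 0.30, 0.04   # risk and Greenwood SE, group 1 (hypothetical)
R2, se2 = 0.20, 0.03   # risk and Greenwood SE, group 2 (hypothetical)

log_rr = math.log(R1 / R2)
se_log = math.sqrt((se1 / R1) ** 2 + (se2 / R2) ** 2)
lo, hi = math.exp(log_rr - 1.96 * se_log), math.exp(log_rr + 1.96 * se_log)
print(round(R1 / R2, 2), (round(lo, 3), round(hi, 3)))
```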

**4.11**

**4.12**