BIOS601 AGENDA: Thursday November 03, 2016
[updated Nov 11, 2016]
Agenda for Thursday November 03, 2016
- Discussion of issues
in C&Hs Chapter 05 (Rates)
and JH's
Notes and Assignment on this chapter
Answers to be handed in for:
(Supplementary) Exercises 5.1, 5.2,
5.3, 5.4, 5.5 5.7
Remarks on Notes:
These notes were developed to supplement the Clayton and Hills chapter,
which was aimed at epidemiologists, and which does not give the
derivations (the 'wiring' and 'theory') below the results (the user's view of the car).
It is important to read C&H first, before JH's notes.
The core topics in this chapter are rates,
both 'CONSTANT-in-time, and
VARYING-over-time.
Remarks on assigned (and non-assigned) exercises .
The exercises are also designed to i. get you familiar with
the Greenwood formula, and with how to obtain K-M and N-A
'curves' via R, ii. appreciate why and by how much they differ,
and when, and iii. see some live examples of survival-analysis
and infection-rate-analysis, and see how sometimes the fact that
interval-censored
observations (such as those from HIV testing) are simplified
in actual analyses, especially if, as in the Kenya and Uganda examples,
simplifying the data doesn't change the estimates very much.
5.1
For part 1, the likelihood has already been set up, so its merely
a matter of maximizing it (before you do so, please check that you
understand how it was et up, and that there are no typos).
For part 2, make the tests more frequent, and the intervals shorter.
Say you had weekly testing.
For the first six months, there would still be the product of 19 terms,
but now they would be 'personalized' .. e.g. the first HIV+ test might be in
a man who was negative at end of week 3 and positive by end of week 4.
In his case, the Likelihood contribution is S(4wks) - S(3wks).
If lambda
is measured in terms of years, then 3 weeks and 4 weeks are
at t.start = 3/52 years and t.end = 4/52 years. Thus the contribution is
exp[- lambda*t.start] - exp[- lambda*t.end]. but now dt is not 0.5 years,
but just dt = (1/52) years, and we can approximate the contribution by a
rectangle with base dt = 1/52 and height = pdf(3.5/52).
Since pdf(t) is lambda * exp(-lambda t), the contribution from this first
event is ??? . Now repeat for the other 18, and see how much simpler it is
that a product of 19 identical ones of the form
1 - exp[ - 0.5 * lambda].
So you can see how this simplifies matters, and with a likelihood
that now has a closed form, you can head to a the MLE directly (i.e., analytically).
This is because when you collect ALL the negative tests
over these 26 weeks, each one involving an exp[-lambda*dt], you will
end up as almost as many of them as there were weeks*persons.
In fact, it will be 26*2319 MINUS the ones that didn't need to be done because
at some points in those 26 weeks, some 19 tested positive and so stopped
being tested. Lets say the average no. of weeks before they tested positive
was 13 weeks, we end up with approx 26*2319 - 19*13 negative tests.
So their collective
contribution to the Likelihood is exp[ - {26*2319 - 19*13}*(1/52) * lambda].
And remember that each of the 19 positive tests
contributes a (approx.) lambda * exp [ - lambda * t], so there will be a
lambda to the 19th power as well as 19 exp[ - lambda * t] terms.
but L is now is quite amenable to taking its log, and setting its
derivative to zero and solving directly for lambda.
The likelihood contributions for the next 26 weeks have the same structure:
(if ) 14 of the form ??? and 2215*26 - 14*13 of the form
exp[-lambda*(1/52)].
The third period is 52 weeks long; we have 12 positive and
(approx) 52*968 - 12*26 negative tests, with their respective contributions.
Once you multiply all these together, you will notice that the ... in the
exp[- lambda * ...] term is still close (2300*(1/2) + 2215*(1/2) + 968*1 years.
But now, with the small dt, each 1 - exp[ - lambda * (1/52) ] is very close to what?
5.2
Rate of de novo mutations and the importance of father's age to disease risk
In past years, some have misunderstood the meaning of a constant
rate, and ignoring the amount of time during which this
rate was operation.
Think of the data from the the child as telling us how many
mutations the father had acquired over his however many years
he had lived up to the time the child was conceived.
Thus, if, for example, the rate was CONSTANT over age, say an average of
2.5 mutations per year lived, the expected number of
mutations acquired by the time the father reached age 30 is
2.5 x 30 = 75.
If the rate was not CONSTANT over age, but (to make a simple
rate-increasing-with-age example) was say
2.0 per year while the father was himself a child, i.e.,
say for the first 15 years, and say 3.0 per year for the next 10
years, and 4.0 per year for the next 5, the expected no. acquired by
age 30 would be 2.0 x 15 + 3.0 x 10 + 4.0 x 5 = 80. The 'linear-in-age'
model is smoother than that, so one can use an integral to compute the
expected no. by the age each father had his child. But you will see
that the form is quite suitable for a linear regression: remember
that in a linear regression with B0 + B1 x X1 + B2 x X2, the
LINEARITY is in the B's: X1 could be age, and X2
could be age-CUBED, and the model is still linear -- in the B's !!
A NON-LINEAR model is one where the expected Y (or function
of the expected Y) is say B0/(B1 x X1 + B2 x X2).
The 'beyond-linear-in-age' rate model will have a slightly
nastier integral for the expeceted value, so it may not be possible to put it into
a 'linear regression' form. In this case, one can maximize
the log-likelihood numerically with say optim.
In all cases, the Poisson model for the observed
counts continues to make sense. But remember that
whereas we make a statistical model for the expected
counts, that model involves the parameter(s) of the
mutation-RATE function, coupled with the age-range over which
the father was at risk of new mutations.
5.3
1. You will probably find it easier and
quicker to maximize each log-likelihood numerically.
However, the closed form estimator is instructive, since
it shows what happens if one made the testing more
frequent and thus the time when the
infection occurred more precise.
2319 initially HIV- men were tested at the 6-month,
or 0.5 year follow-up, and 19 of them were found to be HIV+,
and the remaining 2300 were found to be HIV-.
The probability of observing these are proportional to
S^(2300) times (1-S)^19, where S is the probability of
going a HALF year without becoming HIV+.
But, parametrically, S = exp[- lambda * 0.5] ,
where lambda is the infection rate .. i.e.
lambda = 0.02 means 2 infections per 100
man-years at risk.
So the log-likelihood is 19 log(1-S) + 2300 log(S),
or 19 log(1-exp[ * ] ) + 2300 log(exp[ * ]).
It is easier to see that
MLE for the parameter S is 2300/2319, and then that
the MLE for lambda itself is - 2 times log(2300/2319).
(the 2 is to have the rate
for a full year)
The expansion of log( 1 - x) about x=0 is
-x - x^2/2 - x^3/3 - etc....
So if we we expand log(2300/2319 = 1 - 19/2313), we
get 19/2313 + (19/2319)^2/2 + (19/2319)^3/3 + ...
If we ignore all but the first term we get
lambda.hat.1 = 2 times 19/2319. This is slightly TOO SMALL.
The numerator of 19 is OK, but the DENOMINATOR
of 2319 is TOO LARGE: there were 2319 HIV-
at the beginning of the 6 months, but fewer than that no.
HIV- at the end of the 6 months, so the plot of
No. of persons (y axis) versus time (x axis, up to t=0.5)
does not form a rectangular area of area 2319 x (0.5).
Instead it has 19 departures from the HIV- pool, at 19
(unknown) time locations along the way. The 19 amounts
of HIV+ time should be subtracted from the denominator.
If we ignore all but the first AND SECOND terms we get the
slightly bigger
lambda.hat.2 = 2 times { 19/2319 + (19/2319)^2/2 }.
This second one makes a lot of sense.
Write 19/2319 as 'L1', so that lambda.hat.2 = 2 times
{ L1 + (L1)^2/2 }
Re-arranging lambda.hat.2
so that the numerator is 19,
we have lambda.hat.2 = 2 times 19/???, and solving for
??? , we get lambda.hat.2 = 2 times 19/2309.539.
Think of the 2309.539 as 2319 - 9.461241,
or 2300 + 9.539. The latter is very close to
2300 + 19/2. And so we could think of the total
HIV- person time as 2300 'full' intervals
of 6 months each, and each of the 19 'converters'
as contributing about 1/2 an interval.
Further corrections would refine this further.
At the limit, the - 2 times log(2300/2319) as the
MLE of the infection rate
is 19/1154.743, just as if we had
0.5 years of HIV- time from each
of 2 x 1154.743 = 2309.487 men. It would take a few
more correction terms to reach this.
5.4
5.5
5.7