[updated March 2, 2010]

Assignment 4, 2010 .. questions 1 and 3

Sent: March 1, 2010 3:46 PM
To: James Hanley, Dr.
Subject: EPIB 634 Asst 4


We have a question in regards to Q3ii. We have attempted to calculate the person time for everolimus and the number of people in which the primary endpoint occurred. The number of events at 10 days is not given, is this supposed to be read off of figure 2a.

I wanted people to use the raw (patient-level) data on the website (sorry if that was not clear.. but then people did use these data for the earlier K-M assignment.

ID for interval 1-10 is events in that period divided by p-days in that interval

computer file has the events, so if you just sort the time you will see where tey are (or use fancier calculations)


p-t in days 11-360 is sum of all the days from 1-10 MINUS the days in the first interval)

days in first interval
= 10 for each patient who gets past day 10
=`day' for those who have event <= 10


# via R code

# add 1 to all days, so no failures at exactly t=0
# (most survival analysis programs cannot handle [ie will exclude!]
# events that coincide with entry)

ds$day = ds$day+ 1

# to track the time-slices for each person, give each patient an ID


# cutpoints to create FU intervals for use in survSplit function

cutpoints = c(0.5, 10.5) # for 2 intervals 0-10 and 10-360


split$pt.days = split$day- split$start.day



If so, is there a more efficient way then expanding the image and using graphics editor program to make an educated guess as to the number of events and total person time in this interval.

NOT USING IMAGE.. use the data I extracted..
(actually I extracted them from the image, but by a sneaky way I will explain on Thursday)

If so, the number at risk below figure 2a, for everolimus, the numbers given below (the number who survived) does not correspond with the number of primary end point events in the abstract at 12 months...

YES THE NO.s are S L I G H T L Y off.. but they key is to get the 2 numerators (for 1-10 and 11-2360 correct (being off by a few p-d for den is not a big deal)


Sent: March 1, 2010 12:05 PM
To: James Hanley, Dr.
Subject: Question regarding the assingment for Thursday

Professor Hanley,

In question 1, part i of the assignment, we wanted to see if we are on the right track:

In calculating the five-year risk, we used the fifth year incidence density (despite the fact that the ID changes between time intervals) and multiplied this number by five (5 years) to estimate # of events over five year period (mu = 0.025 events per year, over five years, we would have 0.125 events per 5 year period).

START AT T=0 and work way forward.
the ID is lower early on, since people at YOUNGER but why take worst rate and use it for 5 yrs?

To get S(t), we took the negative exponent of the above figure and got 0.88. Obviously, based on the equations provided, we calculated CI as 0.12, where we explained that there is 12% risk in individuals not treated with Rosuvastatin in the five year period. Is this a right approach?


Or should we have taken into consideration of two other ID given in the question (first two years, year 3 and 4)?


If so, how should we factor two other ID along with the fifth year ID into calculating an overall ID for the five year period (and ultimately calculating S(t))?

you are using 2.5/100PY all way thru.. see picture.. integral is a bit lower than that!
NOT SURE WHY YOU took approach you did when I said you should get the integral of the ID fn. It serves as the Expected no. of events if 1 person at risk every day for 5 years (can replace person with another who is out the same time distance from start of tx)

And for the question 1, part iii, we were hoping to answer this question by suggesting that the event rate in part i is based on the fifth year incidence density to predict the five year survival/risk whereas in part ii we estimated the incidence density based on the 12-month average (where in fact, on the 12th month, the actual incidence density of PI may be lower than 0.0975 due to the gain in intern experience). Basically, in part ii, we are assuming constant rate of incidence density (obviously averaged rates of ID over the 12 month period - some months will be higher, some months will be lower than the average) in predicting S(t) for the 12 months of experience. The reason for CI in part i being similar to ID is the fact that we are using fifth year ID to calculate S(t) for a five year period, where, in fact, the ID rates change over the five year period. In part ii, we take this into a consideration by using the averaged ID per month.

mu= 1.17 (12 mo) ; what is exp(-mu)

Are we on the right track?

NO...IF mu = the Integral of the ID (the exp. no events) is large what is exp(-mu)? = PoissonProb[0 events]?
try mu= 0.01, 0.1, 1, 2, 5 and calculate exp(-mu) for each and compare with mu
see what you learn..

Assignment 4, 2008


On 3/16/08 2:40 AM, "Student X.Y" wrote:

Dr.Hanley hello

I have frustrated myself trying to figure out how to do question 2i of the assignment: I can't make the command work for summing all 'dac' limited to year 71 AND also limited to ages 40-85. ....so that after having this sum and that for the Person years, I can calculate the crude all cause mortality rate..

will you please let me know how to do this? I know i can manually sum it up to come up with crude rates but I would really like to know the R code.

much appreciated

I thought I was bad quitting the computer at 2.30 am !!

The code is in the Resources opposite the assignment and the topic

One way is to extract the data fist, as is done there in the code


ds = females[

          (females$year==71 | females$year==02 ) &
          (females$agefrom >= 40 & females$agefrom <= 80 ) ,
                     ] ; 
ds[ds$year==71,"yrs.since71"] = 0 ;
ds[ds$year== 2,"yrs.since71"] = 31 ; ds

If really want to, can extract each year separately 

The code above (with the | for 'or' ) extracts both

To extract one, use e.g. 

ds71 = females[

          (females$year==71  ) &
          (females$agefrom >= 40 & females$agefrom <= 80 ) ,
                     ] ;

Now you can



You don't have to do it the way I did..

You can make a dadaset for 2002

Then make a new column with the average of the 2 population numbers for the direct standardization etc. and apply the observed rates to the common population numbers.


My first questions are regarding question 1.

1 )

When I compare the expected rate of my multiplicative model to my additive model for a specific category, let's say the exposed group in the 60-69 age category, the estimated rates are slightly different.

Is this to be expected or should they be the same?

In any regression situation (eg even in 1st part of 621, with continuous y's), if you fit two different models, the fitted values will be different eg

E[y= body surface area]  = height to some power P 


E [log body weight ] =  B0 + B1 x height


log { E [body weight ] } =  C0 + c1 x height

they are all different models, so expect different fitted coefficients and different fitted y values

Clayton and Hill write: "[...] but because the estimated values of the parameters in the additive model must be restricted so that they predict positive rates." Is the imposed restriction the cause of the observed difference in rates?

Sometimes, as they say, you get in trouble trying to fit an 'additive rate' model, but here you can actually fit one that gives positive rates in every cell.. so no, that is not the reason..

say you had data where
X = concentration of alcohol in a drink

y = 1 you can tell that the drink has alcohol , 0 if not 

you might well be able to fit a binomial model with identity link

ie prob[y=1] = B0 + B1 X

especially if the data were such that the observed percentage who could tell varied from say 30% when there was 2% alcohol and 60% when the concentration was 5%

you could also fit a logistic curve to this, but now the percentage can tell versus concentration curve is S shaped , not linear.. so it is a different fit

the difficulty with keeping the percentages within the 0-100 range is more of an issue if you have only a little data, and they are down near 0 or up near 100, so that there is a danger in the linear model that the line will go below 0 or above 100

2) When I compare the coefficients in the categorical model vs the continuous models, I find all coefficients in the categorical model to correspond, in some way, to the coefficients in the continuous models. However, the coefficients of the categorical model always end up with a slightly higher value than those in the continuous models.

Why is that?

They have very different meanings, and the magnitude in the case of the continuous version is scale-dependent.

the exponents of the beta_hats for the indicator variables for the categories refer to the rate ratios for the categories in question versus the ref category

the exponent of the single beta_hat for the continuous variable refers to the rate ratio for persons with a value of 1 versus persons with 0, or 3 vs 2, or 11 versus 10, or 36 versus 35, ie for persons with a value of x+1 versus those with x.

IF the rate ratios for the categories have a perfect log-linear trend

1 in ref (0) category
1.5 in next  so beta_hat_1 = 1og RR  = 0.405
2.25 in next so beta_hat_2 = log RR  = 0.81,
3.375 in next so log = 1.21,
and if the categories were say 10 year categories, then the beta_hat from a continuous age model should, when contrasting persons 10 years apart in age, be 0.405 ie the beta_hat contrasting persons 1 year apart in age would be 0.0405

(or course it would be smaller if you meassured age in months, or larger if in decades.

if of course you had say a U shaped relation, eg x = maternal age vs rate of low birthweight or some such, then if you used the youngest as ref category, your beta_hats for the age categories after that would first be negative, and then come back towards 0 The beta_hat for continuous age might well be zero since a straight line fitted to a U is a flat horizontal line

You will note that everything I am saying also applies to the regression where y is continuous as in 1st 1/2 of 621, ie substitute blood pressure for log rate and treat age as categorical versus continuous ... your question is bigger than a rate regression question.


In Q.2 part iii) Is the intercept of the following glm output a 'corner'?

glm(dac~as.factor(agefrom) + as.factor(yrs.since71),family=poisson,offset=log(pop),data=ds)
I think of 'corner' as just another way of saying intercept (it's a bit easier to explain to lay people)

And 'corner' or 'intercept' doesn't have to be at zero, ie the youngest age category 40-45, and the year 1971 can be used: all you are doing is establishing the reference categories, the ones represented by zeros

But you do have to think about what else is in the model. An intercept changes with where the x's are centered or start from :-

ideal weight = 100 lbs + 5 lbs for every 1" over 5' of height (I think of 100 as the intercept)


ideal weight = -200 lbs + 5 lbs for every 1" of height above the soles of your feet.


I have been working on question #3 of the assignment and was able to get parts i and ii, but I am having problems with part iii in terms of getting the "cornering" into my R code. I looked up the references by Clayton and Hills from question #2 but, still cannot piece it together.

Rate = Corner x Exposure (Gender) x Age does not make that much sense to me. I cannot flip it to how we have seen regression in 607 and 621.

I take it you have done/seen part ii of question number 1.. the code for that is on

[ We fitted this model in the last class... ]

part iii of this problem 3 is very similar

'corner' (in general) refers to the (rate, or log rate, or mean, or whatever, in the group to whom the beta_0 refers

eg ideal weight 

100 lb + 5 lbs for every inch above 5' + 10 extra lbs if male + (if male) 1 extra lb for every inch above 5' 

corner is the 100 lb for 5' females 

so in a model

   glm(cases~x1+x2 + ...,family=poisson,offset=log(pt))

you are fitting 

log{E[#cases]} = linear predictor           + offset 

                = B0 + B1.X1 + B2.X2 + ..    + log(pt)


E[#cases]} = exp[ B0 + B1.X1 + B2.X2 + ..    + log(pt) ]

           = { exp[ B0] x exp[B1.X1] x exp[B1.X1]  ....   x  pt

divide by pt

E[#cases/pt]} = exp[ B0] x exp[B1.X1] + exp[B1.X1]  .... 


Expected[rate] = corner rate x multiplier associated with X1 x multiplier associated with X2   ...  etc

above, I have worked back from what the R model is to the epi model is

In the summary 1/2 page I handed out on feb22, i worked forward from the epi model to the R model ..

some people missed the feb 22 class..

( the summary is also on the website as the Notes for feb 22)


A quick question regarding the R code that you gave us: What exactly is the ~1 doing in this model?

glm(cases~-1 + pt+ e.pt + ... , family=poisson(link="identity"))

I explained it in class, but you were absent.

In a regression with an intercept B0 and slope B1, you effectively have

B0 x 1 + B1 x X1 + ..

So the computer program sets up a matrix in which the first column is all 1's, the next column is the column of the X1 values, etc... technically then, the model is B0 x 1 + B1 x X1 + ..

If you wish to have a model with NO intercept, then ask the package to remove the column of 1's, and accompanying B0, then ask for the model without a columns of 1's

The way the programmers ask you to do this is to specify the model as -1 + X1 + X2 ...

AND, the reason you don't want an intercept (ie that you want the intercept to be zero) is that when there is zero person-time, there are zero cases.

Assignment 3


I would like to have some clarifications about Question 3 of Assignment. I understand the part where you ask us to rewrite the OR. However, you also asked us to explain why PT1 and PT0 do not appear in eqn 7-5, whereas their counterparts c and d do in eqn 7-6. And I guess you really referred to the equations of SE (and not eqn 8-5 and 8-6). I guess that because you can measure the person-time directly in a cohort study, this does not contribute to your "uncertainty" where as in a case-control study, you have to account for the fact that controls are used. Am I on the right track? How do you demonstrate that mathematically?

Yes, on re-reading it, I DO mean eqns 7-5 and 7-6. And the question has to do with why the variances looks somewhat similar, but the variance in 7-6 has 2 extra terms beyond what is in 7-5.

You write about one as having DIRECT measures of the PT denominators, and so are on the right track. I would be a bit more precise and say one has measures of the PT denominators that are NOT SUBJECT TO SAMPLING ERRORS. They may have other issues, such as computational and programming errors, but we have no way of quantifying these, and so assume that when the PT1 and PT0 are calculated as 1267.3 and 2510.6 person-years, that the true values are indeed 1267.3 and 2510.6, obtained by actually (correctly) DOCUMENTING EVERY person-moment.

By the way, the PT1 and PT0 do NOT have to come from a 'COHORT' study; they could have come from an open population where people (not necessarily even the same people each day) are switching back and forth from exposed to unexposed, as when driving on and off the phone, or on and off the extended shift. The intern-study is an example of an open population, rather than a cohort (no matter what the nejm says -- see my notes at first class..)
Population: An aggregate of people, defined by membership-defining... (a) event -> "cohort" ( closed population i.e. closed for exit) OR (b) state - for duration of state -> Open population (open for exit) / dynamic / turnover

Now imagine ESTIMATING the PT1:PT0 ratio without actually documenting every person-moment, e.g. by sitting on a bridge/side of road and watching who is driving on/off the phone. This would have other measurement problems, but even without them, we would have a sampling problem.. we can't watch every car everywhere everytime, and so we would have to sample, and so our PT's should have hats on them !!

for example, we might find that in 10 observations, 2 were on the phone.. the 2:8 split of PT is clearly imprecise, and it would be better to do 100 or 1000 or 10000! the more we do, the better out estimate of the PT1:PT0 ratio.

I like to think of the 1000 as 1000 person-moments.. a person moment being a particular person at a particular time.. of course there are an infinite number of such moments as we use smaller and smaller time intervals.

I like to think of the 1000 as a denominator series rather than a control series, and so my c=115 : d=885 (say) is an estimate of PT1:PT0. I don't think of the 115 as people, but as people at a particular moment. and so I prefer to think of the classic 2x2 table is the so-called 'case-control' study as really

'a' = c1 events in an estimated denominator PT1-hat = 'c' person-moments x some large blow-up factor

'b' = c0 events in an estimated denominator PT0-hat = 'd' person-moments x SAME large blow-up factor (which cancels)

so rate ratio estimate = (c1/PT1-hat) / (c0/PT0-hat) = ('a'/'c') / ('b'/'d') in old-fashioned epi.

so.. the long answer to a short question:

where as in a study that estimates the (relative) sizes of the denominators ***using the controls series in a case-control study*** you have to pay a price for using estimated denominators rather than known ones ***account for the fact that controls are used**.

How do you demonstrate that mathematically? It seems to me that if you estimated time on and off the phone using billing records, without having to sample, your so called 'c' and 'd' would be ALL of the person moments .. and the moments would be person-seconds if the companies bill in seconds or minutes or whatever!! but they would be HUGE and so 1/'c' + 1/'d' would be EFFECTIVELY 1/infinity + 1/infinity i.e., EFFECTIVELY zero! ie 1/a + 1/b + 1/c + 1/d = 1/a + 1/b.

Assignment 2

I am kind of lost as many others. It would be great if we can focus on one thing for a week, so that we can understand it. You might have to pick ONLY those few things which are important from an Epidemiology master's student point of view. We are not learning anything like this. I hope you understand.


I have a question about Q. 4 part ii. In the Wilson formula on page 132 of Rothman, do I use N=1084 and a=6, as there are 6 people with infection and 1084 w-y at risk?


I am a little confused about question 2. I arrived at the CI in part (i), namely by using a probability of 0.8 x 10(-6), and n of 3 million odd. However, since the probability is expressed as 0.8 per 100,000 persons (call it a unit), and there are 33 such units in the total population, could I not have used the binomial setting, pie= .8, n=33. I do, but the CI I get when I do this is much smaller. (approx, 0.66 to 0.94). Perhaps this is because the probability is close to the extremes of the probability range and the number of trials too small. Does that make any sense?

1 OBSERVED positive in a sample of 3 has a very different precision than 10 OBSERVED positives in a sample of 30. when calculating CIs, you have to use integer (real) persons; you cannot statistically shrink them or expand them. Once you are done with the proportion and its CI, you can re-express them any way you want.

I have tried to access material from your other courses on your webpage (such as EPIB 607), but the M-H password doesn't give me access. Can you help? Thanks

if you enter from my home page, you will see "For all but EPIB634, login as campus\DASusername & use your McGill pw "

Inside the link for Prevalence, hazard/(incidence)rate, risk/cumulative incidence, i am more explicit

Note: for links followed by # , login to (med) server using as your username campus\your_DAS_username , as your password your McGill email password

If you don't know you das_username, go to http://www.mcgill.ca/reggie/ -> View Account Information * DAS account For Username, type in McGill email address; type your McGill email password ===> Your DAS_username is shown as Account name.

Assignment 1


I have a quick question about the confidence interval on the IDR in question 4(ii), safe-pilots. Because it is a ratio, I do not have a 'count' to plug in the Poisson CI spread sheet. Is that correct. I am not able to calculate the confidence interval, hence.

1. IDR implies comparison of 2 IDs. so it's not a 1-sample problem, and so the Poisson spread sheet is not relevant. 2. As Q4(ii) asks, see Rothman ch 7 for CIs for IDRs (he calls them Incidence Ratios [IRs] -- he omits the word density)


I was reading the Miettinen paper. While I understood the last two columns of the table-1 after reading the original formulas from his paper, I do not understand why he multiplies by 5-years in the foot note to calculate 30-year risk. Please explain.

First, let's distinguish 2 cases (a) where 30-year risk will be of the order of a few percent, as with disease-free people developing (or dying from) a specific cancer or mode of death (Rothman2002 table 3.3 illustrates with motor-vehicle injury) and (b) a risk of 10% or even higher, eg if considering all causes from midlife onwards, or people with an existing or newly dx'ed serious disease such as colon or breast cancer, or chf, or AD.

For (b) we have no choice but to use the full eqn with the exponential of the integral or sum. For (a), which applies in the bladder cancer risk example, the ultimate risk will be small and so we can use the simplification : CI = integral. Rothman2002, (starting with "Because the overall risk ...") 1/3rd the way down p 38, goes the integral route, and skips the exponential, because 1 - exp[-0.016] is 0.0159, ie very close to 0.0160. BUt what he is doing with his sum of products is finding the area under the ID function.

Why not sketch the ID function? it goes in steps .. and so the area under each piece is ID x the width of the interval. the first interval is 15 years, so the area is (4.7/100000) x 15, etc etc..

Another way to see why one multiplies by time is to consider his equation 3-1 a few pages earlier.

He does warn ("Clearly, ..." p36) that one can't use eqn. 3-1 if the risk is high.. the example i gave in class, where the ID of death for people with pancreas cancer is 2 deaths per 1-patient year at risk. because the average survival is 6 mo. it means that if a doctor would only follow 1 pancreas patient at a time, and take on a new one as soon as that one died, etc etc, (s)he would on average lose 2 patients a year. The chance to go a whole year without a loss is therefore exp[-2] = 0.14 or 14% [the Poisson probability of 0 events when the expected number is 2], so the 1-year risk is 1 (or 100) minus this, i.e., 0.86 or 86%.

The other place where it is clear that one has to be careful is with the needle stick injuries. If the ID is 0.09injuries/Intern-Month, we cannot say that the 12 month risk is 1.08. What the 1.08 means is that if we had started with one intern, and as soon as (s)he got injured, replaced him/her by another, etc, so that we had one intern-year of continuous coverage, we would have an average of 1.08 injuries per year. Now use the Poisson formula with mu=1.08 and work out the prob of no (zero) injuries in an intern-year's worth of continuous "up-time" ie. experience .. ie

      Prob[0 injuries] = Prob[the first intern still with you, ininjured] = exp[-1.08] ... the 1.08 is 0.09 x 12 or ID x timespan.

if the ID changes over the span, divide up the span and add the separate products to get mu the expected no. of injuries, then take exp[- mu] to get the prob that 1 intern would remain injury-free, and 1 minus that is the 12 month risk.

I do not understand what question 5i) is asking. Where are the 30-year risk of bladder cancer, calculated by Miettinen, in the article? Are you refering to the exposure-specific incidence density?

It's on page 23 of the "31 examples" handout from the first class. "Turning to the assessment of risk, consider the 30-year risk of bladder cancer for a 50-year-old man, assuming that without bladder cancer he would survive that period. If the man is a smoker, then the estimate is...

The calculations are done separately for the smokers and non-smokers. ie exposure-specific Risks.

PS: I have excerpted below from section 4 of his article (which I have put online in the Resources for Prevalence, hazard/(incidence)rate, risk/cumulative incidence) and added a comment that may link this "mathematical CI" to other mathematical quantities calculated from (effectively) cross-sectional data.
4.1. The parameters. Even though the source population of subjects tends to constitute a (dynamically) static group in each category of age (figure 1), with new people continually entering it at the lower bound (and within the range) of age and others exiting it (within the range and) at the upper bound, the data from successive age categories may be used to make inferences about an aging cohort of fixed membership and homogeneous, though continually increasing, age. With regard to such an age-cohort, the interest is, firstly in the cumulative incidence of the group as it passes from one age to another, and, secondly, in the corresponding risks of its individual members.
JH COMMENT: This is similar to how people calculate life expectancy (and other measures) from cross-sectional data. After all, Miettinen was using IDs from a narrow 18-month calendar time window near 1970, and any projections about what is in store over the next 30 years for men 50 years olds in 1970 is a projection at best ... just as StatsCan used period ("current") death rates from an open population) to project what would happen to a fictitious cohort if they experienced these age-specific IDs over their lifetime. Miettinen makes a distiction between CI and Risk..
The cumulative incidence-rate for a span of age is the proportion of the group developing the illness in that period, while the risk for an individual is the probability of his developing the illness in the particular span of age.
When jh hears that the CI of prostate cancer over the age span 60 to 90 is x%, he immediately takes that as his own probability as well.. i.e., he doesnt make the 'group versus individual' distinction that Miettinen does.

I understand why the multiplier time is there in calculating cumulative risk. However, I do not understand why Miettinen multiplies it by 5 years to calculate 30-year risk and does not multiply by 30 years?

if you look at Rothman2002 p38 example, you will see that the time multipliers add to 85 -- the time span covered.

if you look at Miettinen's, he had 34/10^5 multiplied by 5 years, then 54/10^5 by 5 years etc until 260 multiplied by 5 years, those 6 multiplications, each involving 5 years, covers the 30 year span.

Rothman could have saved some calculator strokes by first adding the least 3 IDs and then multiplying their sum by 20 years .. ie Ax15 + Bx10 + Cx20 + Dx20 + Ex20 = Ax15 + Bx10 + (C + D + E)x20

Since Miettenen was probably also use a hand calculator, and all time bands were the same width, it was easier to add the 6 IDs and then multiply their sum by 5 .. either way, Miettinen has covered 30 years.