Assignment 4, 2010 .. questions 1 and 3

To: James Hanley, Dr.

Subject: EPIB 634 Asst 4

Hey,

We have a question regarding Q3ii. We have attempted to calculate the person-time for everolimus and the number of people in whom the primary endpoint occurred. The number of events at 10 days is not given; is this supposed to be read off of Figure 2a?

I wanted people to use the raw (patient-level) data on the website (sorry if that was not clear .. but then people did use these data for the earlier K-M assignment).

The ID for interval 1-10 is the number of events in that period divided by the person-days in that interval.

The computer file has the events, so if you just sort on the time you will see where they are (or use fancier calculations).

ROUGHLY

p-t in days 11-360 is the sum of all the follow-up days MINUS the days in the first interval

days in first interval

= 10 for each patient who gets past day 10

=`day' for those who have event <= 10
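The rough bookkeeping above can be sketched with made-up numbers (the follow-up times below are hypothetical, not the everolimus trial data):

```python
# Hypothetical follow-up data: day of event/censoring, and event indicator
# (illustrative numbers only -- not the trial data)
days  = [3, 10, 45, 200, 360, 360]
event = [1, 1, 1, 0, 0, 1]

# days in the first interval: 10 for each patient who gets past day 10,
# 'day' for those whose follow-up ends at day <= 10
pt_first = sum(min(d, 10) for d in days)
# p-t in days 11-360: all the follow-up days MINUS the days in interval 1
pt_rest = sum(days) - pt_first

# ID for interval 1-10: events in that period / person-days in that interval
events_first = sum(e for d, e in zip(days, event) if d <= 10)
id_first = events_first / pt_first
print(pt_first, pt_rest, events_first)  # 53 925 2
```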

FANCIER

# via R code

library(survival)

# add 1 to all days, so no failures at exactly t=0

# (most survival analysis programs cannot handle [ie will exclude!]

# events that coincide with entry)

ds$day = ds$day + 1

# to track the time-slices for each person, give each patient an ID

ds$ID = 1:length(ds$day)

# cutpoints to create FU intervals for use in the survSplit function

cutpoints = c(0.5, 10.5) # for 2 intervals: days 1-10 and days 11-360

split = survSplit(ds, cut=cutpoints, end="day", event="event", start="start.day")

split$pt.days = split$day - split$start.day

head(split)

tail(split)

(actually I extracted them from the image, by a sneaky way that I will explain on Thursday)

To: James Hanley, Dr.

Subject: Question regarding the assignment for Thursday

Professor Hanley,

In question 1, part i of the assignment, we wanted to see if we are on the right track:

In calculating the five-year risk, we used the fifth-year incidence density (despite the fact that the ID changes between time intervals) and multiplied this number by five (5 years) to estimate the # of events over the five-year period (mu = 0.025 events per year; over five years, we would have 0.125 events per 5-year period).

the ID is lower early on, since people are YOUNGER .. but why take the worst rate and use it for 5 yrs?

NOT SURE WHY YOU took the approach you did when I said you should get the integral of the ID function. It serves as the expected no. of events if 1 person is at risk every day for 5 years (one can replace the person with another who is out the same time-distance from the start of tx)

mu = 1.17 (12 mo); what is exp(-mu)?

try mu= 0.01, 0.1, 1, 2, 5 and calculate exp(-mu) for each and compare with mu

see what you learn..
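A quick way to do the suggested exercise (the mu values are the ones listed above, plus the 1.17 from the question):

```python
from math import exp

# Risk = 1 - exp(-mu), where mu is the expected number of events
for mu in [0.01, 0.1, 1, 1.17, 2, 5]:
    print(f"mu = {mu:5}: exp(-mu) = {exp(-mu):.4f}, 1 - exp(-mu) = {1 - exp(-mu):.4f}")
# for small mu, 1 - exp(-mu) is close to mu itself;
# as mu grows, the risk levels off toward 1
```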

Assignment 4, 2008

Dr.Hanley hello

I have frustrated myself trying to figure out how to do question 2i of the assignment: I can't make the command work for summing all 'dac' limited to year 71 AND also limited to ages 40-85 .... so that after having this sum, and that for the person-years, I can calculate the crude all-cause mortality rate.

Will you please let me know how to do this? I know I can manually sum it up to come up with crude rates, but I would really like to know the R code.

much appreciated

The code is in the Resources opposite the assignment and the topic

http://132.206.244.30/~jameshanley/c634/rates/q_fm_mortality_R.txt

One way is to extract the data first, as is done there in the code:

females = read.table("q_f_mortality_data.txt", header=TRUE)
ds = females[ (females$year==71 | females$year==02) & (females$agefrom >= 40 & females$agefrom <= 80), c(2,3,4,7,8) ]
ds[ds$year==71, "yrs.since71"] = 0
ds[ds$year== 2, "yrs.since71"] = 31
ds

If you really want to, you can extract each year separately. The code above (with the | for 'or') extracts both. To extract just one, use e.g.

ds71 = females[ (females$year==71) & (females$agefrom >= 40 & females$agefrom <= 80), c(2,3,4,7,8) ]

Now you can sum(ds71$dac), etc. You don't have to do it the way I did.

You can make a dataset for 2002.

Then make a new column with the average of the 2 population numbers for the direct standardization, etc., and apply the observed rates to the common population numbers.

1 )

When I compare the expected rate of my multiplicative model to my additive model for a specific category, let's say the exposed group in the 60-69 age category, the estimated rates are slightly different.

Is this to be expected or should they be the same?

E[y = body surface area] = height to some power P

E[log body weight] = B0 + B1 x height

log{ E[body weight] } = C0 + C1 x height

They are all different models, so expect different fitted coefficients and different fitted y values.

say you had data where

X = concentration of alcohol in a drink, and y = 1 if you can tell that the drink has alcohol, 0 if not.

You might well be able to fit a binomial model with identity link, ie prob[y=1] = B0 + B1 x X, especially if the data were such that the observed percentage who could tell varied from, say, 30% when there was 2% alcohol to 60% when the concentration was 5%.

You could also fit a logistic curve to this, but now the 'percentage who can tell' versus concentration curve is S-shaped, not linear .. so it is a different fit.

The difficulty with keeping the percentages within the 0-100 range is more of an issue if you have only a little data, and the percentages are down near 0 or up near 100, so that there is a danger in the linear model that the line will go below 0 or above 100.
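A toy comparison of the two link functions; all the coefficients below are invented purely for illustration, not fitted values:

```python
from math import exp

# invented coefficients for the two models of prob(can tell) vs concentration X
b0_lin, b1_lin = 0.10, 0.10      # identity link: p = B0 + B1 x X
b0_log, b1_log = -1.5, 0.45      # logit link:    p = 1/(1 + exp(-(B0 + B1 x X)))

for x in [0, 2, 5, 12]:
    p_lin = b0_lin + b1_lin * x
    p_log = 1 / (1 + exp(-(b0_log + b1_log * x)))
    print(x, round(p_lin, 3), round(p_log, 3))
# at X = 12 the linear model gives 1.3 -- outside [0, 1] --
# while the logistic curve stays within range
```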

Why is that?

the exponents of the beta_hats for the indicator variables for the categories refer to the rate ratios for the categories in question versus the ref category

the exponent of the single beta_hat for the continuous variable refers to the rate ratio for persons with a value of 1 versus persons with 0, or 3 vs 2, or 11 versus 10, or 36 versus 35, ie for persons with a value of x+1 versus those with x.

IF the rate ratios for the categories have a perfect log-linear trend

eg

1 in the ref (0) category
1.5 in the next, so beta_hat_1 = log RR = 0.405
2.25 in the next, so beta_hat_2 = log RR = 0.81
3.375 in the next, so beta_hat_3 = log RR = 1.22, etc.

and if the categories were, say, 10-year categories, then the beta_hat from a continuous-age model should, when contrasting persons 10 years apart in age, be 0.405, ie the beta_hat contrasting persons 1 year apart in age would be 0.0405

(Of course it would be smaller if you measured age in months, or larger if in decades.)
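The arithmetic in the example can be checked directly (rate ratios under a perfect 1.5-fold log-linear trend across 10-year categories):

```python
from math import log

# rate ratios with a perfect log-linear trend across 10-year age categories
rr = [1, 1.5, 2.25, 3.375]            # each category is 1.5 x the previous
betas = [log(r) for r in rr]          # 0, 0.405, 0.81, 1.22
per_category = betas[1] - betas[0]    # log RR per category, ie per 10 years
per_year = per_category / 10          # ~0.0405 per single year of age
print(round(per_category, 4), round(per_year, 5))  # 0.4055 0.04055
```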

If of course you had, say, a U-shaped relation, eg x = maternal age vs rate of low birthweight or some such, then if you used the youngest as the ref category, your beta_hats for the age categories after that would first be negative, and then come back towards 0. The beta_hat for continuous age might well be zero, since a straight line fitted to a U is a flat horizontal line.

You will note that everything I am saying also applies to the regression where y is continuous as in 1st 1/2 of 621, ie substitute blood pressure for log rate and treat age as categorical versus continuous ... your question is bigger than a rate regression question.

glm(dac ~ as.factor(agefrom) + as.factor(yrs.since71), family=poisson, offset=log(pop), data=ds)

I think of 'corner' as just another way of saying intercept (it's a bit easier to explain to lay people).

And the 'corner' or 'intercept' doesn't have to be at zero: ie the youngest age category, 40-45, and the year 1971 can be used. All you are doing is establishing the reference categories, the ones represented by zeros.

But you do have to think about what else is in the model. An intercept changes with where the x's are centered or start from:

ideal weight = 100 lbs + 5 lbs for every 1" over 5' of height (I think of 100 as the intercept)

or

ideal weight = -200 lbs + 5 lbs for every 1" of height above the soles of your feet.

Rate = Corner x Exposure (Gender) x Age does not make that much sense to me. I cannot flip it to how we have seen regression in 607 and 621.

http://132.206.244.30/~jameshanley/c634/rates/ClaytonHills-example22-6-Rcode.txt

[ We fitted this model in the last class... ]

part iii of this problem 3 is very similar

'corner' (in general) refers to the rate, or log rate, or mean, or whatever, in the group to whom the beta_0 refers

eg ideal weight = 100 lb + 5 lbs for every inch above 5' + 10 extra lbs if male + (if male) 1 extra lb for every inch above 5'; the corner is the 100 lb for 5' females.

So in a model glm(cases ~ x1 + x2 + ..., family=poisson, offset=log(pt)) you are fitting

log{ E[#cases] } = linear predictor + offset = B0 + B1.X1 + B2.X2 + .. + log(pt)

Back-translate:

E[#cases] = exp[ B0 + B1.X1 + B2.X2 + .. + log(pt) ] = exp[B0] x exp[B1.X1] x exp[B2.X2] x ... x pt

Divide by pt:

E[#cases/pt] = exp[B0] x exp[B1.X1] x exp[B2.X2] x ...

ie Expected[rate] = corner rate x multiplier associated with X1 x multiplier associated with X2 ... etc.

Above, I have worked back from what the R model is to what the epi model is.
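A numerical check of that back-translation, using made-up coefficients (b0, b1, b2 below are hypothetical, not fitted values):

```python
from math import exp, log

# hypothetical coefficients: corner (log rate) and two log rate ratios
b0, b1, b2 = -5.0, 0.7, 0.3
x1, x2, pt = 1, 1, 1000.0            # a covariate pattern and its person-time

# E[#cases] = exp[B0 + B1.X1 + B2.X2 + log(pt)]
expected_cases = exp(b0 + b1*x1 + b2*x2 + log(pt))
# ... which equals corner rate x multiplier(X1) x multiplier(X2) x pt
expected_cases_2 = exp(b0) * exp(b1*x1) * exp(b2*x2) * pt
assert abs(expected_cases - expected_cases_2) < 1e-9
print(expected_cases / pt)           # the expected rate for this pattern
```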

In the summary 1/2-page I handed out on Feb 22, I worked forward from the epi model to the R model.

some people missed the feb 22 class..

( the summary is also on the website as the Notes for feb 22)

glm(cases ~ -1 + pt + e.pt + ..., family=poisson(link="identity"))

In a regression with an intercept B0 and slope B1, you effectively have

B0 x 1 + B1 x X1 + ..

So the computer program sets up a matrix in which the first column is all 1's, the next column is the column of the X1 values, etc... technically then, the model is B0 x 1 + B1 x X1 + ..

If you wish to have a model with NO intercept, then ask the package to remove the column of 1's and the accompanying B0, ie ask for the model without the column of 1's.

The way the programmers ask you to do this is to specify the model as -1 + X1 + X2 ...

AND, the reason you don't want an intercept (ie that you want the intercept to be zero) is that when there is zero person-time, there are zero cases.
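A sketch of what the '-1' does to the model matrix, using ordinary least squares for illustration (the numbers are made up; the Poisson machinery applies the same idea):

```python
import numpy as np

# hypothetical data in which cases are proportional to person-time,
# so zero person-time should mean zero cases (intercept forced to 0)
pt = np.array([100.0, 200.0, 300.0, 400.0])
cases = np.array([2.0, 4.1, 5.9, 8.0])

# WITH intercept: first column is all 1's (the B0 x 1 term)
X_with = np.column_stack([np.ones_like(pt), pt])
# WITHOUT intercept (the '-1' in the R formula): drop the column of 1's
X_without = pt.reshape(-1, 1)

b_with, *_ = np.linalg.lstsq(X_with, cases, rcond=None)
b_without, *_ = np.linalg.lstsq(X_without, cases, rcond=None)
print(b_with)     # [intercept, rate]
print(b_without)  # [rate] only -- the line is forced through the origin
```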

Assignment 3

Yes, on re-reading it, I DO mean eqns 7-5 and 7-6. And the question has to do with why the variances look somewhat similar, but the variance in 7-6 has 2 extra terms beyond what is in 7-5.

You write about one as having DIRECT measures of the PT denominators, and so are on the right track. I would be a bit more precise and say one has measures of the PT denominators that are NOT SUBJECT TO SAMPLING ERRORS. They may have other issues, such as computational and programming errors, but we have no way of quantifying these, and so assume that when the PT1 and PT0 are calculated as 1267.3 and 2510.6 person-years, that the true values are indeed 1267.3 and 2510.6, obtained by actually (correctly) DOCUMENTING EVERY person-moment.

Population: an aggregate of people, defined by a membership-defining...
(a) event -> "cohort" (closed population, ie closed for exit), OR
(b) state, for the duration of that state -> open population (open for exit) / dynamic / turnover

Now imagine ESTIMATING the PT1:PT0 ratio without actually documenting every person-moment, e.g. by sitting on a bridge/side of the road and watching who is driving on/off the phone. This would have other measurement problems, but even without them, we would have a sampling problem .. we can't watch every car everywhere every time, and so we would have to sample, and so our PTs should have hats on them!!

For example, we might find that in 10 observations, 2 were on the phone .. the 2:8 split of PT is clearly imprecise, and it would be better to do 100 or 1000 or 10000! The more we do, the better our estimate of the PT1:PT0 ratio.

I like to think of the 1000 as 1000 person-moments .. a person-moment being a particular person at a particular time .. of course there are an infinite number of such moments as we use smaller and smaller time intervals.

I like to think of the 1000 as a denominator series rather than a control series, and so my c=115 : d=885 (say) is an estimate of PT1:PT0. I don't think of the 115 as people, but as people at particular moments, and so I prefer to think of the classic 2x2 table in the so-called 'case-control' study as really

'a' = c1 events in an estimated denominator PT1-hat = 'c' person-moments x some large blow-up factor

'b' = c0 events in an estimated denominator PT0-hat = 'd' person-moments x SAME large blow-up factor (which cancels)

so rate ratio estimate = (c1/PT1-hat) / (c0/PT0-hat) = ('a'/'c') / ('b'/'d') in old-fashioned epi.
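In numbers: here 'a', 'b' and the blow-up factor are invented for illustration, while c = 115 and d = 885 are the person-moment counts above.

```python
# 'a','b' = exposed/unexposed cases; 'c','d' = the denominator series
a, b = 30, 60                 # hypothetical case counts
c, d = 115, 885               # person-moments sampled, as above

blow_up = 5000.0              # unknown factor linking the sample to true PT
pt1_hat = c * blow_up         # estimated exposed person-time
pt0_hat = d * blow_up         # estimated unexposed person-time

rate_ratio = (a / pt1_hat) / (b / pt0_hat)
# the blow-up factor cancels, leaving the old-fashioned cross-ratio
assert abs(rate_ratio - (a / c) / (b / d)) < 1e-12
print(round(rate_ratio, 3))   # 3.848
```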

whereas in a study that estimates the (relative) sizes of the denominators using the control series in a case-control study, you have to pay a price for using estimated denominators rather than known ones, ie account for the fact that controls are used.

Assignment 2

Inside the link for Prevalence, hazard/(incidence) rate, risk/cumulative incidence, I am more explicit.

Note: for links followed by # , login to (med) server using as your username campus\your_DAS_username , as your password your McGill email password

If you don't know your DAS_username, go to http://www.mcgill.ca/reggie/ -> View Account Information -> DAS account. For Username, type in your McGill email address; type your McGill email password. ===> Your DAS_username is shown as Account name.

Assignment 1

For (b) we have no choice but to use the full eqn with the exponential of the integral or sum. For (a), which applies in the bladder cancer risk example, the ultimate risk will be small and so we can use the simplification: CI = integral. Rothman 2002 (starting with "Because the overall risk ...", 1/3 of the way down p 38) goes the integral route and skips the exponential, because 1 - exp[-0.016] is 0.0159, ie very close to 0.0160. But what he is doing with his sum of products is finding the area under the ID function.

Why not sketch the ID function? It goes in steps .. and so the area under each piece is ID x the width of the interval. The first interval is 15 years, so the area is (4.7/100000) x 15, etc etc.

Another way to see why one multiplies by time is to consider his equation 3-1 a few pages earlier.

He does warn ("Clearly, ..." p 36) that one can't use eqn. 3-1 if the risk is high .. the example I gave in class, where the ID of death for people with pancreas cancer is 2 deaths per patient-year at risk, because the average survival is 6 mo. It means that if a doctor would follow only 1 pancreas patient at a time, and take on a new one as soon as that one died, etc etc, (s)he would on average lose 2 patients a year. The chance to go a whole year without a loss is therefore exp[-2] = 0.14 or 14% [the Poisson probability of 0 events when the expected number is 2], so the 1-year risk is 1 (or 100%) minus this, ie 0.86 or 86%.

The other place where it is clear that one has to be careful is with the needle-stick injuries. If the ID is 0.09 injuries/intern-month, we cannot say that the 12-month risk is 1.08. What the 1.08 means is that if we had started with one intern, and as soon as (s)he got injured, replaced him/her by another, etc, so that we had one intern-year of continuous coverage, we would have an average of 1.08 injuries per year. Now use the Poisson formula with mu = 1.08 and work out the prob of no (zero) injuries in an intern-year's worth of continuous "up-time", ie experience .. ie

Prob[0 injuries] = Prob[the first intern is still with you, uninjured] = exp[-1.08] ... the 1.08 is 0.09 x 12, or ID x timespan.

If the ID changes over the span, divide up the span and add the separate products to get mu, the expected no. of injuries; then take exp[-mu] to get the prob that 1 intern would remain injury-free, and 1 minus that is the 12-month risk.
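Both worked examples, in code form (the IDs are the ones given in the text):

```python
from math import exp

# pancreas example: ID = 2 deaths per patient-year at risk
mu_pancreas = 2.0
p_survive_year = exp(-mu_pancreas)        # Poisson prob of 0 events
risk_1yr = 1 - p_survive_year
print(round(p_survive_year, 2), round(risk_1yr, 2))  # 0.14 0.86

# needle-stick example: ID = 0.09 injuries per intern-month
mu_needle = 0.09 * 12                     # expected injuries in 12 months = 1.08
risk_12mo = 1 - exp(-mu_needle)           # the risk is NOT 1.08
print(round(risk_12mo, 2))                # 0.66
```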

The calculations are done separately for the smokers and non-smokers, ie exposure-specific risks.

------------------------

PS: I have excerpted below from section 4 of his article (which I have put online in the Resources for Prevalence, hazard/(incidence)rate, risk/cumulative incidence) and added a comment that may link this "mathematical CI" to other mathematical quantities calculated from (effectively) cross-sectional data.

4.1. The parameters. Even though the source population of subjects tends to constitute a (dynamically) static group in each category of age (figure 1), with new people continually entering it at the lower bound (and within the range) of age and others exiting it (within the range and) at the upper bound, the data from successive age categories may be used to make inferences about an aging cohort of fixed membership and homogeneous, though continually increasing, age. With regard to such an age-cohort, the interest is, firstly, in the cumulative incidence of the group as it passes from one age to another, and, secondly, in the corresponding risks of its individual members.

JH COMMENT: This is similar to how people calculate life expectancy (and other measures) from cross-sectional data. After all, Miettinen was using IDs from a narrow 18-month calendar-time window near 1970, and any projection about what is in store over the next 30 years for men 50 years old in 1970 is a projection at best ... just as StatsCan used period ("current") death rates (from an open population) to project what would happen to a fictitious cohort if they experienced these age-specific IDs over their lifetime. Miettinen makes a distinction between CI and Risk:

The cumulative incidence-rate for a span of age is the proportion of the group developing the illness in that period, while the risk for an individual is the probability of his developing the illness in the particular span of age.

When JH hears that the CI of prostate cancer over the age span 60 to 90 is x%, he immediately takes that as his own probability as well .. ie he doesn't make the 'group versus individual' distinction that Miettinen does.

If you look at Miettinen's calculation, he had 34/10^5 multiplied by 5 years, then 54/10^5 by 5 years, etc, until 260/10^5 multiplied by 5 years; those 6 multiplications, each involving 5 years, cover the 30-year span.

Rothman could have saved some calculator strokes by first adding the last 3 IDs and then multiplying their sum by 20 years .. ie Ax15 + Bx10 + Cx20 + Dx20 + Ex20 = Ax15 + Bx10 + (C + D + E)x20

Since Miettinen was probably also using a hand calculator, and all the time bands were the same width, it was easier to add the 6 IDs and then multiply their sum by 5 .. either way, Miettinen has covered the 30 years.
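The grouping shortcut is just the distributive law; with A = 4.7/100000 taken from the text and the other IDs invented as placeholders:

```python
# IDs (per year); only A is from the text, B..E are made-up placeholders
A, B, C, D, E = 4.7e-5, 6.0e-5, 9.0e-5, 12.0e-5, 15.0e-5

term_by_term = A*15 + B*10 + C*20 + D*20 + E*20
grouped = A*15 + B*10 + (C + D + E)*20   # equal widths: add the IDs first
assert abs(term_by_term - grouped) < 1e-15
print(term_by_term)   # cumulative incidence ~ area under the ID step function
```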