McGill University, Department of Epidemiology
and Biostatistics
EPIB681: Data Analysis II (Winter 2004)
Frequently Asked Questions (FAQs)
Question / Suggestion
RESPONSE 

Walker (pg 109, Observation and Inference) calculates the RR for a Poisson distribution
by a different formula than the one seen in JH's notes.
He defines Var(lnRR) = 1/(E+ cases) + 1/(E- cases); however, I don't find the same
answers when I apply this formula to the examples from your notes
on the right-hand column of p 29 of the Poisson notes, i give the same formula .... but
with just different notation...
(I say it is like "Woolf" with the denominators so large that 1/denom
is zero )
var[ln RR] = 1/c1 + 1/c0
c = # cases
subscript 0 = E- ; subscript 1 = E+
Is p29 where you were looking? or in some other notes of mine?
would you accept this approach?
sure .. since I think we are both taking the same approach
also, presumably, when you say RR in a poisson distribution
you refer to TWO poisson variables, one for exposed, one not 
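the arithmetic is simple enough to check in a few lines of Python (made-up counts and person-time, not the numbers from the notes):

```python
import math

# Made-up counts: c1 events among exposed (E+) over PT1 person-years,
# c0 events among unexposed (E-) over PT0 person-years.
c1, PT1 = 30, 1000.0
c0, PT0 = 15, 1200.0

rr = (c1 / PT1) / (c0 / PT0)      # rate ratio of the two Poisson rates
var_ln_rr = 1 / c1 + 1 / c0       # Walker's / Woolf-style Var(ln RR)
se = math.sqrt(var_ln_rr)
lo = rr * math.exp(-1.96 * se)    # 95% limits set on the log scale,
hi = rr * math.exp(+1.96 * se)    # then back-transformed
print(round(rr, 3), round(lo, 3), round(hi, 3))
```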
Also, here, when I convert the logit to proportion for year_87 (beta=0.0032), I
get pi=0.5.
1. a beta is not a logit.. it is a DIFFERENCE in logits.. !!
the only beta that is a logit is the beta_0
so exp[beta] is the increase/decrease in odds
How can the probability of injury increase by 0.5 for every unit of time (1 year)
when pi has to be
between 0 and 1?
exp[0.0032] is what is relevant.. and it is applied to (acts on) the odds...
exp[0.0032]/(1 + exp[0.0032]) doesn't make sense here
exp[-4.8089]/(1 + exp[-4.8089]) does ..
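a small Python sketch of the distinction. (I am reading the intercept 4.8089 quoted above as -4.8089, so that the baseline risk comes out small; treat that sign as my assumption.)

```python
import math

beta0 = -4.8089     # the intercept IS a logit (negative sign assumed here)
beta_year = 0.0032  # a slope is a DIFFERENCE in logits

p0 = math.exp(beta0) / (1 + math.exp(beta0))  # baseline probability: small
or_per_year = math.exp(beta_year)             # acts on the ODDS, not on pi

odds1 = math.exp(beta0 + beta_year)  # odds one year later
p1 = odds1 / (1 + odds1)             # still a legal probability, barely changed
print(round(p0, 5), round(or_per_year, 5), round(p1, 5))
```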
2) Question 4, part I. Do we need to use proportions from this example
no.. you need to head higher !! they don't diverge enough at the proportions
we are studying, do they?
(by the way.. am treating injury rates as 'proportions' here for now ...... will
eventually move on to regression for rates.. but not now.. )
or can we compare the log and logit of fictitious proportions (i.e. ranging from
say 0.001 to 0.1)?
yes.. am curious where (without calculations) you think it will be ?
likewise, at what point do you consider we are in the 'rare disease' territory? [somewhat
related... ]
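for the curious, a few lines of Python comparing log(p) with logit(p) over a range of p (the gap between the two scales is exactly -log(1-p), which is roughly p when p is small):

```python
import math

# logit(p) - log(p) = -log(1 - p), which is ~p for small ('rare disease') p
rows = []
for p in (0.001, 0.01, 0.05, 0.1, 0.3, 0.5):
    log_p = math.log(p)
    logit_p = math.log(p / (1 - p))
    rows.append((p, round(log_p, 4), round(logit_p, 4), round(logit_p - log_p, 4)))

for row in rows:
    print(*row)  # p, log, logit, gap -- the two scales agree for rare p
```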

would you mind telling me the importance and the meaning of performing a chisquare
test on the strata specific ln[OR] ?
i wouldn't mind at all...
are you talking about the last section in the chapter, where the emphasis is on a few
2 x 2 tables..
are they talking about a test for homogeneity of ORs ?
it would be like asking if the effect is same across strata ? ie a test of effect
modification..
ie are odds ratios same for different folks?
H&L presents that as a test for homogeneity but does not explain the consequences
of it.
if not homogeneous, and if very different then answer re effect is 'it depends'
ie not a 'one OR for all' story..
and so need to report or's separately for each level of modifier.
Does this mean that I can not use a pooled unconfounded OR?
if ORs are very different (beyond random variation) what would be the purpose
of an 'average' OR?
a bit like saying the average person has one testicle and one ovary!
don't get into confounding.. since can't say JUST from the test of homogeneity whether there
is any or not (see next q)
Does this mean that there is confounding??
If i tell you that the or in males is 2.0 and in females 2.0 can you tell just from
that whether there is confounding?
NO .. all depends on how exposure and gender are distributed in the sample
AND on whether gender is even a risk factor in 1st place.. 
when comparing univariate with multiple regression models, H&L states that "in
MR there will be p+1 likelihood EQUATIONS"... what does it mean by that? I thought
the equation was the same, just with more coefficients included....
they are talking about ESTIMATING equations (the ones to find the ML estimates..)
remember we tried to manually (by trial & error) maximize a likelihood
another way would be to find the tangent (derivative) and solve to see where it is
zero
that is in "1D" because in my very simple example, just 1 parameter
same principle in "p+1"D if maximizing over p+1 parameters
for example, where is the max on p 6 of my 2D likelihood in the Intro to LR notes ..
need to watch tangents in the north-south (B1) and east-west (B0) directions
by the way, how many estimating equations do you think are involved in fitting the
LS (least sq) estimates in c621.... same #: p+1 !!!!
what is the number of degrees of freedom I should use in the log likelihood ratio?
Is it k-1? (k would mean all the coeff, including the intercept)
what exactly is the likelihood RATIO ?
ratio of WHAT to WHAT ??
if you look at the WHAT and WHAT, the no. of extra parameters will be obvious
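to make the WHAT-to-WHAT concrete, here is a sketch (in Python, with the 18/96 vs 6/104 counts that appear later in these FAQs) of a likelihood ratio test comparing a 1-parameter model (one common proportion) with a 2-parameter model (one proportion per group); the df is the number of extra parameters:

```python
import math

def ll_kernel(events, n, p):
    # binomial log-likelihood kernel (constant term omitted; it cancels in the ratio)
    return events * math.log(p) + (n - events) * math.log(1 - p)

e1, n1 = 18, 96    # group 1
e0, n0 = 6, 104    # group 0

# smaller model: ONE parameter (a common proportion)
p_pool = (e1 + e0) / (n1 + n0)
ll_small = ll_kernel(e1, n1, p_pool) + ll_kernel(e0, n0, p_pool)

# bigger model: TWO parameters (a proportion per group)
ll_big = ll_kernel(e1, n1, e1 / n1) + ll_kernel(e0, n0, e0 / n0)

G = 2 * (ll_big - ll_small)  # -2 log of the likelihood RATIO (smaller over bigger)
df = 2 - 1                   # df = no. of EXTRA parameters in the bigger model
print(round(G, 2), df)       # refer G to chi-square with 1 df (5% cutoff 3.84)
```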
do I have to know how to set up the dummy variables by this other method, "deviation
from means coding"?
I said that we prefer reference coding BUT you should be aware that there are
others and that you need to know what is the default in the software you use.. eg have
you looked into how stata handles categorical variables and how it does the coding (if
indeed it does) ?
we learned chi-square for comparison of counts in tables and it was clear that each
cell should have a minimum of 5 individuals in order for the chi-square to "work"
well. Now, how does this 'concept' translate to the LR?
DOESN'T ....
a distribution (like chisq or F) can serve MANY different purposes..
eg can use an F test to test if two variances are equal
can also use it to test if 2 or more means are equal (ANOVA)
and you didn't object to multiple uses then!
Also, in my notes on ch 2 i wrote
In the limit, with a large n, under the null, 2 times the F[2, n-6] statistic has
a distribution close to a chi-square distribution with 2 degrees of freedom, i.e.
the 2 tests are the same. With binary Y's, we don't spend degrees of freedom estimating
a separate variance parameter for the variation of Y around its expectation; we
assume it is Bernoulli.. i.e. it is like having an infinite number of degrees of
freedom in the denominator of the (partial) F statistic, thereby making it like
a chi-square statistic.
ie F and chisq (F for c621, chisq for c681) are not THAT different ..
I mean the concept of a 'cell"?
not the point here..
can use LR test even when only 1 person in each cell...
LR is more to do with no. of extra parameters in model
will use #cells =#df too.. but that is in ch 5 for fit
(we had an example today with 4 cells and 3 parameters) 
When I calculate the df and also the number of parameters in the model, do I have
to consider the number of design variables?
yes.. i emphasized the number of terms (and I wish they would); the LR tests
usually involve the no. of terms over and above the intercept..
What about in the calculation of the parameters for model (run the logist regression
and obtain the likelihood function)? Should I create the design variables
i think it best to 'create you own'
I expect stata has a way to do them automatically but (a) i haven't learned that
part yet and (b) i prefer to have control over it myself.. and they are not that
hard in Stata
see a few of my examples in the stata code i gave early on
gen ir_black = race == 2 if !missing(race)
gen ir_other = race == 3 if !missing(race)
and then calculate the likelihood function? or the result will be the same if
I leave the categorical variables as is?
i don't know what you mean by 'leave as is' but you certainly can't treat each categorical
variable as a single (scalar) numerical variable
if DRPOS is 1-4, do you expect the same logit gradient from 1=no nodule to 2=left
to 3=right ....... ?
is that what you mean? 
In the assignment, Q3/2 asks us to use a generalized linear model software program
to obtain/calculate a point and a 95% interval estimate for the risk difference.
Do you mean 'genmod'?
yes...
By doing that I could get the estimated risks for both the hypothermia and normothermia
groups and then calculate the point
estimate of the risk difference.
need to put 2 groups in 1 model
How could I get the 95% CI then?
I tried 'proc freq' and the 'riskdiff' option, which gave me an asymptotic 95%
CI. But it doesn't seem to be what you want. Could you give me some clues?
no
FYI, the following is my SAS program:
data Q3;
input care $ infection patients;
cards;
hypo 18 96
_norm 6 104
;
run;
proc genmod data=Q3;
class care;
model infection/patients=care;
run;
by the way, don't trust sas with the coding and the ref gp; make your own indicator
variables
have to tell genmod which scale you are on.... identity; and which distribution..
binomial.
i thought all my genmod eg's mentioned the distribution and link
MUST if use genmod
otherwise you get Gaussian and identity
(not what you want here)
data Q3new ;
set Q3;
outcome = 1; number = infection; output;
outcome = 0; number = patients - infection; output;
run;
proc freq data=Q3new;
tables care * outcome / riskdiff ;
weight number;
run;
this is not a regression

more on aggregation (still all in 1 sample ie no covariate or treatment variable)
Q: Set the data up as 20 separate observations and use a logistic regression software
program to
obtain/calculate a 'logit-based' 95% (frequentist) CI for p.
[see p6 of JH Notes for Chapter 8.1 in course 607]
on bottom of that page 6
From Stata
clear
input n_pos n
1 5
2 5
end
glm n_pos, family(binomial n) link(logit)
the above (from p 6 cited) is an example of AGGREGATED data..
10 individuals spread over 2 lines
1 5
2 5

3 10 (totals)
so going from this to 3/20 is same idea .. lots of ways
to split them up as long as bottom line is 3/20
try various combinations to convince
yourself that same inference no matter the order
Remember that log of likelihood is sum over lik
for each line
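a quick Python check of that last point -- however the 3/20 is split across lines, the binomial log-likelihood (kernel) is the same at every trial value of p:

```python
import math

def loglik_kernel(p, lines):
    # sum of binomial log-likelihood kernels, one per data line (events, trials)
    return sum(e * math.log(p) + (n - e) * math.log(1 - p) for e, n in lines)

p = 0.15  # any trial value of the proportion will do

splits = [
    [(3, 20)],                      # fully aggregated
    [(1, 5), (2, 15)],              # two lines
    [(1, 1)] * 3 + [(0, 1)] * 17,   # one line per individual
]
vals = [loglik_kernel(p, s) for s in splits]
print([round(v, 6) for v in vals])  # identical -- same inference no matter the split
```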
From SAS:
DATA CI_propn;
INPUT n_pos n ;
LINES;
3 10
;
PROC genmod data = CI_propn;
model n_pos/n = /
dist = binomial
link = logit waldci ;
RUN;
will answer the 2sample q in next email 
My colleagues think that you are asking us to consider different possibilities
(1, 2 and 3 successes)
the examples in the book do not help much ... here the position of my successes
will actually matter (if my successes occurred at the beginning or end of the 20 trials
!!) could you comment on the way to get the computer to do it and on the answer you
want us to obtain?
in binomial
which were positive in which order is immaterial ie same inference no matter the
position
in SAS or Stata (see below for Stata)
the following would be disaggregated data
ie 1 line = 1 individual
data dev5_2;
input ID n_observ positive ;
LINES;
1 1 1
2 1 1
etc
19 1 0
20 1 0
;
run;
Title 'Logistic regression for 3 positives observations out of 20';
proc logistic data= dev5_2 descending;
model positive = / rl;
run;
nothing on right hand side so just get beta_0
could also emphasize that den = 1 by running
Title 'Logistic regression for 3 positives observations out of 20';
proc logistic data= dev5_2 descending;
model positive/n_observ = / rl;
run;
by default denom of 1 assumed [ie each record is 1 individual]
why not save typing (AGGREGATE ) ....
data b2;
input ID positive n_observ;
LINES;
1 3 3
2 0 17
;
run;
proc logistic data= b2 descending;
model positive/n_observ = ;
run;
Stata does not permit the following... (SAS does)
Stata wants a dataset with at least 2 records of data
data c2;
input positive n_observ;
LINES;
3 20
;
can also run SAS logistic on this......
proc logistic data= c2 descending;
model positive/n_observ = ;
look at onlinedoc and format of left hand side of proc logistic
********* Several ways in Stata *******
* ** disaggregated (1 record=1 individual) **
clear
input id n_observ positive
1 1 1
2 1 1
etc
18 1 0
19 1 0
20 1 0
end
* by default
logistic positive
logit
* n_observ = 1 for each line, so this gives same est.
logistic positive [fweight = n_observ]
logit
*
* AGGREGATED
*
clear
input pos n
3 10
0 10
end
glm pos , family(binomial n) link(logit)
* another variation
clear
input pos n
2 8
1 12
end
glm pos , family(binomial n) link(logit)
*
* another AGGREGATED
*
clear
input y nmbr
1 3
0 17
end
logistic y [fweight=nmbr]
logit
clear
input y nmbr
1 2
1 1
0 10
0 7
end
logistic y [fweight=nmbr]
logit 
We are having a lot of difficulty with the last question 3b). We looked for the
SAS command to obtain a risk difference with proc genmod and we
found nothing. Where can I find this information?
FIRST
cf Q1 (a) a SINGLE sample, no comparison group
to get proportion directly, what would you get if you had
DATA propn;
INPUT n_pos n ;
LINES;
3 10
;
PROC genmod data = propn;
model n_pos/n = /
dist = binomial link = IDENTITY waldci ;
RUN;
try the above and see.......

NOW, in Q3 you have 2 tx conditions, not 1, so you need a variable to tell sas to
segregate the observations..
to get diff. of 2 proportions directly, what is interpretation of the parameter estimates
from the following ?
DATA propns;
INPUT warm n_pos n ;
LINES;
0 3 10
1 1 10
;
PROC genmod data = propns;
MODEL n_pos/n = warm /
DIST = binomial LINK = Identity waldci ;
RUN;
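a hand-check of what that identity-link model estimates (same toy counts as in the SAS lines above): the 'warm' coefficient is just the risk difference, and the Wald SE is the usual one for a difference of two proportions.

```python
import math

e0, n0 = 3, 10   # warm = 0 line of the toy data
e1, n1 = 1, 10   # warm = 1 line

p0, p1 = e0 / n0, e1 / n1
rd = p1 - p0  # the 'warm' coefficient under the IDENTITY link: a risk DIFFERENCE

se = math.sqrt(p0 * (1 - p0) / n0 + p1 * (1 - p1) / n1)  # Wald SE of the difference
ci = (rd - 1.96 * se, rd + 1.96 * se)
print(round(rd, 3), round(ci[0], 3), round(ci[1], 3))
```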
Info on this in ..
single sample..
[see p6 of JH Notes for Chapter 8.1]
but change the link to IDENTITY if you want to stay
in the proportion (risk) scale
• 2 proportions [prevalences/risks] can be written as 1 master
regression equation using as the 'regressor' variable (usually
called X) an indicator of the index 'exposure' category ( E = 0 if
reference category and = 1 if index category)
Risk[generic E value] = Risk[ref cat.] + RiskDiff if index cat.
[see p1 of JH Notes for Introduction to the Logistic Regression Model ]
for risk diff, key is the identity link
(also, in lecture on monday last, I
did out the example on the whiteboard
of the 6/100 vs 19/100 as a risk DIFF first
and as an ODDS RATIO second..
but i think everyone thinks
odds ratios are sexier (make the user look more sophisticated)
than risk differences...
and i guess, given that the course text is logistic,
that has the same influence..
remember the fitted equation
infection rate = 0.19 - 0.12 x warm_sgy
so a matter of translating that into a sas statement for the regression
remember i said that risk diff is one of the simplest regressions, because all is
in the understandable risk scale.... so could explain to your father-in-law (doubt he would
understand an odds ratio equation or an equation in logit scale )
but would understand if (with NO explicit regression in it ) you showed him this
plot ..
options ls=70 ps = 30;
data a;
INPUT warm n_pos n ;
propn = n_pos / n ; * not suggesting this for class.. but !! ;
LINES;
0 3 10
1 1 10
;
PROC PLOT data=a ;
PLOT propn * warm = "*" / HPOS=20 VPOS=21 ;
RUN;
as you will see, I am deliberately not just giving sas commands as such, because the
1st thing is to see what a risk difference IS, in theoretical regression terms (even
before doing it in SAS)
hope this helps....
JH
ps you will have to change the numbers in my eg above..

So, for me question 4a is similar to 4e; I do not understand
the distinction to be made. Thank you.
in regular regression... ans to part a would be
mu[Y | X] = E[Y | X] = beta0 + beta1 * X
variation of Y | X from mu[Y | X] ~ Gaussian(0, sigma_sq)
ie not about the actual data, but IN GENERAL (ie theoretical, the LAW, stated in
terms of PARAMETERS)
ans to q e would be
estimated mu[Y | X] = beta0_hat + beta1_hat * X (ie empirical, the fitted, stated
in terms of parameter ESTIMATES) etc..
Does that illustrate the difference... ? 
If we have to do a log rank test, is this syntax in SAS correct?
*Log Rank Test (aka Savage Test);
proc npar1way savage data=EPIB681Ass4_8b;
class entry_cat;
var duration;
exact;
run;
i don't see any censoring in that formulation 
25 jan
I'm a little unclear on question 8a. Maybe I'm looking at it in a simple manner,
but wouldn't the proportion still in the program be those with a
duration >= 6 divided by the total number of PhD students (119)?
what to do with the ones who entered in 1998 or 1999?
need to go carefully year by year, just like Bradford Hill, and the most recent
operations
What procedure? Is the 5 years enough to demonstrate that we understand?
estimate of percent still there
= continued 'fraction of' a 'fraction of' a ....
multiplying the fractions is easy
setting them up is a bit trickier.. as per your Q 
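the 'fraction of a fraction' idea in a few lines of Python (made-up year-by-year fractions, not the assignment's data):

```python
# of those still at risk at the start of each year, the fraction
# remaining in the program at the end of that year (hypothetical numbers):
yearly_fractions = [0.9, 0.8, 0.75, 0.9, 0.85]

surv = 1.0
for f in yearly_fractions:
    surv *= f            # continued 'fraction of' a 'fraction of' a ...
print(round(surv, 4))    # estimated proportion still there after 5 years
```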
we have two main questions:
1) On "Homegrown" Question 3, when we try to do the Woolf method, we come
across some divisions by 0 (the OR of some strata, and the variances). How
should we deal with the OR when we have to divide by 0??
drop this stratum ...
(some people add a 1/2 to each of the 4 cells, but don't do that here.. the point is that
woolf gets in trouble if one or other of ad or bc is zero; MH doesn't drop the stratum
(consider it uninformative) until BOTH ad and bc are zero.. )
BUT woolf is great for large strata (as in his own 1955 paper)
Should we consider then the variance equal to infinity and then the weight equal
to 0 ??
I guess you could.. conceptually that is what you are doing by dropping the stratum
!
easier to just DROP than to ask your calculator to show what the reciprocal of infinity
is !
I am curious what the various softwares do for woolf in the situation above..
might be worth checking.. 
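for what it's worth, here is a Python sketch of Woolf's inverse-variance pooling that simply drops a stratum containing a zero cell (a zero cell gives infinite Var(ln OR), i.e. weight 0, as discussed above). The counts are made up:

```python
import math

strata = [           # (a, b, c, d) for each 2x2 stratum (hypothetical counts)
    (10, 20, 5, 40),
    (8, 12, 6, 30),
    (0, 15, 3, 25),  # a zero cell: Woolf cannot use this stratum
]

num = den = 0.0
for a, b, c, d in strata:
    if min(a, b, c, d) == 0:
        continue                      # drop it (its weight would be 0 anyway)
    ln_or = math.log((a * d) / (b * c))
    w = 1 / (1/a + 1/b + 1/c + 1/d)   # weight = 1 / Var(ln OR)
    num += w * ln_or
    den += w

pooled_or = math.exp(num / den)       # Woolf pooled OR over the usable strata
print(round(pooled_or, 3))
```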
a couple of more questions please
1. the table from the NEJM article on vaccines and autism does not give the number
of unvaccinated children with and without the disease in each stratum, and therefore
I am having difficulty calculating the RR in each stratum.
remember that it is only the RELATIVE sizes of the exposed and unexposed person-years
that matter.. it was about 67K children born each year (if you divide the total no. of
children by the no. of cohorts)
2. when you ask for calculations in each stratum, do you mean for each age subgroup
under the groups "age of vaccination",
age ONLY .. so 8 age groups; take 1-year age groups
you can add the events and py from the half-years and the full squares to make one
py per age-category
[ Since focus here is on how the calculations work, ignore (collapse over) calendar
time for now.
" interval since vaccination" and "year of vaccination", which
would be 17 subgroups in total? Is that what you want, or you just want these calculations
for "age of vaccination"?
not interested in interval since, just age 
jan 16
The test-based CI is confusing (I don't recall it from the class and there is no good example
in the JH notes).
there is an example.. how good the example is of course subject to opinion..
i would have thought the worked eg on page 6 of the ch 9 notes for epi .. is fairly
clear... (there is another a few pages on
but with 6 strata)
and mantel and haenszel themselves have a very clearly worked out chisq test
.. so the last part of calculating the testbased CI is a small one...
Do you want me to calculate a test-based CI with the OR_mh being the point estimate?
yes, that is the usual way (I mentioned in class that {OR point est. by MH, test-based
CI by Miettinen} was an often used pairing in the early days before Robins
et al...)
If so, how do I get this "observed z statistics"?
the z statistic is the square root of the chisq statistic.. and vice versa
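so the test-based (Miettinen) limits are just OR_MH raised to the powers (1 +/- z_alpha/z_obs); a short Python check with made-up numbers:

```python
import math

or_mh = 2.5   # hypothetical MH point estimate
chisq = 7.3   # hypothetical MH chi-square statistic (1 df)

z = math.sqrt(chisq)          # observed z = sqrt(chi-square), and vice versa
lo = or_mh ** (1 - 1.96 / z)  # test-based 95% limits
hi = or_mh ** (1 + 1.96 / z)
print(round(lo, 3), round(hi, 3))
```

note the limits are symmetric about OR_MH on the log scale, so lo * hi = OR_MH^2.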

15 jan
I am confused in terms of the formulas for OR and RR CIs; your notes present 2 different
ones for each, on pages 4 and 6 of MM 8.2, so my questions are:
there are even more... there is also a conditional one i didn't mention..
1. do I need to know these formulas by heart for the exam
exam will be open book
or they are for our information and future use?
both
2. when do I choose one or the other?
test-based is faster (if by hand and you already have the chisq test done) and accurate
near the null .. not quite so accurate as one leaves the null
nowadays people use a computer so it is less of an issue, and test-based is not as common (was
more so in the early days)

13 jan
Is there a circumstance where I should prefer the Clopper-Pearson CI method instead
of looking at the table or calculating the exact CI?
Clopper-Pearson is the principle of getting 2 limits L and U such that (exactly)
Prob( >= observed | L ) = Prob( <= observed | U ) = alpha/2, ie for a 100*(1-alpha)%
CI
can compute it ..
- by trial and error (as in excel, but also if you have the binomial tables on paper)
- directly, using a math'l link between binomial tail areas and the F distribution
(so if you have tables of the F distr'n, on paper or in excel)
do I have to know the Clopper-Pearson formula for the exam?
for exam, i would be interested that people could do it if given a paper table of
binomial probs. ie they should know the principle above
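the trial-and-error version of that principle is easy to automate; a Python sketch that bisects each limit until the exact binomial tail area equals alpha/2 (shown for 3 positives out of 20):

```python
import math

def upper_tail(n, k, p):
    # P(X >= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k, n + 1))

def lower_tail(n, k, p):
    # P(X <= k) for X ~ Binomial(n, p)
    return sum(math.comb(n, x) * p**x * (1 - p)**(n - x) for x in range(k + 1))

def clopper_pearson(k, n, alpha=0.05, tol=1e-8):
    # find L with P(X >= k | L) = alpha/2 and U with P(X <= k | U) = alpha/2
    L = 0.0
    if k > 0:
        a, b = 0.0, 1.0
        while b - a > tol:
            m = (a + b) / 2
            a, b = (m, b) if upper_tail(n, k, m) < alpha / 2 else (a, m)
        L = (a + b) / 2
    U = 1.0
    if k < n:
        a, b = 0.0, 1.0
        while b - a > tol:
            m = (a + b) / 2
            a, b = (m, b) if lower_tail(n, k, m) > alpha / 2 else (a, m)
        U = (a + b) / 2
    return L, U

L, U = clopper_pearson(3, 20)
print(round(L, 4), round(U, 4))  # exact 95% CI for 3/20
```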

But, since you took the time to look, here is a link with a few statistics jokes.
I like the last one, about the statisticians and the epidemiologists on the train.
http://personal.ecu.edu/wuenschk/humor/stats.txt 
