513-607: Principles of Inferential Statistics (Fall 2001)

McGill University, Department of Epidemiology and and Biostatistics
EPIB-681: Data Analysis II (Winter 2004)

Frequently Asked Questions (FAQs)

Question / Suggestion
Walker (pg 109 Observation and Inference) calculates the RR in a poisson distribution by a different formula than the one seen in JH notes.
He defines the Var (lnRR)= 1/E+cases + 1/E-cases however, I don't find the same answers when I apply this formula to the exampples from your notes

on right hand column of p 29 of Poisson notes, i give the same formula .... but with just different notation...
(I say its is like "Woolf" with the denominators so large that 1/denom is zero )

var [ ln rr ] = 1/c1 + 1 c0 ]

c = # cases

subscript 0 = E- ; subscript 1 = E+

Is p29 where you were looking? or in some other notes of mine?

would you accept this approach?

sure .. since I think we are both taking the same approach

also, presumably, when you say RR in a poisson distribution

you refer to TWO poisson variables, one for exposed, one not

Also, here, when I convert the logit to proportion for year_87 (beta=-0.0032), I get pi=0.5.

1. a beta is not a logit.. is is a DIFFERENCE in logits.. !!

the only beta that is a logit is the beta_0

so exp[beta] is the increase/decrease on odds

How can the probability of injury increase by 0.5 for every unit of time (1-year) when pi has to be
between 0 and 1?

exp[-0.0032] is what is relevant.. and it is applied to (acts on) the odds...

exp[-0.0032]/(1 + this) doesn't make sense here

exp[-4.8089]/(1 + this) does ..

2) Question 4, part I. Do we need to use proportions from this example

no.. you need to head higher !! they don't diverge enough at the proportions we are studying, do they?

(by the way.. am treating injury rates as 'proportions' here for now ...... will eventually move on to regression for rates.. but not now.. )

or can we compare the log and logit of ficticious proprotions (i.e. ranging from say 0.001 to 0.1)?

yes.. am curious where (without calculations) you think it will be ?

likewise, at what point do you consider we are in the 'rare disease' territory? [somewhat related... ]

would you mind telling me the importance and the meaning of performing a chi-square test on the strata specific ln[OR] ?

i wouldn't mind at all...

are you talking about the last section in chapter , where the emphasis is on a few 2 x 2 tables..

are they talking a test for homogeneity of ORs ?

it would be like asking if the effect is same across strata ? ie a test of effect modification..

ie are odds ratios same for different folks?

The H&L presents that as a test for homogeneity but do not explain the consequences of it.

if not homogeneous, and if very different then answer re effect is 'it depends' ie not a 'one OR for all' story..

and so need to report or's separately for each level of modifier.

Does this mean that I can not use a pooled unconfounded OR?

if ORs are very different (beyond random variation) what would be the purpose of an 'average' OR?

a bit like saying the average person has one testicle and one ovary!

don't get into confounding.. since cant say JUST from test of homogeneity if there is or not (see next q)

Does this mean that there is confounding??

If i tell you that the or in males is 2.0 and in females 2.0 can you tell just from that whether there is confounding?
NO .. all depends on how exposure and gender are distributed in the sample
AND on whether gender is even a risk factor in 1st place..
when comparing univariate with multiple regression models H&L states that "in MR there will be p+1 likelihood EQUATIONS "...what does it mean by that? I thought the equation was the same, just more coefficients were included....

they are talking about ESTIMATING equations (the ones to find the ML estimates..)

remember we tried to manually (by trial & error) maxinize a likelihood

another way would be to find the tangent (derivative) and solve to see where it is zero

that is in "1-D" because in my very simple example, just 1 parameter

same principle in "p+1"-D if maximizing over p+1 parameters

for example, where is the max in p 6 of my 2-D likelihood in Intro to lr notes .. need t watch tangents in north-south (B1) and east-west (B0) directions

by the way, how many estimating equations do you think are involved in fitting the LS (least sq) estimates in c621.... same #: p+1 !!!!

what is the number of the degrees of freedom I should use in the log likelihood ratio? Is it k-1? (k would mean all the coeff, including the intercept)

what exactly is the like RATIO ?

ratio of WHAT to WHAT ??

if you look at the WHAT and WHAT, the no. of extra parameters will be obvious

do I have to know how to set up the dummy variables by this oher method "deviation from the means code"?

I said that we prefer reference coding BUT you should be aware that there are others and that you need to know what is default in the software you use.. eg have you looked into how stata handles categorical variables and how it does coding (if indeed it does_ ?

we learned chi-square for comparison of counts in tables and it was clear that each cell should have a minimum of 5 individuals in order to the chi square to "work" well. Now, how does this 'concept" translate in the LR?

DOESN'T ....

a distribution (like chi-sq or F) can serve MANY different purposes..

eg can use an F test to test if two variances are equal

can also use it to test if 2 or more means are equal (ANOVA)

and you didn't object to multiple uses then!

Also, in my notes on ch 2 i wrote

In the limit, with a large n, under the null , 2 times the F[2, n-6] statistic has a distribution close to a chi-square distribution with 2 degrees of freedom -- i.e. the 2 tests are the same, With binary Y's, we don't spend degrees of freedom estimating a separate variance parameter for the variation of Y around its expectation-- we assume it is Bernoulli.. i.e. it is like having an infinite number of degrees of freedom in the denominator of the (partial) F statistic -- thereby making it like a chi-square statistic.

ie F and chi-sq (F for c621, chi-sq for c681) are not THAT different ..

I mean the concept of a 'cell"?

not the point here..

can use LR test even when only 1 person in each cell...

LR is more to do with no. of extra parameters in model

will use #cells =#df too.. but that is in ch 5 for fit
(we had an example today with 4 cells and 3 parameters)
When I calculate the df and also the number of parameters in the model, do I have to consider the number of design variables?

yes.. i emphasized (and I wish they would) the number of terms and the lr tests usually indicated the no. of terms over and above the intercept..

What about in the calculation of the parameters for model (run the logist regression and obtain the likelihood function)? Should I create the design variables

i think it best to 'create you own'

Im expect stata has a way to do them automatically but (a) i haven't learned that part yet and (b) i prefer to have control over it myself.. and they are not that hard in Stata

see a few of my examples in the stata code i gave early on

gen ir_black = race == 2 if !missing(race)
gen ir_other = race == 3 if !missing(race)

and then calculate the likelihood function? or the result will be the same if I leave the categorical variables as is?

i dont know what you mean by 'leave as is' but you certainly cant treat each cat/ variable as a single (scalar) numerical variable

if DRPOS is 1-4, do you expect the same logit gradient from 1-no nodule to 2-left to 3-right ....... ?

is that what you mean?
In the assignment, Q3/2 asks us to use a generalized linear model software program to obtain/calculate a point- and a 95% interval estimate for the risk difference. Do you mean 'genmod'?


By doing that I could get the estimate risks for both the hypothermia and normothermia group and then calculate the point
estimate of the risk difference.

need to put 2 groups in 1 model

How could I get the 95% CI then?

I tried 'proc freq' and the 'riskdiff' option, which gave me a asymptotic 95%
CI. But it doesn't seem to be what you want. Could you give me some clues?


FYI, the following is my SAS program:

data Q3;
input care $ infection patients;
hypo 18 96
_norm 6 104

proc genmod data=Q3;
class care;
model infection/patients=care;

by the way, dont trust sas with the coding and the ref gp make your own indicator variables

have to tell genmod which scale you are on ....identity and which distribution.. binomial .

i thought all my genmod eg's mentioned the distribution and link

MUST if use genmod
otherwise get Gaaussin amnd identity
(not what you want here)

data Q3new ;
set Q3;
outcome = 1; number = infection; output;
outcome = 0; number = patients-infection; output;

proc freq data=Q3new;
tables care * outcome / riskdiff ;
weight number;

this is not a regression
more on aggregation (still all in 1 sample ie no covariate or treatment variable)

Q: Set the data up as 20 separate observations and use a logistic regression software program to
obtain/calculate a 'logit-based' 95% (frequentist) CI for p.

[see p6 of JH Notes for Chapter 8.1 in course 607]

on bottom of that page 6

From Stata

input n_pos n
1 5
2 5
glm n_pos, family(binomial n) link(logit)

the above (from p 6 cited) is an example of AGGREGATED data..
10 invividuals spread over 2 lines

1 5
2 5
3 10 (totals)

so going from this to 3/20 is same idea .. lots of ways
to split them up as long as bottom line is 3/20

try various combinations to convince
yourself that same inference no matter the order

Remember that log of likelihood is sum over lik
for each line

From SAS DATA CI_propn; INPUT n_pos n ;
3 10
PROC genmod data = CI_propn;
model n_pos/n = /
dist = binomial
link = logit waldci ;

will answer the 2-sample q in next email

My colleagues think that you are asking us to consider different possibilities (1, 2 and 3 successes)
the examples in the book does not help much ...here the position of my successes will actually matter (if my successes occured at the begining of end of the 20 trials !!) could you comment on the way to get the computer to do it and on the answer you want us to obtain?

in binomial

which were positive in which order is immaterial ie same inference no matter the position

in SAS or Stata (see below for Stata)

the following would be dis-aggregated data

ie 1 line = 1 individual

input ID n_observ positive ;
1 1 1
2 1 1
19 1 0
20 1 0


Title 'Logistic regression for 3 positives observations out of 20';
proc logistic data= dev5_2 descending;
model positive = / rl;

nothing on right hand side so just get beta_0

could also emphasize that den = 1 by running

Title 'Logistic regression for 3 positives observations out of 20';
proc logistic data= dev5_2 descending;
model positive/n_observ = / rl;

by default denom of 1 assumed [ie each record is 1 individual]

why not save typing (AGGREGATE ) ....

data b2;

input ID positive n_observ;

1 3 3
2 0 17

proc logistic data= b2 descending;
model positive/n_observ = ;


Stata does not permit the following... (SAS does)
Stata wants a dataset with at least 2 records of data

data c2;
input positive n_observ;
3 20

can also run SAS logistic on this......

proc logistic data= c2 descending;
model positive/n_observ = ;

look at onlinedoc and format of left hand side of proc logistic

********* Several ways in Stata *******

* ** disaggregated (1 record=1 individual) **


input id n_observ positive
1 1 1
2 1 1


18 1 0
19 1 0
20 1 0

* by default 

logistic positive

* n_observ = 1 for each line, so this gives same est.

logistic positive [fweight = n_observ]

input pos n
3 10
0 10
glm pos , family(binomial n) link(logit)

* another variation

input pos n
2 8
1 12
glm pos , family(binomial n) link(logit)

* another AGGREGATED 

input y nmbr
1 3
0 17
logistic y [fweight=nmbr]

input y nmbr
1 2
1 1
0 10
0 7
logistic y [fweight=nmbr]
Nous avons beaucoup de difficult/s avec la dernidre question 3b) . Nous avons cherch/ la commande de SAS pour obtenir un risk difference avec proc genmod et nous n’avons rien trouv/. puis-je trouver mon information?


cf Q1 (a) a SINGLE sample, no comparison group

to get proportion directly, what would you get if you had

DATA propn;
INPUT n_pos n ;
3 10
PROC genmod data = propn;
model n_pos/n = /
dist = binomial link = IDENTITY waldci ;

try the above and see.......


NOW, in Q3 you have 2 tx conditions, not 1, so you need a variable to tell sas to segregate the observations..

to get diff. of 2 proportions directly, what is interpretation of the parameter estimates from the following ?

DATA propns;
INPUT warm n_pos n ;
0 3 10
1 1 10
PROC genmod data = propns;
MODEL n_pos/n = warm /
DIST = binomial LINK = Identity waldci ;

Info on this in ..

single sample..
[see p6 of JH Notes for Chapter 8.1

but change the link to IDENTITY if want to stay
in proportion (risk) scale

• 2 proportions [prevalences/risks] can be written as 1 master
regression equation using as the 'regressor' variable (usually
called X) an indicator of the index 'exposure' category ( E = 0 if
reference category and = 1 if index category)
Risk[generic E value] = Risk[ref cat.] + RiskDiff if index cat.

[see p1 of JH Notes for Introduction to the Logistic Regression Model ]

for risk diff, key is the identity link

(also, in lecture on monday last, I
did out the example on the whiteboard
of the 6/100 vs 19/100 as a risk DIFF first
and as an ODDS RATIO second..

but i think everyone thinks
odds ratios are sexier (more user look more sophisticated)
than risk differences...

and i guess, given that the course text is logistic
that has same influence..

remember the fitted equation

infection rate = 0.19 - 0.12 x warm_sgy

so a matter of translating that into a sas statement for the regression

remember i said that risk diff is one of the simplest regressions,because all is in understandable risk scale.... so could explain to father in law (doubt he would understand odds ratio equation or an equation in logit scale )

but would understand if (with NO explicit regression in it ) you showed him this plot ..

options ls=70 ps = 30;
data a;
INPUT warm n_pos n ;
propn = n_pos / n ; * not suggesting this for class.. but !! ;   
0 3 10
1 1 10
PROC PLOT data=a ;
PLOT propn * warm = "*" / HPOS=20 VPOS=21 ;

as you will se, I am deliberately not just giving sas commands as such because the 1st thing is to see] what a risk difference IS, in theoretical regression terms (even before doing it in SAS)

hope this helps....


ps you will have to change the numbers in my eg above..
Donc pour moi la questio 4 a est similaire à 4 e, je ne comprends pas la distinction à faire. Merci.

in regular regression... ans to part a would be

mu[Y | X] = E{y | X] = beta0 + beta1 * X

variation of Y | X from mu[Y | X] ~ Gaussian(0,signa_sq)

ie not about the actaal data , but IN GENERAL (ie theoretical, the LAW, stated in terms of PARAMETERS)

ans to q e would be

estimated mu[Y | X] = beta0_hat + beat_hat1 * X (ie empirical, the fitted, stated in terms of parameter ESTIMATES) etc..

Does that illustrate the difference... ?

If we have to do a log rank test, is this syntax in SAS correct?

*Log Rank Test (aka Savage Test);
proc npar1way savage data=EPIB681Ass4_8b;
class entry_cat;
var duration;

i dont see any censoring in that formulation
25 jan

I'm a little unclear on question 8a. Maybe I'm looking at it in a simple manner but wouldn't the proportion still in the program be those with a
duration >=6 divided by the total number of PhD students (119).

what to do with the ones who entered in 1998 or 1999?

need to go carefully year by year, just like Bradford Hill, and the most recent operations

What procedure is the 5 years enough to demonstrate that we understand?

estimate of percent still there

= continued 'fraction of' a 'fraction of' a ....

multiplying the fractions is easy

setting them up is a bit trickier.. as per your Q
we have two main questions:

1) On "Homegrown" Question 3, when we try to do the Wolf method, we come across with some divisions over 0 (the OR of some strata, and the variances). How should we deald with the OR when we have to % by 0??

drop this stratum ...

(some people add a 1/2 to each of 4 cells, but don't do that here.. point is that woolf gets in trouble if 1 or other of ad or bc is zero M-H doesn't drop the stratum (consider it uninformative) until BOTH ad and bc are zero.. )

BUT woolf is great for large strata (as in his own 1955 paper)

Should we consider then the variance equal to infinity and then the weight equal to 0 ??

I guess you could.. conceptually that is what you are doing by dropping the stratum !

easier to just DROP than to ask your calculator to show what the reciprocal of infinity is !

I am curious what the various softwares do for woolf in the situation above..

might be worth checking..

a couple of more questions please

1. the table from the NEJM article on vaccines and autism, does not give the number of unvaccinated children with and without the disease in each strata and therefore I am having difficulty to calculate the RR in each strata.

remember that it is only the RELATIVE sizes of the exposed and unexposed person years that matter.. it was about 67K children born each year (if divide total no. children by no. of cohorts)

2. when you ask for calculations in each stratum, do you mean for each age subgroup under the groups "age of vaccination",

age ONLY .. so 8 age groups, take 1-year age groups

you can add the events and py from the half-years and the full squares to make one py per age-category

[ Since focus here is on how the calculations work, ignore (collapse over) calendar time for now.

" interval since vaccination" and "year of vaccination", which would be 17 subgroups in total? Is that what you want, or you just want these calculations for "age of vaccination"?

not interested in interval since, just age

jan 16

The test-based is confusing (I don't recall from the class and there is no good example in the JH notes).

there is an example.. how good the example is of course subject to opinion..

i would have thought the worked eg on page 6 of the ch 9 notes for epi .. is fairly clear... (there is another a few pages on
but with 6 strata)

and mantel and haenszel themselves have a very clearly worked out chi-sq test .. so the last part of calculating the test-based CI is a small one...

Do you want me to calculate a test based CI with the ORmh being the point estimate?

yes, that is the usual way (I mentioned in class that {OR point est. by MH, test-based CI by Miettinen"} was an often used pairing in the early days before Robins et al...

If so, how do I get this "observed z statistics"?

the z statistic is the square root of the chi-sq statistic.. and vice versa
15 jan

I am confused in terms of the formulas for OR and RR CI your notes present 2 different ones for each in pages 4 and 6 MM 8.2 so my questions are:

there are even more... there is also a conditional one i didnt mention..

1. do I need to know these formulas by heart for the exam

exam will be open book

or they are for our information and future use?


2. when do I choose one or the other?

test based is faster (if by hand and already have the chi-sq test done) and accurate near the null .. not quite so accurate as one leaves the null

nowadays people use a computer so less an issue and test-based not as common (was more so in early days)

13 jan

Is there a circumstance where I should prefer the method Cloppler-Pearson CI instead of looking at the table or claculating the exact CI?

Klopper-Pearson is the principle of getting 2 limits L and U such that (exactly)

Prob(>= observed | L ) = prob( <= observed | U ) = alpha/2 ie for 100*(1-alpha) CI

can compute it ..
- by trial and error (as in excel, but alos if have the binomial tables on paper)
- directly using a mathl. link between binomial tail areas and the F distribution (so if have tables of F distrn, on paper or excel)

do I have to know the clopper-pearson formula for the exam?

for exam, i would be interested that people could do it if given a paper table of binomial probs. ie they should know the principle above

But, since you took the time to look, here is a link with a few statistics jokes. I like the last one, about the statisticians and the epidemiologists on the train.