
McGill University, Department of Epidemiology, Biostatistics and Occupational Health
EPIB-634: Survival Analysis and Related Topics (Winter 2007)

Frequently Asked Questions (FAQs), Clarifications etc...


Each entry below pairs a QUESTION / SUGGESTION with a RESPONSE / CLARIFICATION.
-- wed jan 10 -- Question 5, I do not understand what you mean by quasi-denominators. What is the rationale behind dividing the original counts of cases by the counts of cases in the new sample ("survey")?

I did not suggest dividing "by the counts of cases in the new sample ('survey')".

I suggested going to 100 RANDOMLY SELECTED HOUSES to find out about the water supply

It's like we have one team who worked on getting the right numerators (286 and 14), and another on the denominators. In fact the companies provided the denominators (say N1 and N0 houses) the YEAR before the epidemic, for other purposes. But IF we didn't have that survey, Snow could have gone on foot to 100 randomly selected houses (without regard to whether a house contained one or more cases!), or to 1000 houses, etc.

i.e. he could, say, take every 50th house. Then, with n1 and n0 houses in his sample (1 = exposed, 0 = not), he could reason:

for the exposed and the unexposed, we have an

ESTIMATED PT denominator of 50 x n1 (houses) x 7 (persons per house) x 4 (weeks)

ESTIMATED PT denominator of 50 x n0 (houses) x 7 (persons per house) x 4 (weeks)

now divide the 286 and the 14 by these two denominators, take their ratio, and what do you get?

it looks like (AFTER CANCELLATION OF THE 50, THE 7 AND THE 4)

(286 / n1) divided by (14 / n0)

n1 and n0 are 50 times too small; they are only partial denominators

but in the final result it appears as though they are real denominators. They aren't, of course, so we might call them quasi-denominators
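If you want to see the cancellation happen in R (the split n1 = 80, n0 = 20 of the 100 sampled houses is a number I made up purely for illustration):

# quasi-denominator sketch: 1-in-50 sample of houses, 7 persons/house, 4 weeks
a <- 286; b <- 14        # Snow's numerators (exposed and unexposed deaths)
n1 <- 80; n0 <- 20       # hypothetical split of the 100 sampled houses

rate1 <- a / (50 * n1 * 7 * 4)   # estimated rate per person-week, exposed
rate0 <- b / (50 * n0 * 7 * 4)   # estimated rate per person-week, unexposed

rate1 / rate0          # the 50, the 7 and the 4 cancel...
(a / n1) / (b / n0)    # ...leaving (286/n1) / (14/n0): the same number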

For part ii, if I use the following margin of error to calculate the CIs:
exp[z * sqrt(1/a + 1/b + 1/c + 1/d)]
then the CIs should narrow as the sample size decreases, right?


1/a and 1/b don't change, but 1/c and 1/d (and thus the SE of the log) get smaller as c and d get bigger

so, surely, it's the other way round

Is this the wrong formula?

NO.. and it accounts for the fact that you sampled the denominators.
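Here is a rough R sketch of that behaviour, to make the direction concrete (the 4:1 exposed:unexposed split of the sampled houses is a made-up number, purely for illustration):

a <- 286; b <- 14               # the case counts are fixed, so 1/a + 1/b never changes
for (houses in c(1000, 100, 25)) {
  c1 <- 0.8 * houses; d0 <- 0.2 * houses     # hypothetical 4:1 split of the sample
  rr  <- (a / c1) / (b / d0)
  moe <- exp(1.96 * sqrt(1/a + 1/b + 1/c1 + 1/d0))  # multiplicative margin of error
  cat(houses, "houses: RR =", round(rr, 2),
      " 95% CI:", round(rr / moe, 2), "to", round(rr * moe, 2), "\n")
}

The point estimate stays put, but the interval keeps widening as the survey of houses shrinks: the 1/a + 1/b part is the floor, and the 1/c + 1/d part is the extra price paid for sampling the denominators.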

Question 6. We did not learn how to calculate CIs for ratios by the usual manual way in 607. We did calculate CIs for parameters such as means and proportions. For those CIs, the general formula was: parameter estimate + or - z times SE.


If I use that method, then I would take the ratio and + or - 2 times the SE for a 95% CI.
In your example on page 2, the SE = 2.5 for the ratio of 13.1. So the CI calculated using that method would be 13.1 - 5, 13.1 + 5, or (8.1, 18.1), which is very different from the CI of (5.2, 32.8). Is this the right approach?


the 2.5 is a <<multiplicative>> MARGIN of error, not a STANDARD error: the CI is 13.1 / 2.5 to 13.1 x 2.5, i.e. (5.2, 32.8)

I was unfortunately unable to attend the tutorial you gave this afternoon because I had an EPI class at that time. Are you available in Purvis anytime tomorrow afternoon after 1pm?

Yes, I will be there... Come see me. It's indeed better face to face.

I am getting pretty confused with questions 5 and 6, and it may be better if I asked my questions face-to-face. If not, any help you can provide by e-mail would be appreciated.

I think that part of what is playing out here is that, from an epi point of view, point estimates are easy and don't need to involve a statistical 'model' (it's just intuitive). But CIs do, because they require us to calculate variances, and variances need a model. Of course there are also lots of new stats ideas too, quite apart from the epi.
-- wed jan 10 -- For #6 are you implying that the usual way would be to just add and subtract the MOE for the log rate ratio?

yes indeed.. the usual way is what Rothman suggests: work out the CI in the log scale, i.e. with +/- MOE, then exponentiate both limits

Or would the "usual way" be to take the log of the MOE?

Not sure how this would work. Where would one find an MOE to take a log of (pardon my grammar!)?

For #8 I've done the computing portion, but what is the formula for an exact CI for the rate ratio? I've read page 29 several times and do not see any of the formulas stating that they are "exact." Pages 137-139 of Rothman also do not mention whether or not the formula provided is "exact."

It is exact in the sense that one can use an exact Binomial CI to get there.

To obtain a CI, we treat the proportion of exposed cases, 0.732, as a binomial proportion, based on 41 "positives" out of a total of 56 cases (obviously, if the proportion were based on 8 exposed cases out of 11 cases, or 410 out of 560, the precision would be very different!)

From a table (or other source) of EXACT CIs for proportions (see e.g. the table on the 607 web page), one can determine that the 95% CI for p is p_L = 0.596 to p_U = 0.842.

Substitute these for the point estimate to get

RR_L = (0.596 / 0.404) / (28010/19017) = 1.00
RR_U = (0.842/0.158) / (28010/19017) = 3.61

Rothman & Walker emphasize formula RRL,U = {pL/U / (1-pL/U) } / {n1 / n0} over basis for it.

the point of this test is that, by conditioning on the sum of 2 Poisson random variables, the distribution of one of them is BINOMIAL (it's a simple result to prove in a math-stat course)

so we have gone from 2 dimensions to 1, and the Binomial lends itself to EXACT methods (we didn't cover exact Binomial CIs, as I thought they were covered in 607)

but what is the formula for an exact CI for the rate ratio?

RR_one limit = (P_upper / (1 - P_upper)) / (n1/n0)
RR_other limit = (P_lower / (1 - P_lower)) / (n1/n0)

with P_lower and P_upper being the EXACT limits based on observing a binomial proportion c1 / (c1 + c0) in a sample of size (c1+c0)
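In R, base binom.test gives an exact (Clopper-Pearson) binomial CI, so the whole recipe, with the Ayas et al numbers, is just a few lines (a sketch of the argument, not necessarily how any particular author computed it):

c1 <- 41; c0 <- 15               # exposed and unexposed cases
n1 <- 28010; n0 <- 19017         # the two person-time denominators

p.ci  <- binom.test(c1, c1 + c0)$conf.int   # exact 95% CI for p = c1/(c1+c0)
rr.ci <- (p.ci / (1 - p.ci)) / (n1 / n0)    # convert each p limit to an RR limit
round(rr.ci, 2)                             # roughly (1.00, 3.61), as above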
-- tue jan 9 -- In question 3 you tell us to concentrate on one cell (33 and 36.3). In order to calculate a chi-squared stat, do we use 1 degree of freedom? Since we usually would use (r-1)(c-1), to me this gives zero degrees of freedom but that makes no sense.

These rules sometimes break down.

And chi-sq isn't the only test

Remember chi-sq with 1 df is just the square of z. Why not z = (O - E) / sqrt(E)?

That's legit.. O given E is Poisson, but close to Gaussian (36.3 is large)

Now square it and you get z^2 = chi-sq(1) = (O - E)^2 / E ....... Familiar?
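In R, with the O = 33 and E = 36.3 from the question (the doubled exact tail below is just one of several two-sided conventions):

O <- 33; E <- 36.3
z <- (O - E) / sqrt(E)     # Gaussian approximation to the Poisson(E) count
z^2                        # = (O - E)^2 / E, the familiar 1-df chi-square statistic
2 * pnorm(-abs(z))         # two-sided p from the z approximation
2 * min(ppois(O, E), 1 - ppois(O - 1, E))   # an exact two-sided Poisson p-value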

A way to understand these is to look at my notes from 607, where I speak of 2 x 2 tables, 2 x 1 tables, 1 x 2 tables, and even 1 x 1 tables (the one we have here). See p 9 of my notes on chi-sq for 607 (all my notes for 607 are on the 607 website).

moral: (in second term) there is more than one way to look at 1 number and the null value it is compared with.. Same with larger tables!! Even in first term, several ways to skin a cat.. I never liked teaching chi-sq for a comparison of 2 random proportions or numbers with each other, or of one with a fixed value, because the z-test does the same thing, and doesn't lose the sign.. Squaring does....
-- tue jan 9 -- For question 5, regarding John Snow's rates and intervals, I actually found the opposite trend to what you said we would find.

My CIs for 1000 were narrower, and for 100 were even narrower. This could be because of the formula I used for the

SE(ln(RR)) = sqrt(1/a + 1/b - 1/N1 - 1/N2)
* I considered the rates to be risk ratios, and not odds ratios or incidence rate ratios (e.g. we are looking at number of cases / number at risk, and this is a risk ratio, is it not?)


NO. Even with full denominators, these are not risks. The full denominators are (people in this many households) x (4 weeks)

And so if we take, say, 7 persons per house (as Snow had to do, until he got the numbers of people later), we have approx 40K x 7 x 4 person-weeks and 26K x 7 x 4 person-weeks

By your logic, if we took a sample of 25 houses, we would have binomial denominators of 15 and 10, and NEGATIVE variances, because one denominator (10) is less than the numerator (14). But in fact a denominator sample of 25 yields a proper estimate, albeit a very imprecise one: in addition to the 286 and 14 being random, the 10:15 estimate of the ratio of exposed:unexposed PT is now very imprecise to start with. Think of the 1/c + 1/d as the price paid for partial denominators rather than full ones, i.e. if c -> 40K and d -> 26K, these 2 extra prices would become very small.

In fact, think of c and d as substitutes for the 2 person-times. And note that even though ad/bc has the classic odds-ratio form, its 'true structure' is in fact (a / partial PT denom) / (b / partial PT denom). And c and d are not even in the same units as a and b!!


i.e. think of it as

ad/bc = (a/b) / (c/d), which is AN ESTIMATE of (a / full PT denom) / (b / full PT denom), and

(a / full PT denom) / (b / full PT denom) is an estimate of the true RATE ratio

Out of interest, do you know how Snow established which 286 of the 300 in the numerator lived in houses supplied by the Southwark and Vauxhall Company and which 14 by the Lambeth company?


* I can see how the CIs would become wider if we were considering ORs, where SE(ln(OR)) = sqrt(1/a + 1/b + 1/c + 1/d), which would become larger for smaller denominators (100 instead of 1000), etc. But again, are the rates we are given not attack rates?? Not incidence rates or odds ratios? If so, do we not use CI = exp{ln(RR) +/- 1.96 x sqrt(1/a + 1/b - 1/N1 - 1/N2)}??

see above..

Modern epi is at once simpler and more subtle than what we have to learn from most epi texts.

In Unit 7_B of (524-207G) Introduction to Epidemiology, Sept 03-06, 2002, you will see, in bold, a very modern depiction of the real purpose of 'controls' in a c-c study. Even though R+G has a reputation as a tough book to read, the intro chapters are quite straightforward, and I picked excerpts from them to show to the med students 4 years ago!!
-- tue jan 9 -- For problem 4, can you please tell me the proper code to calculate ORs and CIs by using R? I've been trying to figure it out by using page 25 of the epitools manual but I can't get it to work. I do not understand how to enter the 2x2 matrix as data.

PAGE 27 (see also some info on page 26)

=== for Rate Ratio ===

library(epitools)
# cases (41, 15) go in column 1, person-times in column 2; [2:1, ] reverses the
# rows so the unexposed row comes first (epitab treats the first row as the reference)
epitab(matrix(c(41, 15, 28010, 19017), 2, 2)[2:1, ], method = "rateratio", verbose = TRUE)



Even though Ayas et al call it an OR, I prefer to think of it as a rate ratio. This is, after all, a cohort study!

=== for OR ===

# same table; epitab's default method is the odds ratio
epitab(matrix(c(41, 15, 28010, 19017), 2, 2)[2:1, ], verbose = TRUE)
--mon jan 8 -- I have some questions about the 634 assignment. I read your post about #3 but I still don't understand where the O=2 comes from in the example of p.24... why not keep 1 or take 3...??

if O were 1, then to have an SIR of 3.5, the CI would be CI[based on 1 case] / E, with E = 1/3.5. That gives too wide a CI.
O = 3 (and the associated E) gives too narrow a CI
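You can check this with R's poisson.test, which gives the exact Poisson CI (a sketch of the 'reverse-engineering', not the authors' own calculation):

# for each candidate O, choose E so that the SIR is 3.5, then scale O's exact CI by E
for (O in 1:3) {
  E  <- O / 3.5
  ci <- poisson.test(O)$conf.int / E     # exact CI for O, turned into an SIR CI
  cat("O =", O, ": SIR 95% CI", round(ci[1], 1), "to", round(ci[2], 1), "\n")
}

Only the O = 2 interval has a width compatible with the one reported on p. 24; O = 1 is too wide and O = 3 too narrow, as above.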

In question 6, we have to hand-calculate the CI for some odds ratio by our USUAL manual way (not the multiplied-by/divided-by method)... I thought this was the only way to do it with an odds ratio (this is in fact the only method used by Rothman)!!!

there are several ways.. one if you use an unconditional approach (the one used by Rothman); different again if you use a conditional approach.

So maybe I just don't understand what we can treat as constant, E, in this case. Can we say that O=#exposed cases is Poisson while the rest of the ratio is E, ie denom.exp/(#unexp.cases/denom.unexp)=E and is constant?

Nothing to do with E here.. it's a comparison of 2 rates, with known PT denominators for each one (so the PTs are the 2 constants), but the two samples are closer in size, not like the Alberta one where the exposed sample is 1000 times smaller than the other.

In q 6, for Snow's data, the denominators are houses, not people (so like person-time in a way, or at least proportional to it), but we presume that the numbers of houses are KNOWN and not the result of some estimation (hard to believe that in the 1850s they would be able to be so exact... today we could ask who gets cable from Videotron and who from some other company, but then...?)
For problem #3, can you tell me which formula that the authors used to calculate the 95% CI's for the SIRs? I am going by the footnotes on page 24 that the CI is the CI for the observed divided by the expected, but I can't reproduce their CIs for the expected 1-3 groups by any of the formulas I have tried in the handouts.

I asked one of the authors (S Suissa) and he said another author, G Hill, did it, but it's a long time ago and I no longer have the report, so I can't say for sure.

One thing (if yours are NARROWER than theirs) is that they might have incorporated the fact that the reference area is not THAT much larger than the index area, and so there is some uncertainty in the E too.

But it could also be that they used yet another method again. We have two issues (exact vs. approximate), and also whether to use mid-p or conventional p, etc..

Also, for the test part of the same problem, will you let us simply conduct the test using the confidence interval?

Sorry, I did indeed couch it as a simple yes-significant with p < alpha, or no with p > alpha. But please calculate a p-value from 1st principles, since journal referees and bosses typically want to see a p-value. {Note that I said A p-value, not THE p-value.. again, there are exact and Gaussian-approximation p-values, and those that use mid-p (they count 1/2 the probability of the observed value, plus all of the probabilities of the more extreme values) or conventional p-values (they count ALL of the probability of the observed value) -- I have notes on this on my 607 website.. will try to locate them}
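A small R illustration of the two conventions for an upper-tail exact Poisson p-value (the O and E here are made-up numbers, purely for illustration):

O <- 2; E <- 0.57      # hypothetical observed and expected counts
p.conv <- 1 - ppois(O - 1, E)                   # ALL of Pr(O), plus the more extreme values
p.mid  <- 1 - ppois(O, E) + 0.5 * dpois(O, E)   # HALF of Pr(O), plus the more extreme values
c(conventional = p.conv, mid.p = p.mid)         # the mid-p version is always the smaller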
--sun jan 7 -- Assignment 1, q3 [ CI for SIR or SMR ]

On Friday last, I spent more time than I had planned going over the Poisson distribution and the idea of exact confidence intervals, so I didn't get time to remark on one type of 'epi data analysis' that Rothman does not cover fully in his simple-analysis chapter, namely tests/CIs for a SIR or SMR (q 3 on the first assignment). He covers standardization in his next chapter, but uses an example where the 2 compared amounts of experience are not that different in size. When one of the two is much larger than the other, statistical inference is simplified, and that's why I have it in this, rather than next week's, assignment.

Page 24 of the notes I handed out shows an example [which we had to 'reverse-engineer']. The important point is that the CI reflects only the uncertainty in the O, i.e. the top of the ratio, since the bottom (the E) reflects a scaled-down version of a much larger experience. One useful way to look at it is to say that the SIR of 3.5 in that example is really a rate ratio of the classical a/PT vs b/PT type you will deal with in q's 4(i), 5 and 6, but with a << b.

In this example, if we now assume that the Rest-of-Ontario-Children-Years 1964-1985 are, say, 1000 times the Douglas-Point-Children-Years (DPCY) for the same period (they are probably more than 1000 times larger!), then we have 3.5 = (2/DPCY) / (570/1000DPCY). So in Stata you can put: iri 2 570 1 1000 to get the CI. To convince yourself, also put in iri 2 5700 1 10000. Even iri 2 57 1 100 isn't that far off -- as long as the number of cases in one rate is much larger than the number of cases on which the other is based, the uncertainty in the rate ratio (CIR) is virtually all from the rate based on the smaller no. of cases.

Note: if you are going the (free) 'R' route, the command in Epitools is just like iri in Stata, but with the opposite order for the compared categories, i.e.

library(epitools)
# rateratio(c(unexposed cases, exposed cases, unexposed PT, exposed PT))
rateratio(c(570, 2, 1000, 1))   # e.g. the p. 24 example, with the 1000:1 PT ratio above

Besides their different layouts, you will also find some statistical discrepancies between Stata and Epitools, because they use slightly different principles. I will report back to you on whether I think it is just that, or whether one of them is also more trustworthy.

For CIs for an SIR, one can always take the CI from the appropriate "O" row of the table on p 27, then simply divide it by E. And for exact p-values for tests, one can use the Table of Poisson probabilities (handed out, and on the website), or the dpois function in R, or the POISSON function in Excel, or -- when the mean or expected number is large enough -- the Normal approximation.
--sun jan 7 -- Q (raised in class )

Is it true that PoissonProb(9 events | mu=10) = PoissonProb(10 events | mu=10)?

A: Yes

From 1st principles..

PoissonProb(9 events | mu=10) = exp[-10] x (10^9) / 9!

PoissonProb(10 events | mu=10) = exp[-10] x (10^10) / 10!

The extra power of mu (10^10 instead of 10^9) is offset by the extra factor (10) in the 10! in the divisor
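A one-line check in R:

dpois(9, 10)    # 0.12511...
dpois(10, 10)   # exactly the same value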

It also goes to show why the Poisson distribution is never 'quite' symmetric, even if the mean is large; if mu is small, the distribution has quite a long right tail.

Try to avoid the terms left-skewed and right-skewed, because half the world interprets left skew one way and the other half interprets it the other way! If you say long right or long left tail, everyone will be on same page with you.


Since you took the time to look, here is a link with a few statistics jokes. I like the last one, about the statisticians and the epidemiologists on the train.

http://personal.ecu.edu/wuenschk/humor/StatHumor.htm