McGill University, Department of Epidemiology,
Biostatistics and Occupational Health
EPIB-634: Survival Analysis and related Topics (Winter 2007)
Frequently Asked Questions
(FAQs), Clarifications etc...
Question / Suggestion
Response / Clarification
-- wed jan 10 -- Question 5,
I do not understand what you mean by quasi-denominators. What is the rationale behind
dividing the original counts of cases by the counts of cases in the new sample ("survey")?
I did not suggest dividing "by the counts of cases in the new sample ('survey')".
I suggested going to 100 RANDOMLY SELECTED HOUSES to find out about the water supply.
It's as if we had one team who worked on getting the right numerators (286 and 14), and
another on the denominators. In fact the companies provided the denominators (say
N1 and N0 houses) the YEAR before the epidemic, for other purposes. But IF we didn't
have that survey, Snow could have gone on foot to 100 randomly selected houses (without
regard to whether they contained one or more cases!), or to 1000 houses, etc.
i.e. he could take every 50th house. Then with n1 and n0 houses (1 = exposed,
0 = not exposed) he could reason:
for the exposed and the unexposed we have an
ESTIMATED PT denominator of 50 x n1 (houses) x 7 persons per house x 4 (weeks)
ESTIMATED PT denominator of 50 x n0 (houses) x 7 persons per house x 4 (weeks)
Now divide the 286 and the 14 by these two, take the ratio of the results, and what
do you get?
It looks like (AFTER CANCELLATION OF THE 50 AND 7 AND 4)
(286 / n1) divided by (14 / n0)
n1 and n0 are 50 times too small; they are only partial denominators.
But in the final result it appears as though they are real denominators. They aren't,
of course, so we might call them quasi-denominators.
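The cancellation can be checked numerically. This is a minimal sketch: the 286 and 14 are the case counts from the assignment, but n1 = 60 and n0 = 40 are made-up survey counts, and the 7 persons per house and 4 weeks are the rough figures used above.

```python
# Hypothetical 100-house survey split (n1, n0 are illustration numbers only);
# the 286 and 14 cholera deaths are the numerators discussed above.
cases_exp, cases_unexp = 286, 14
n1, n0 = 60, 40                      # exposed / unexposed houses found

sampling_factor = 50                 # e.g. every 50th house was visited
persons_per_house = 7
weeks = 4

# Full estimated person-time denominators:
pt_exp = sampling_factor * n1 * persons_per_house * weeks
pt_unexp = sampling_factor * n0 * persons_per_house * weeks

rate_ratio_full = (cases_exp / pt_exp) / (cases_unexp / pt_unexp)

# After the 50, 7 and 4 cancel, only the quasi-denominators remain:
rate_ratio_quasi = (cases_exp / n1) / (cases_unexp / n0)

print(rate_ratio_full, rate_ratio_quasi)  # identical
```

Whatever n1 and n0 the survey happens to find, the two expressions agree exactly, which is the whole point of calling n1 and n0 quasi-denominators.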
For part ii, if I use the following margin of error to calculate the CIs:
exp[z * sqrt(1/a + 1/b + 1/c + 1/d)]
then the CIs should narrow as the sample size decreases, right?
1/a and 1/b don't change, but 1/c and 1/d (and thus the SE of the log) get smaller
as c and d get bigger;
so, surely, it is the other way round.
Is this the wrong formula?
NO. And it accounts for the fact that you sampled the denominators.
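A quick numerical check of this direction of change, with hypothetical exposed:unexposed splits for surveys of different sizes (the 286 and 14 are from the assignment; the house counts are invented for illustration, with the full denominators taken as roughly the 40K and 26K mentioned later in this FAQ):

```python
from math import sqrt, exp

a, b = 286, 14   # case counts (fixed, whatever the survey size)

def mult_margin_of_error(c, d, z=1.96):
    # multiplicative margin of error for the rate ratio when the
    # denominators are estimated from c and d sampled houses
    return exp(z * sqrt(1/a + 1/b + 1/c + 1/d))

# hypothetical splits, scaled to different survey sizes
print(mult_margin_of_error(600, 400))      # 1000 houses: narrower CI
print(mult_margin_of_error(60, 40))        # 100 houses: wider CI
print(mult_margin_of_error(40000, 26000))  # (near-)full denominators: 1/c + 1/d ~ 0
```

The margin shrinks toward the full-denominator value as the survey grows, so the CIs narrow as the sample size increases, not decreases.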
Question 6. We did not learn how to calculate CIs for ratios
by a usual manual way in 607. We did calculate CIs for parameters such as means and
proportions. For those CIs, the general formula was parameter estimate + or - z times
SE.
If I use that method, then I would take the ratio and + or - 2 times the SE for a
95% CI.
In your example on page 2, the SE = 2.5 for the ratio of 13.1. So the CI calculated
using that method would be 13.1 - 5, 13.1 + 5, or (8.1, 18.1), which is very different
from the CI of (5.2, 32.8). Is this the right approach?
the 2.5 is a <<multiplicative>> MARGIN of error, not a STANDARD error
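To see the difference concretely (the 13.1 and 2.5 are the numbers from the page-2 example quoted above):

```python
rr = 13.1    # point estimate of the ratio
mf = 2.5     # MULTIPLICATIVE margin of error (not a standard error)

# Treating 2.5 as an additive SE gives the (wrong) interval in the question:
additive = (rr - 2 * mf, rr + 2 * mf)      # about (8.1, 18.1)

# Used correctly, a multiplicative margin divides and multiplies:
multiplicative = (rr / mf, rr * mf)        # (5.24, 32.75), i.e. the (5.2, 32.8)

print(additive, multiplicative)
```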
I was unfortunately unable to attend the tutorial you gave this afternoon because
I had an EPI class at that time. Are you available in Purvis anytime tomorrow afternoon
after 1pm?
Yes, I will be there... Come see me. It's indeed better face to face.
I am getting pretty confused with questions 5 and 6, and it may be better if I
asked my questions face-to-face. If not, any help you can provide by e-mail would
be appreciated.
I think that part of what is playing out here is that, from an epi point of view,
point estimates are easy and don't need to involve a statistical 'model' (it's just
intuitive). But CIs do, because they require us to calculate variances, and variances
need a model. Of course there are also lots of new stats ideas too, quite apart from
the epi.
-- wed jan 10 --
For #6 are you implying that the usual way would be
to just add and subtract the MOE for the log rate ratio?
Yes indeed. The usual way is what Rothman suggests: work out the CI on the log scale,
i.e. with +/- the MOE, then exponentiate both limits.
Or would the "usual way" be to take the log of the MOE?
Not sure how this would work. Where would one find an MOE to take a log of (pardon
my grammar!)?
For #8 I've done the computing portion, but what is
the formula for an exact CI for the rate ratio? I've read page 29 several times and
do not see any of the formulas state that they are "exact." Pages 137-139
of Rothman also do not mention whether or not the formula provided is "exact."
It is exact in the sense that one can use an exact Binomial CI to get there.
To obtain a CI, we treat the proportion of exposed cases, 0.732, as a binomial proportion,
based on 41 "positives" out of a total of 56 cases (obviously, if the proportion
were based on 8 exposed cases out of 11 cases, or 410 out of 560, the precision would
be very different!)
From a table or other source of EXACT CIs for proportions (see e.g. the table on the
607 web page), one can determine that the 95% CI for p is p_L = 0.596 to p_U = 0.842.
Substitute these for the point estimate to get
RR_L = (0.596 / 0.404) / (28010/19017) = 1.00
RR_U = (0.842/0.158) / (28010/19017) = 3.61
Rothman & Walker emphasize the formula RR_{L,U} = {p_{L,U} / (1 - p_{L,U})} / {n1/n0}
rather than the basis for it.
The point of this test is that by conditioning on the sum of 2 Poisson random variables,
the distribution of one of them is BINOMIAL (it's a simple result to prove in a math-stat
course).
So we have gone from 2 dimensions to 1, and the Binomial lends itself to EXACT methods.
(We didn't cover exact Binomial CIs, as I thought they were covered in 607.)
but what is the formula for an exact CI for the rate ratio?
RR_one limit = (P_upper / (1 - P_upper)) / (n1/n0)
RR_other limit = (P_lower / (1 - P_lower)) / (n1/n0)
with P_lower and P_upper being the EXACT limits based on observing a binomial proportion
c1 / (c1 + c0) in a sample of size (c1 + c0)
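Putting those two limits together in a short sketch (case counts and person-times are from the Ayas et al. example above; the exact binomial limits 0.596 and 0.842 are taken from the table, not recomputed here):

```python
c1, c0 = 41, 15           # exposed / unexposed cases
pt1, pt0 = 28010, 19017   # person-time denominators

# exact 95% binomial limits for p = c1/(c1+c0) = 41/56, read from a table
p_L, p_U = 0.596, 0.842

def rr_limit(p):
    # convert a limit for the conditional binomial proportion
    # into a limit for the rate ratio
    return (p / (1 - p)) / (pt1 / pt0)

print(rr_limit(p_L), rr_limit(p_U))
```

This reproduces the 1.00 and 3.61 of the worked example, up to rounding of the tabulated binomial limits.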
-- tue jan 9 --
In question
3 you tell us to concentrate
on one cell (33 and 36.3). In order to calculate a chi-squared stat, do we use 1
degree of freedom? Since we usually would use (r-1)(c-1), to me this gives zero degrees
of freedom but that makes no sense.
These rules sometimes break down.
And chi-sq isn't the only test.
Remember chi-sq with 1 df is just the square of z. Why not z = (O - E)/sqrt(E)?
That's legit: O given E is Poisson, but close to Gaussian (36.3 is large).
Now square it and you get z^2 = chi-sq(1) = (O - E)^2 / E ....... Familiar?
A way to understand these is to look at my notes from 607, where I speak of 2 x 2
tables, 2 x 1 tables, 1 x 2 tables and even 1 x 1 tables (the one we have here).
See p. 9 of my notes on chi-sq for 607 (all my notes for 607 are on the 607 website).
Moral: (in second term) there is more than one way to look at 1 number and the null
value it's compared with. Same with larger tables!! Even in first term, there were
several ways to skin a cat. I never liked teaching chi-sq for a comparison of 2 random
proportions or numbers with each other, or of one with a fixed value, because the
z-test does the same thing, and doesn't lose the sign. Squaring does.
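For the single observed/expected cell of question 3, the z and chi-square versions can be written out directly (33 and 36.3 are the cell values mentioned in the question above):

```python
from math import sqrt

O, E = 33, 36.3

z = (O - E) / sqrt(E)   # O given E is ~Poisson(E), close to Gaussian here
chi_sq = z ** 2         # identical to the familiar (O - E)**2 / E with 1 df

print(z, chi_sq)        # z keeps the sign; squaring loses it
```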
-- tue jan 9 --
For question 5, regarding John Snow's rates and intervals, I
actually found the opposite trend to what you said we would find.
My CIs for 1000 were narrower, and for 100 were even narrower. This could be because
of the formula I used for the SE:
SE(ln(RR)) = sqrt(1/a + 1/b - 1/N1 - 1/N2)
* I considered the rates to be risk ratios and not odds ratios or incidence rate
ratios (e.g. we are looking at number of cases / no. at risk, and this is a risk
ratio, is it not?)
NO, even with full denominators, these
are not risks. The full denominators are (people in this many households x 4 weeks).
And so if we take, say, 7 persons per house (as Snow had to do until he got numbers
of people later), we have approx 40K x 7 x 4 person-weeks and 26K x 7 x 4 person-weeks.
By your logic, if we took a sample of 25 houses, we would have binomial denominators
of 15 and 10, and NEGATIVE variances, because one denominator (10) is less than the
numerator (14). But in fact a denominator sample of 25 yields a proper estimate, albeit
a very imprecise one, because in addition to the fact that the 286 and 14 are random,
now the 10:15 estimate of the ratio of exposed:unexposed PT is very imprecise to
start with. Think of the 1/c + 1/d as the price for partial denominators rather
than full ones, i.e. if c -> 40K and d -> 26K, these 2 extra prices would become very
small.
In fact, think of c and d as substitutes for the 2 person-times. And note that even
though ad/bc has the classic odds-ratio form, in fact its 'true structure' is (a / partial
PT denom) / (b / partial PT denom). And c and d are not even in the same units as a
and b!!
i.e. think of it as
ad/bc = (a/b) / (c/d) = (a/c) / (b/d), which is AN ESTIMATE of (a / full PT denom) / (b / full PT denom), and
(a / full PT denom) / (b / full PT denom) is an estimate of the true RATE ratio.
Out of interest, do you know how Snow established which 286 of the 300 in the numerator
lived in houses supplied by the Southwark and Vauxhall Company and which 14 by the
Lambeth company?
* I can see how the CIs would become
wider if we were considering ORs, where SE(ln(OR)) = sqrt(1/a + 1/b + 1/c + 1/d),
which would become larger for smaller denominators (100 instead of 1000), etc. But
again, the rates we are given, are they not attack rates? Not incidence rates or
odds ratios? If so, do we not use CI = exp{ln(RR) +/- 1.96 x sqrt(1/a + 1/b - 1/N1
- 1/N2)}?
See above.
Modern epi is at once simpler and more subtle than what we have
to learn from most epi texts.
In Unit 7_B (524-207G) Introduction to Epidemiology, Sept 03-06, 2002, in bold, you
will see a very modern depiction of the real purpose of 'controls' in a c-c study.
Even though R+G has a reputation as a tough book to read, the intro chapters are
quite straightforward, and I picked excerpts from them to show to the med students
4 years ago!!
-- tue jan 9 -- For problem 4,
can you please tell me the proper code to calculate ORs and CIs by using R? I've
been trying to figure it out by using page 25 of the epitools manual but I can't
get it to work. I do not understand how to enter the 2x2 matrix as data.
PAGE 27 (see also some info on page 26)
=== for RateRatio ==
epitab(matrix(c(41, 15, 28010, 19017), 2, 2)[2:1, ], method = "rateratio", verbose = TRUE)
Even though Ayas et al call it an OR, I prefer to think of it as a rate ratio. This is,
after all, a cohort study!
=== for OR ==
epitab(matrix(c(41, 15, 28010, 19017), 2, 2)[2:1, ], verbose = TRUE)
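For anyone not using R, the rate-ratio point estimate (and a Rothman-style log-scale Wald CI) can be reproduced by hand. This sketch is only a cross-check of the arithmetic, not the epitab output itself:

```python
from math import sqrt, log, exp

c1, c0 = 41, 15           # cases, exposed / unexposed
pt1, pt0 = 28010, 19017   # person-time

rr = (c1 / pt1) / (c0 / pt0)

# Wald CI on the log scale, then exponentiate both limits
se = sqrt(1 / c1 + 1 / c0)
lo, hi = exp(log(rr) - 1.96 * se), exp(log(rr) + 1.96 * se)

print(rr, (lo, hi))   # point estimate ~1.86
```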
--mon jan 8 -- I
have some questions about the 634 assignment. I read your post about #3 but I still don't understand where the O=2 comes from in
the eg. of p.24... why not keep 1 or take 3....??
if O were 1, then to have an SIR of 3.5, the CI would be CI[based on 1 case] / E,
with E = 1/3.5. It gives too wide a CI.
O=3 (and associated E) gives too narrow a CI
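The exact machinery behind that comparison can be sketched from first principles, using only the standard library; the bisection reproduces the exact Poisson limits that a chi-square table would give. O = 2 and SIR = 3.5 are from the p. 24 example; everything else is generic:

```python
from math import exp, factorial

def pois_cdf(k, mu):
    # P(X <= k) for X ~ Poisson(mu), from first principles
    return sum(exp(-mu) * mu**i / factorial(i) for i in range(k + 1))

def exact_poisson_ci(o, conf=0.95):
    # exact CI for a Poisson mean given o observed events, by bisection
    alpha = 1 - conf
    def solve(too_small, lo, hi):
        for _ in range(100):
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if too_small(mid) else (lo, mid)
        return (lo + hi) / 2
    # lower limit: the mu with P(X >= o | mu) = alpha/2
    lower = 0.0 if o == 0 else solve(lambda m: 1 - pois_cdf(o - 1, m) < alpha / 2, 0.0, 50.0)
    # upper limit: the mu with P(X <= o | mu) = alpha/2
    upper = solve(lambda m: pois_cdf(o, m) > alpha / 2, 0.0, 50.0)
    return lower, upper

O, SIR = 2, 3.5
E = O / SIR                  # the reverse-engineered E of the example
mu_lo, mu_hi = exact_poisson_ci(O)
print(mu_lo / E, mu_hi / E)  # exact CI for the SIR: roughly (0.4, 12.6)
```

Note how wide the interval is with only 2 observed cases, which is why it matters whether O is 1, 2 or 3.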
In question 6, we have to hand-calculate the CI for some odds ratio by our USUAL manual
way (not the multiplied-by/divided-by method)... I thought this was the only way to
do it with an odds ratio (this is in fact the only method used by Rothman)!!!
There are several ways: one if you use an unconditional approach (used by Rothman);
if you use a conditional approach, different again.
So maybe I just don't understand what we can treat as constant, E, in this case.
Can we say that O=#exposed cases is Poisson while the rest of the ratio is E, ie
denom.exp/(#unexp.cases/denom.unexp)=E and is constant?
Nothing to do with E here. It's a comparison of 2 rates, with known PT denominators
for each one (so the PTs are the 2 constants), but the two samples are closer in size,
not like the Alberta one, where the exposed sample is 1000 times smaller than the other.
In q 6, for Snow's data, the denominators are houses, not people (so like person-time
in a way, or at least proportional to it), but we presume that the numbers of
houses are KNOWN and not the result of some estimation (hard to believe that in the
1850s they would be able to be so exact... today we could ask who gets cable from
Videotron, and who from some other company, but then...?)
For problem #3, can you tell me which formula that the authors
used to calculate the 95% CI's for the SIRs? I am going by the footnotes on page
24 that the CI is the CI for the observed divided by the expected, but I can't reproduce
their CIs for the expected 1-3 groups by any of the formulas I have tried in the
handouts.
I asked one of the authors (S Suissa) and he said another author, G Hill, did it, but
it was a long time ago and I no longer have the report, so I can't say for sure.
One thing (if yours are NARROWER than theirs) is that they might have incorporated
the fact that the reference area is not THAT much larger than the index area, and
so there is some uncertainty in the E too.
But it could also be that they used yet another method again. We have two issues
(exact vs. approximate), and also whether to use mid-p or conventional p, etc.
Also, for the test part of the same problem, will you let us simply conduct the test
using the confidence interval?
Sorry, I did indeed couch it as a simple yes (significant, with p < alpha) or no (p >
alpha). But please calculate a p-value from 1st principles, since journal referees
and bosses typically want to see a p-value. {Note that I said A p-value, not THE
p-value: again there are exact and Gaussian-approx p-values, and those that use
mid-p (they count 1/2 the probability of the observed, plus all of the probabilities
for the more extreme values) or conventional p-values (they count ALL of the probability
of the observed) -- I have notes on this on my 607 website; will try to locate them.}
--sun jan 7 -- Assignment
1, q3 [ CI for SIR or SMR ]
On Friday last, I spent more time than I had planned going over the Poisson distribution
and the idea of exact confidence intervals, so I didn't get time to remark on one
type of 'epi data analysis' that Rothman does not cover fully in his simple-analysis
chapter, namely tests/CIs for an SIR or SMR (q 3 on the first assignment). He covers
standardization in his next chapter, but uses an example where the 2 compared amounts
of experience are not that different in size. When one of the two is much larger
than the other, statistical inference is simplified, and that's why I have it in
this, rather than next week's, assignment.
Page 24 of the notes I handed out shows an example [which we had
to 'reverse-engineer']. The important point is that the CI reflects only the uncertainty
in the O, i.e. the top of the ratio, since the bottom (the E) reflects a scaled-down
version of a much larger experience. One useful way to look at it is to say that
the SIR of 3.5 in that e.g. is really a rate ratio of the classical a/PT vs b/PT
type you will deal with in q's 4(i), 5 and 6, but with a << b.
In this example, if we now assume that the Rest-of-Ontario-Children-Years 1964-1985
are, say, 1000 times the Douglas-Point-Children-Years (DPCY) for the same period (they
are probably more than 1000 times larger!), then we have 3.5 = (2/DPCY) / (570/
1000DPCY). So in Stata you can put: iri 2 570 1 1000 to get the CI. To convince yourself,
also put in iri 2 5700 1 10000. Even iri 2 57 1 100 isn't that far off -- as long
as the number of cases in one rate is much larger than the number of cases on which
the other is based, the uncertainty in the rate ratio (CIR) is virtually all from
the rate based on the smaller no. of cases.
Note: if you are going the (free) 'R' route, the command in Epitools is just
like iri in Stata, but with the opposite order for the compared categories, i.e.
library(epitools)
rateratio(c(#unexposed cases,#exposed cases,amount of unexposed PT, amount of exposed
PT))
Besides their different layouts, you will also find some statistical discrepancies
b/w Stata and Epitools, because they use slightly different principles. I will report
back to you on whether I think it is just that, or whether one of them is more trustworthy.
For CIs for an SIR, one can always take the CI from the appropriate "O" row
of the table on p 27, then simply divide it by E. And for exact p-values for tests, one
can use the (handed out, and on the website) Table of Poisson probabilities (or the
dpois function in R, or the POISSON function in Excel), or -- when the mean or expected
number is large enough -- the Normal approximation.
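The iri 2 570 1 1000 approximation above can be checked with simple arithmetic (the 1000x scaling of the reference experience is the hypothetical one assumed above):

```python
o_index, o_ref = 2, 570      # observed cases: Douglas Point vs rest of Ontario
pt_index, pt_ref = 1, 1000   # person-time in units of DPCY (hypothetical 1000x scaling)

rr = (o_index / pt_index) / (o_ref / pt_ref)
print(rr)   # ~3.51, essentially the SIR of 3.5
```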
--sun jan 7 -- Q
(raised in class )
Is it true that PoissonProb(9 events | mu=10) = PoissonProb(10 events | mu=10)?
A: Yes
From 1st principles..
PoissonProb(9 events | mu=10) = exp[-10] x (10^9) / 9!
PoissonProb(10 events | mu=10) = exp[-10] x (10^10) / 10!
The extra power of mu (10 instead of 9) is offset by the extra term (10) in the 10!
in the divisor
It also goes to show why the Poisson distribution is never 'quite' symmetric, even
if the mean is large; if mu is small, the distribution has quite a long right tail.
Try to avoid the terms left-skewed and right-skewed, because half the world interprets
left skew one way and the other half interprets it the other way! If you say long
right tail or long left tail, everyone will be on the same page with you.
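The first-principles calculation above is easy to verify with a few lines (this mirrors R's dpois):

```python
from math import exp, factorial

def dpois(k, mu):
    # Poisson probability mass function, from first principles
    return exp(-mu) * mu**k / factorial(k)

mu = 10
print(dpois(9, mu), dpois(10, mu))
# equal: the extra factor of 10 in mu**10 is offset by the extra 10 in 10!
```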
Since you took the time to look, here is a
link with a few statistics jokes. I like the last one, about the statisticians and
the epidemiologists on the train.
http://personal.ecu.edu/wuenschk/humor/StatHumor.htm