BIOS601 AGENDA: Wednesday September 13, 2017
[updated Sept 05, 2017]
Agenda for September 13, 2017
- Discussion of issues
Notes and assignment on mean/quartile of a quantitative variable
Answers to be handed in for:
Exercises 0.1, 0.3, 0.5, 0.12, 0.13, 0.14
(JH omitted 0.6, used for several past years, since it is similar to one of the 3
new questions (0.13-0.15) added recently. Can you tell which of the new ones
it is similar to?)
Remarks on Notes:
These notes are based on those JH developed for the course Principles of Inferential Statistics in Medicine,
which JH taught to incoming students in the epidemiology graduate program
1980-1993. As such,
they emphasize the 'end-product', rather than how the product was arrived at.
In bios601, we will emphasize both.
Section 1 Notice the focus on the shape of a very specific type of distribution,
or random variable,
namely that based on a sum/mean/combination of
several (usually n) other (usually i.i.d.) random variables.
This shape is often very different from that of the 'component' random variables.
You might say that the 'component' random variables have distributions
generated (made) by nature, whereas those generated by aggregation of samples
are 'man'-made i.e., 'researcher'-made distributions, with shapes determined
in (a small, or even negligible) part by the shape of the
distribution of the individual r.v.'s, but in
(a large, and dominant) part by the
number of, and degree of independence between the
individual r.v.'s aggregated.
Notice also, in section 1.3, the different terminology used for the standard
deviation (SD) of an individual component r.v., and the standard
error (SE) of the aggregate r.v. Also, an SE
typically involves a 'plug-in' estimate, and always (even if it is not explicit) a
1/sqrt(n) multiplier of the 'unit' standard deviation. Comfort with, and appreciation of, the
central role of SE's is a prerequisite for work in applied statistics. When teaching the
epidemiology students, JH used to say "the SD is for the
variation of individuals;
the SE is for the (sampling) variation of a statistic."
Think of a statistic as an observable quantity calculated
from a sample of n observations. Think of a
parameter as an unobservable (but estimable) quantity
relating to a physical population,
such as the average or median depth of the ocean, or concentration of radon in
Canadian homes, or (if the object
of inference is an individual person) the average amount of that person's
physical activity, or level of blood pressure, or time spent indoors,
over a year;
or as an unobservable
(but estimable) constant of nature, such as the speed of light, or the ratio
of the volume of a sphere to the cube of its radius.
DIGRESSION on historical (statistical) uses of the mean
Some material from a new,
entertaining, and interesting book -- by a master of storytelling.
The notes (and particularly the examples) make clear that
the number of individual r.v.'s aggregated
has a central role
in the shape of
the aggregate's sampling distribution. The centrality and importance of the Central
Limit Theorem (CLT) in applied statistics cannot be over-emphasized.
In theoretical statistics, it is often presented as a mere
mathematical result, and seldom (as 'Student' said in a different context)
is one given any sense of when (i.e., at what 'n') the law 'kicks in.' JH has the impression,
from some PhD theses of statistics students he has examined, that
the 'holy grail' is simply to establish asymptotic normality,
with no consideration as to at what 'n' this is an acceptably accurate
approximation, or whether some change of scale (e.g., log or logit)
might not make for a better normal approximation.
He vividly remembers one PhD student, who, in his effort to
prove asymptotic normality, used a Lemma that stated that if the log of
the r.v. goes to Normal, so must the r.v. itself.
What the student and supervisor hadn't bothered to discover (but what a few plots
would have readily shown)
is that in the case in point, the log of the r.v. goes to normal
faster than does the r.v. itself!
Unfortunately, it takes a bit of experience
watching the interplay ('battle') between the degree of
non-normality of the individual (component) r.v.'s and the size of n,
to develop some intuition and appreciation for the 'n' at which
the CLT kicks in. The point is that it is context-specific:
with well-behaved individual (component) r.v.'s, it happens well before
the n=30 threshold that many courses teach. Indeed, when doing simulations,
one can subtract 6 from the sum of just 12 U(0,1) r.v.'s
and use the result as an acceptably accurate
(i.e., accurate enough for government work)
N(0,1) r.v. Sometimes
(as in the case of the 'insurance premiums' example), it takes an
n in the 100s or 1000s or more.
Thus, it is worth exploring as many
of the CLT examples, and the simulations, as you have time for.
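The 'sum of 12 uniforms, minus 6' trick mentioned above is easy to check for yourself.
What follows is only a minimal sketch in Python (standard library only); the seed, the
number of draws and the function name are arbitrary choices made for this illustration.
Each U(0,1) has mean 1/2 and variance 1/12, so the sum of 12 has mean 6 and variance 1.

```python
import random
import statistics


def approx_std_normal(rng):
    """One approximate N(0,1) draw: the sum of 12 U(0,1) draws, minus 6."""
    return sum(rng.random() for _ in range(12)) - 6.0


rng = random.Random(601)  # fixed seed (arbitrary) so the run is repeatable
draws = [approx_std_normal(rng) for _ in range(100_000)]

mean = statistics.fmean(draws)
sd = statistics.pstdev(draws)
# Share of draws within 1.96 of zero; an exact N(0,1) gives 0.95.
within = sum(abs(z) <= 1.96 for z in draws) / len(draws)

print(f"mean = {mean:.3f}   sd = {sd:.3f}   P(|Z| <= 1.96) = {within:.3f}")
```

Swapping the uniform for a very skewed component (and 12 for a larger n) is one way to
watch the 'battle' between non-normality and n described above.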
If you are really short of time, and can only look at one, look
at 'The Central Limit Theorem in Action' in the Graphs Figures Tables
Computing section of the
Resources website. JH used this example (but without numbers) when he was
co-teaching with, and handing over 607 to, Lawrence Joseph, and Lawrence
(and his very artistic wife) gave it numerical and artistic expression.
The example also shows that
the CLT is not limited to sums or means of i.I.d. rv's,
but also applies to i.NOT-I.d. rv's. The key property is the first
'i', the INDEPENDENCE,
since it is the independence that helps in the
Law of Cancellation
of Extremes (JH's name for the CLT!).
Nothing very remarkable here (mostly
manipulating formulae!) except to say that, on re-reading these
Notes, JH would now add a qualifier that the formulae
work correctly only for independent (or, at a minimum, uncorrelated) r.v.'s.
For a good example of 'several estimates of a single parameter'
think of several independent estimates of the speed of light,
or in the Cavendish example, of the density of the earth.
In this case, the weights should be precision-based.
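A minimal sketch of what 'precision-based' weights could look like, assuming the usual
choice of weight = 1/SE^2 for independent estimates. The function name and the three
estimates and SE's below are made up for illustration; they are not Cavendish's data.

```python
import math


def precision_weighted_mean(estimates, std_errors):
    """Combine independent estimates of one parameter, using weights 1/SE^2.

    Returns the pooled estimate and its standard error.
    """
    weights = [1.0 / se**2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical numbers, for illustration only (not real measurements).
estimates = [5.40, 5.52, 5.47]
std_errors = [0.10, 0.05, 0.20]

pooled, pooled_se = precision_weighted_mean(estimates, std_errors)
print(f"pooled estimate = {pooled:.3f}, SE = {pooled_se:.3f}")
```

The pooled SE is smaller than the SE of even the most precise single estimate, and the
least precise estimate gets the least say.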
Student's problem was not about the n at which the CLT kicks in
[he was already assuming the component r.v.'s are N(,)]
but about when the sample standard deviation (s) is a good substitute (proxy) for
the 'true' but unknown 'population' standard deviation (sigma):
"no one has yet told
us very clearly where the limit between
'large' and 'small' samples is to be drawn."
In connection with the 100th anniversary of Student's
ground-breaking 1908 paper, JH and colleagues went back to
that paper, and to the way he mathematically derived
what is now called the 't' distribution.
Full details, including why Gosset called himself 'Student',
and his simulations to check out his shaky algebra,
can be found at Article/Material in connection with 100th
Anniversary of Student's 1908 paper.
The 2 persons in the photo
with JH at the reception at the Guinness Brewery in Dublin in 2008
are the grandson and granddaughter of William Gosset ('Student').
At the unveiling of the plaque, the grandson told us that he
was pretty sure he alone, of those assembled in 2008, had personally
met Gosset. He was 6 months old when he was brought in to
the hospital to see his grandfather, a few months before the grandfather died
in 1937 in London (Gosset was English-born and educated, but worked
for Guinness in the Dublin HQ from 1899 onwards. In 1935,
he moved to London to take charge of the scientific side of production,
at a new Guinness brewery at Park Royal in North West London,
but died just two years later, at the age of 61).
Interestingly, that 1908 paper was of limited use, since it dealt only
with 1-sample problems. It took Fisher's insights in the 1920s
to generalize it to not just 2-sample problems, but also
correlation and regression, indeed to any context where one was dealing with
a ratio of a mean or correlation or slope to its standard error;
in turn, the SE involved the sqrt of an independent
plug-in estimate of the unit variance. Fisher called the no.
of independent contributions to that estimate the "degrees of freedom".
In this context, JH usually defines the "d.f." as "the
number of independent estimates of error": think of the
number of independent residuals (which one "pools" to
get one overall estimate of sigma-squared) as a case in point. It
is no different in spirit from pooling the squared within-group
deviations from their own means
[they are also residuals, from each fitted (ie group) mean].
In "Another worked Example, with graphic", JH is trying to get
statisticians and their collaborators to use a better way to display
paired data: the usual presentations involve separate SE's for the two
means, as though the one mean was from one sample of n, and the other
from an entirely separate (independent) sample of n.
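To see why separate SE's mislead with paired data, here is a small sketch with
hypothetical before/after values on the same 6 subjects (the numbers are invented for
this illustration). It compares the SE of the mean difference computed from the
within-pair differences with the SE one would get by (wrongly) treating the two means
as coming from independent samples.

```python
import math
import statistics

# Hypothetical before/after measurements on the same 6 subjects.
before = [120.0, 135.0, 110.0, 142.0, 128.0, 150.0]
after = [116.0, 130.0, 108.0, 137.0, 125.0, 144.0]
n = len(before)

# Appropriate for paired data: work with the n within-pair differences.
diffs = [b - a for b, a in zip(before, after)]
mean_diff = statistics.fmean(diffs)
se_paired = statistics.stdev(diffs) / math.sqrt(n)

# What separate SE's imply: the two means treated as independent samples.
se_before = statistics.stdev(before) / math.sqrt(n)
se_after = statistics.stdev(after) / math.sqrt(n)
se_unpaired = math.sqrt(se_before**2 + se_after**2)

print(f"mean difference        = {mean_diff:.2f}")
print(f"SE, paired analysis    = {se_paired:.2f}")
print(f"SE, as if independent  = {se_unpaired:.2f}")
```

With strongly correlated pairs like these, the between-subject variation cancels in the
differences, so the paired SE is much the smaller of the two.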
In other years, we left sample size and precision issues
until later in the course, where we planned to deal with
them 'en masse.' But in many years we never had the time
at the end of the course. So this year, following on from
the calculations you did
with the step-counter data,
Q 0.11 will ask you to visit this section, and Figure 4 in particular.
Remarks on assigned exercises
0.1 (Head sizes) Beginning biostatisticians often find it
difficult to give a rough guess for what the SD should be. In the case
of heights of adult (fe)males,
JH asks them to think of, say, the middle 95% of the distribution, i.e., from
someone 'quite short' to someone 'quite tall', then to equate this range
with 4SDs and back-calculate. For neck or waist or head sizes, do likewise.
A favourite question that he uses on interviews for statistical
research assistants is to give the candidate a number for the SD and ask if it seems
reasonable. For example, does the range of
left middle finger lengths
in Table III of the excerpt from Macdonell's (1901) report seem reasonable to you?
Remember, you have seen a lot of people, and their fingers, over your lifetime,
so you are (or should be) an expert in this topic, and able to
come up with a rough SD, and in the correct units!
In the case of head sizes, think of hat sizes.
And if you like, don't think of an absolute SD, but of an SD as a percentage
of the mean. For height, what would it be? For weight (over which
people have more influence!), would it be a bigger percentage?
Would it be a bigger percentage for finger length or height?
What would this percentage (also called the coefficient of variation, or 'CV')
be for the head sizes, measured in utero by ultrasound, of babies
of 13 weeks' gestation? 26 weeks' gestation? 40 weeks'? The much smaller CV at 13 weeks is used
to 'date' the pregnancy, i.e., to provide a good estimate of the gestational age: after
13 weeks, head sizes become more individualized and variable.
One last thing: if asked,
many researchers (and even many not-so-young biostatisticians!)
would confidently answer that the SD of the heights of 100 randomly sampled individuals
would be bigger than the SD of the heights of 10 randomly sampled individuals.
Many others would confidently answer that the SD of the 100
would be smaller than the SD of the 10. In fact, it is nearly impossible
to consistently predict which would be greater than which -- try it out using R!
A common but wrong reasoning is based on the fact that the sample SD involves
an n-1. But in fact, the squared SD is nothing more than an average squared
deviation, so the n (or n-1) has virtually nothing to do with it.
If sampling from a N(,) distribution,
the sampling distribution of a variance ratio, with df 99 and 9 in our e.g., is
not quite symmetric -- but the median ratio is close to 1!
The other common mistake is the same one made by Epstein -- mixing up SD and SE!
A population SD is determined by nature -- think of the SD of the diameters
of all of one's red blood cells. The SD of the millions of these cells
is not determined or influenced by a researcher -- the SD is a property
of the owner of the cells! The sample sd that the researcher sees in his/her
sample of n cells is just an estimate of the owner's SD. Clearly,
if the n is larger, the sample sd is more reliable, and doesn't tend to fall as
far from the true SD as the sample sd of a smaller sample would.
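The 'SD of 100 versus SD of 10' question can be settled by simulation in a few lines.
The sketch below is in Python (the same experiment is just as short in R); the N(170, 7)
height distribution, the seed and the number of repetitions are assumptions made only
for this illustration. It records how often the SD of the 100 beats the SD of the 10,
and how much each sample SD itself varies from repetition to repetition.

```python
import random
import statistics

rng = random.Random(2017)       # arbitrary fixed seed
MU, SIGMA = 170.0, 7.0          # hypothetical height distribution, in cm
REPS = 2_000

sd10s, sd100s = [], []
wins_for_100 = 0
for _ in range(REPS):
    sd10 = statistics.stdev([rng.gauss(MU, SIGMA) for _ in range(10)])
    sd100 = statistics.stdev([rng.gauss(MU, SIGMA) for _ in range(100)])
    sd10s.append(sd10)
    sd100s.append(sd100)
    wins_for_100 += sd100 > sd10

frac_100_bigger = wins_for_100 / REPS
spread_sd10 = statistics.stdev(sd10s)     # how variable the n=10 SD is
spread_sd100 = statistics.stdev(sd100s)   # how variable the n=100 SD is

print(f"fraction of runs with SD(100) > SD(10): {frac_100_bigger:.2f}")
print(f"typical SD(10)  = {statistics.fmean(sd10s):.2f} +/- {spread_sd10:.2f}")
print(f"typical SD(100) = {statistics.fmean(sd100s):.2f} +/- {spread_sd100:.2f}")
```

Neither sample 'wins' consistently. What n buys is not a bigger or smaller SD, but a
sample SD that strays less from the true one.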
0.2 Births after The Great Blackout of 1966: You don't have to
hand in answers for this exercise, which JH used to emphasize
(i) day-to-day variability; (ii) systematic differences between weekdays and weekends;
(iii) the fact that it takes a big population, with lots of births, to reliably
see the 'signal', i.e., the difference between weekdays and weekends (think
of the weekday variation as driven by Nature, and the weekend pattern
by physicians who want to have their weekends free, or would prefer
that women deliver during the week when the hospital is more fully staffed); and (iv)
the fact that the weekday variations are already close to Gaussian, so
the sampling distribution of the mean of several days can be acceptably
approximated by the Normal without needing a big 'assist' from the CLT.
The NYC blackout led to an amusing 'urban myth', but getting
a sound answer to the
question posed in part (iii) concerning the Quebec government incentive
to increase the birth rate isn't quite so easy, and would take some
ingenuity -- that's what makes epidemiologic research (with its limited
opportunities for experimental control) more challenging and more
0.3 (Planning ahead)
This 'JH-homegrown' question focuses on an important link between
the tail area of the Poisson distribution and the (opposite)
tail area of the gamma (or chi-sq) distribution.
Before the universal access to statistical packages we now enjoy,
this link was important for computational reasons: one could use
tabulated tail areas of the chi-sq distribution to directly
obtain Poisson tail areas.
The 1935 derivation of this exact (and at first, surprising)
link between the tail area of a discrete r.v. and the (opposite)
tail of a continuous r.v. is not that easy to follow.
If you want to 'see' the link more clearly, look at the form of the P value in
Illustration I p 168 or
V p 170 in Pearson's
classic 1900 paper on the
chi-sq goodness of fit statistic.
The main purpose of the exercise is to get you to
derive the link in an applied setting, by pure thought, rather than by
blind algebra. The link between the continuous and discrete rv's
is that the distance is a finite sum of distances, and the (discrete) number
is the number of failures/replacements before a given distance is achieved.
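The link can also be checked numerically. The sketch below assumes that the gaps between
successive failures are independent exponential distances with mean 1 (a modelling
assumption made for this illustration, as are the values k = 4 and mu = 2.5). The event
'at least k failures within distance mu' is the same event as 'the sum of the first k
gaps is at most mu', so an upper Poisson tail should match a lower gamma tail.

```python
import math
import random


def poisson_upper_tail(k, mu):
    """P(N >= k) for N ~ Poisson(mu), by direct summation of the pmf."""
    below = sum(math.exp(-mu) * mu**j / math.factorial(j) for j in range(k))
    return 1.0 - below


def gamma_lower_tail_sim(k, mu, reps, rng):
    """P(T_k <= mu) by simulation, T_k = sum of k unit-mean exponential gaps."""
    hits = 0
    for _ in range(reps):
        distance_to_kth_failure = sum(rng.expovariate(1.0) for _ in range(k))
        hits += distance_to_kth_failure <= mu
    return hits / reps


rng = random.Random(1935)   # arbitrary fixed seed
k, mu = 4, 2.5              # hypothetical: 4th failure within 2.5 mean gap-lengths

poisson_tail = poisson_upper_tail(k, mu)
gamma_tail = gamma_lower_tail_sim(k, mu, 200_000, rng)

print(f"P(Poisson({mu}) >= {k})       = {poisson_tail:.4f}")
print(f"P(sum of {k} gaps <= {mu})     = {gamma_tail:.4f}  (simulated)")
```

The Poisson side is computed exactly and the gamma side by brute force, so the agreement
is not built in by using the same formula twice.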
The second purpose is to sneak in the CLT. The more items in the
sum, the closer the gamma is to Normal. And, since the individual components
in the sum don't have anything near a symmetric distribution, it takes a good few items in the sum
to get the sum to 'forget' what its (many) parents are!
If the individual components had had a distribution where the mode was
not at the boundary, the CLT would 'kick in' at a lower n.
If you read French, or even if you don't but would like to admire what we think is a
diagram, take a look at
the teaching article which we prepared for a Math magazine that goes
to all Québec high school students. Instead of by car on land, it's a 10-year
journey into space, where critical items fail, and must be replaced, if
the mission is to succeed.
The diagram (pp 28-29, designed to fit
across 2 pages if printed) was a recent inspiration, but this story
has been used in bios601 from the beginning.
0.4 Selecting mice at random is not that easy!: You don't have to
hand in answers for this exercise, but it (again) emphasizes the CLT.
How accurate the approximation is depends on the distribution of the individual
weights sampled from. Given that these are lab animals, all with the same parents,
and the same boring food, that distribution probably has a central mode and is
close to symmetric (it would be less so with free-living humans, but even then
the n=30 would easily counteract the asymmetry).
0.5 (Planning ahead, II)
This relates to the same topic of asymmetry: one factor that slows down the march
towards a Normal distribution (of the sum or mean) is the lack of
symmetry of the individual summands. Another slowing factor is any lack
of independence among the summands.
0.6 Shape of waiting time distribution:
Using the shorthand 'DU' for 'Discrete Uniform', the wait is DU(22-26) minus
If the throw on die_1 (die is singular of dice) could be DU(1-6),
and the throw on die_2 also DU(1-6), what would be the shape of the
distribution of die_1 + die_2? die_1 - die_2 ?
What if we had die_1 +/- die_2 +/- die_3? What if we had continuous rv's:
U_1(0,1) +/- U_2(0,1)?
U_1(0,1) +/- U_2(0,1) +/- U_3(0,1)? The shape for the sum/diff of 3 continuous U's
is smooth, whereas the shape for the sum/diff of 2 continuous U's has a sharp mode.
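For the dice versions of these questions, the exact shapes can be had by brute-force
enumeration of all equally likely outcomes. This is only a sketch: the function names
are invented here, and the crude text 'histogram' is just a way of seeing the shape
without a plotting library.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pmf_of_combination(signs, faces=range(1, 7)):
    """Exact pmf of (+/-)die_1 (+/-)die_2 ... for fair dice; signs are +1 or -1."""
    counts = Counter(
        sum(s * roll for s, roll in zip(signs, rolls))
        for rolls in product(faces, repeat=len(signs))
    )
    total = sum(counts.values())
    return {value: Fraction(c, total) for value, c in sorted(counts.items())}


def show(label, pmf):
    """Print a rough sideways histogram of a pmf."""
    print(label)
    for value, p in pmf.items():
        print(f"{value:4d} {'#' * round(float(p) * 108)}")


two_sum = pmf_of_combination([+1, +1])
two_diff = pmf_of_combination([+1, -1])
three_sum = pmf_of_combination([+1, +1, +1])

show("die_1 + die_2", two_sum)
show("die_1 - die_2", two_diff)
show("die_1 + die_2 + die_3", three_sum)
```

The sum and the difference of two dice have the same triangular shape (only the location
differs), while with three dice the peak is already rounded.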
0.7 Snail's pace: Given the 'topic of the day',
you can probably guess the answer to part iii. Without the CLT,
you wouldn't get very far just using Tchebychev's theorem!
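To get a feel for how little Tchebychev's theorem gives you, compare its guaranteed
bound on P(|X - mu| >= k SD's) with the corresponding two-sided Normal tail area. This
is a generic comparison, not the specific calculation asked for in the exercise, and the
function names are made up for this sketch.

```python
from statistics import NormalDist


def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k*sigma), valid for any distribution."""
    return min(1.0, 1.0 / k**2)


def normal_two_sided_tail(k):
    """P(|Z| >= k) for Z ~ N(0,1), i.e. the CLT-based approximation."""
    return 2.0 * (1.0 - NormalDist().cdf(k))


for k in (1.0, 2.0, 3.0, 4.0):
    print(
        f"k = {k:.0f}:  Tchebychev bound = {chebyshev_bound(k):.4f}   "
        f"Normal tail = {normal_two_sided_tail(k):.5f}"
    )
```

The Tchebychev bound holds whatever the distribution, which is exactly why it is so
loose once the CLT entitles you to use the Normal.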
0.8 Student's t distribution: To 'Student',
the calculation of the exact tail area would have been quite laborious,
so he stopped at n=10, or (even though we had to wait for Fisher to introduce the
concept) degrees of freedom = 9. Note in particular Student's (mis)use of the p-value,
confusing it with the probability that the null (alt.) hypothesis is false (true).
Notice also his (correct) use of odds [prob:(1-prob)] rather than probability.
0.9 Cavendish's measurements of the density of the Earth: You don't have to
hand in answers for this exercise. His
arithmetic error (his reported
mean does not agree with the mean of the 29 reported
measurements) went unnoticed for quite some time.
It is also interesting to
see how early scientists tried to convey some sense of a 'margin of error.'
0.11 Sample size calculations:
Since the issue of bias has already been raised, it is natural to tie
it to some practice with sample size calculations.
Sadly, most people now use 'canned' programs to do these, even though,
as this exercise is designed to show, they are easy to
do 'from scratch', via first principles, once one draws a diagram
and sees what is involved. If you don't like the example of
detecting a machine that is set to underfill soft-drink bottles,
maybe you will like the one that the Moore and McCabe
textbook used to use, although it is concerned with
an alternative where the mean is higher than it should be:
15.17 Is this milk watered down?
Cobra Cheese Company buys milk from several suppliers.
Cobra suspects that some producers are adding water to their milk
to increase their profits.
Excess water can be detected by measuring
the freezing point of the milk.
The freezing temperature of natural milk varies
Normally, with mean mu = -0.545 Celsius (C)
and standard deviation sigma = 0.008 C.
Added water raises the freezing temperature toward
0 C, the freezing point of water.
Cobra's laboratory manager measures the
freezing temperature of (n = ) five consecutive
lots of milk from one producer. The mean
measurement is ybar = -0.538 C.
Is this good evidence that the producer
is adding water to the milk?
State hypotheses, carry out the test,
give the P-value, and state your conclusion.
Moore and McCabe don't ask you to calculate the sample
size you would need to catch a cheater who was adding
(a) just a small amount, or (b) a large amount, of water to the milk.
Clearly if the amount raised the freezing point to
-0.540, it would be much more difficult to catch this than if
the amount raised it to -0.535.
You might find a sample size exercise based on detecting cheating
more realistic than the one
of the same structure (but mirror image) in Q 0.11.
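Both the test in the milk example and the 'from scratch' sample size calculation fit in
a few lines. The sketch below takes mu = -0.545 and sigma = 0.008 from the exercise text
and treats sigma as known; the one-sided alpha of 0.05 and the power of 0.80 are
assumptions made for this illustration, not values given in the exercise.

```python
import math
from statistics import NormalDist

Z = NormalDist()                 # standard Normal
MU0, SIGMA = -0.545, 0.008       # natural milk, from the exercise text


def one_sided_p_value(ybar, n):
    """P-value for H0: mu = MU0 versus H1: mu > MU0, with sigma known."""
    z = (ybar - MU0) / (SIGMA / math.sqrt(n))
    return 1.0 - Z.cdf(z)


def sample_size(mu_alt, alpha=0.05, power=0.80):
    """Smallest n for a one-sided level-alpha z-test to detect mu_alt with this power.

    alpha and power are illustrative defaults, not part of the exercise.
    """
    delta = mu_alt - MU0
    n = ((Z.inv_cdf(1.0 - alpha) + Z.inv_cdf(power)) * SIGMA / delta) ** 2
    return math.ceil(n)


p_value = one_sided_p_value(-0.538, 5)
n_small_cheat = sample_size(-0.540)   # a little water added
n_large_cheat = sample_size(-0.535)   # a lot of water added

print(f"P-value for ybar = -0.538, n = 5: {p_value:.4f}")
print(f"n needed if the mean is raised to -0.540: {n_small_cheat}")
print(f"n needed if the mean is raised to -0.535: {n_large_cheat}")
```

Halving the shift to be detected quadruples the n, since n is proportional to
(sigma/delta) squared.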
0.12 Bootstrap Investigation of Sampling Variability of an estimator:
The bootstrap was developed to quantify the
'difficult to study analytically' behaviour of estimators,
and so it suits the purpose here: rather than relying on your (or my) intuition
that 30 is enough or 200 is enough, you can effectively simulate the variation.
Of course, it would have been a tip-off that the large-sample interval estimate
was inappropriate if the CI based on the Gaussian model
(CLT based) included a negative mu! Interestingly, this is the first year
where I have included this question, and it was only because a classmate of
yours raised the possibility of doing so. So, I am never too old to learn,
and I still consider myself -- like Gosset -- a 'student'.
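A bare-bones version of the bootstrap idea, as a sketch only: the data here are a
made-up right-skewed, non-negative sample (not the exercise's data), the estimator is
simply the sample mean, and the interval is the simple percentile interval, which is one
of several bootstrap intervals in use.

```python
import random
import statistics


def bootstrap_means(data, reps, rng):
    """Means of `reps` resamples of `data`, each drawn with replacement."""
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(reps)]


rng = random.Random(12)   # arbitrary fixed seed
# Hypothetical right-skewed, non-negative sample (not the exercise's data).
data = [rng.expovariate(1 / 5.0) for _ in range(30)]

boot = sorted(bootstrap_means(data, 4_000, rng))
lo = boot[int(0.025 * len(boot))]        # 2.5th percentile of the resampled means
hi = boot[int(0.975 * len(boot)) - 1]    # 97.5th percentile

se_boot = statistics.stdev(boot)
se_formula = statistics.stdev(data) / len(data) ** 0.5

print(f"sample mean        = {statistics.fmean(data):.2f}")
print(f"bootstrap SE       = {se_boot:.2f}   (formula SE = {se_formula:.2f})")
print(f"percentile 95% CI  = ({lo:.2f}, {hi:.2f})")
```

Because every resample is made of non-negative values, the percentile interval cannot
stray below zero, unlike a symmetric 'mean +/- 1.96 SE' interval.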
0.13 Planning ahead - the (2015) sequel: JH found himself
thinking probabilistically on Orientation Day 2015, when he had to leave the lunch early to get
to his appointment.
0.14 Laplace, before computers: You have
to both admire and sympathize with Laplace, who could come up with
an exact, but at the time entirely impractical, probability formula, which got worse with n.
But his 1810 approximation for the distributions of sums of n iid rvs from any well-behaved family (not just uniform)
got better the greater the n.
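For the uniform case, the contrast between the exact formula and the approximation can
be reproduced today. The sketch below uses the alternating-sum formula for the
distribution of a sum of n U(0,1) r.v.'s (nowadays usually called the Irwin-Hall
distribution); treating it as the 'exact formula' of the exercise is an assumption made
here. Exact fractions are used because the formula adds and subtracts huge terms, which
is what made it so impractical by hand.

```python
from fractions import Fraction
from math import comb, factorial, floor, sqrt
from statistics import NormalDist


def exact_cdf_sum_uniforms(x, n):
    """P(U_1 + ... + U_n <= x) for i.i.d. U(0,1), by the alternating-sum formula."""
    x = Fraction(x)
    if x <= 0:
        return Fraction(0)
    if x >= n:
        return Fraction(1)
    total = sum(
        (-1) ** k * comb(n, k) * (x - k) ** n for k in range(floor(x) + 1)
    )
    return total / factorial(n)


def normal_cdf_sum_uniforms(x, n):
    """CLT approximation: the sum is treated as N(n/2, n/12)."""
    return NormalDist(n / 2, sqrt(n / 12)).cdf(x)


for n in (2, 5, 12, 30):
    x = n / 2 + sqrt(n / 12)          # one SD above the mean of the sum
    exact = float(exact_cdf_sum_uniforms(x, n))
    approx = normal_cdf_sum_uniforms(x, n)
    print(f"n = {n:2d}:  exact = {exact:.5f}   Normal approx = {approx:.5f}")
```

The number of terms, and their size, grows with n in the exact formula, while the
one-line Normal approximation only gets closer.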
0.15 Dice for Statistical Investigations: Another
enterprising, but much more practical (and self-taught) statistician and polymath.
Maybe if we get access to a 3D printer, we can build a pentakisdodecahedron
(a golf ball would be ok too, but it has too many 'dimples' and they are too small
to write the values into them!).