BIOS601 AGENDA: Mon Oct 2, Wed Oct 4, Wed October 11; 2017
[updated Sept 30, 2017]
- Discussion of issues
Notes and assignment on
Answers to be handed in for:
(Supplementary) Exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.6(PhD only),
3.7, 3.8, 3.9, 3.10(PhD only), 3.12, 3.13(PhD only).
Remarks on Notes:
These notes were developed to supplement the Clayton and Hills chapter,
which was aimed at epidemiologists, and which does not give the
derivations (the 'wiring' and 'theory') below the results (the user's view of the car).
Even if you have, in your previous mathematical-statistics courses,
covered Likelihood, Maximum Likelihood (ML) and
Maximum Likelihood estimators (MLEs), it is unlikely
that they were presented to you in the same way as JH introduces them
here through this series of exercises. These exercises are designed to
reinforce the point that a likelihood is (proportional to) the probability,
itself a function of theta, of observing the data or outcomes we did observe,
and that the component probabilities in the product (we sum their logs
to obtain an overall log-likelihood)
are not always of the simple form
pdf(y). Many of JH's examples involve 'binned' data, where the corresponding
probabilities cannot be accurately approximated by rectangles, i.e., by
pdfs at the data values (the y's) multiplied by a delta-y. Instead, they
are calculated as integrals. And in some cases, as when we have right- or
left-censored data, the integral can be over an open-ended interval.
A big attraction of the Likelihood approach is its ability
to obtain information on a parameter by
combining information of different types. Another
is the fact that the process of finding the MLE numerically involves
the curvature of the log-likelihood function, and this provides a direct way
to calculate the estimated (co)variance of the estimator.
The exercises are also designed to reinforce the point that
MLEs do not always have a closed form.
JH will (and you should) try to distinguish between
an estimator (a procedure, formula, etc.) and
an estimate, the numerical result
of applying this estimator to data. Of course, the
abbreviation MLE stands for both the estimator and the estimate.
Remarks on assigned exercises.
3.1 Daniel Bernoulli's example.
See Notes and JH Presentation on 1st Pillar of Wisdom and Fitting of Early Error Distributions.
Apart from the intent to bring in a bit of history, this is also an example
where someone has worked out the likelihood and maximized it by brute force, but you
can simply use R. From what JH gathers from others, optimize is
recommended for 1-D problems, and
optim for higher dimensional problems.
In case it isn't clear from Bernoulli's written description of his 'error function',
here is a picture of it (with r taken as 1). The graph also shows the
log-likelihood calculations at 9 different values of the parameter.
JH has used theta to denote the parameter -- the (unknown) centre of the error distribution.
Note also that in the R code used to produce these calculations and graphs,
the pdf is appropriately 'normalized' so that the area under it is 1.
The area of a semi-circle with radius 1 is pi/2. So the proper pdf(y)
is (2/pi)*sqrt(1 - (y-theta)^2 ) for y within 1 of theta, and zero elsewhere.
As it happens, this constant does not involve theta, and that is why Bernoulli
does not bother with it when trying to find what value of theta makes the
product of the 3 probabilities as large as possible.
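As a sketch of the brute-force approach in code (in Python rather than R; the three observations below are hypothetical, not Bernoulli's actual values), one can write down the semicircular log-likelihood and scan it over the admissible range of theta:

```python
import math

# Hypothetical observations -- Bernoulli's actual values are in the Notes.
obs = [0.0, 0.2, 0.9]

def log_lik(theta):
    """Log-likelihood under Bernoulli's semicircular error law with r = 1:
    pdf(y) = (2/pi) * sqrt(1 - (y - theta)^2) for |y - theta| < 1."""
    total = 0.0
    for y in obs:
        d = y - theta
        if abs(d) >= 1:
            return -math.inf   # theta must be within 1 of every observation
        total += math.log((2 / math.pi) * math.sqrt(1 - d * d))
    return total

# Brute-force grid search over (max(obs) - 1, min(obs) + 1); R's optimize()
# (or scipy.optimize.minimize_scalar in Python) does this more cleverly.
lo, hi = max(obs) - 1 + 1e-6, min(obs) + 1 - 1e-6
grid = [lo + (hi - lo) * k / 10000 for k in range(10001)]
theta_hat = max(grid, key=log_lik)
```

Note that dropping the (2/pi) constant would shift the whole log-likelihood curve but not move its maximum, which is exactly why Bernoulli could ignore it.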
Notice also that the density itself is not a probability, and that
the probability of observing a value of exactly 0.2 is zero. The probability
of a '0.2' is really the probability of observing a value that,
WHEN ROUNDED or BINNED, is 0.2.
Assuming the bin width is 0.1, the probability is close to
the area of the rectangle with height density(0.2) and width 0.1.
Again, like the other constant, the 3 base widths drop out when we
focus on the part of the log-likelihood that involves theta.
While it would have been tempting to show the densities at the
3 observed values, JH has instead calculated and shown their 3 logs,
since our focus is always on the log-likelihood, and never on the
likelihood itself. Indeed, Fisher's 1912 article did not even show
the likelihood: he started right away with the sum of the logs. Also,
it is easier to add the logs than to multiply the densities. The only downside is
that the sum of the logs is often negative, and so to find where
the log-likelihood is maximum, you have to find where the sum of the logs
is 'least negative.'
Lastly, given this diagram and way of proceeding, you can now easily imagine
replacing it with the one for Laplace's first error model,
or his second one (the same as Gauss's one).
3.2 Fisher's example of measurement errors.
The measurement errors are grouped into classes (bins).
Usually we use bins of our own choosing, for convenience,
e.g., age bins 15-20, 20-25, etc,
or annual income brackets $20K-$40K, $40K-$60K, etc. ... sometimes open-ended.
What's different here is that, because Fisher is dealing with a 'symmetric about 0
Normal distribution', he puts the errors between -1 and 0 in the same bin as
those between 0 and 1;
likewise he puts the errors between -2 and -1 in the same bin as
those between 1 and 2. He is cutting down on his
numerical work, using absolute errors,
since the expected proportion in a 'left of 0' 1-unit wide bin should be the same
as the proportion in its 'right of 0' counterpart.
In this example, the frequency data are multinomial,
so there is one joint probability,
involving the usual product of probabilities, each one an integral.
Its log will be a sum, but with some dependence.
But, in other cases
(e.g., the dilution experiment, or meta-analysis),
we merge likelihoods from independent datasets or observations.
Usually, when we learn likelihood, each bin is very narrow
(e.g., someone's height or weight is to the nearest cm or kg)
so the prob. mass in the bin is closely
approximated by the pdf(mid-bin) x the width of the bin =
pdf(y) x dy.
But this is not the case here; while you might get a
reasonable approximation to some bin probabilities
using the area in a rectangle, it won't work very well
for others, and it won't work at all for the (effectively) open-ended bins.
I expect you will realize that there is no closed-form expression
for the MLE of sigma-squared or sigma, so you can instead
plot the Log-likelihood and visually find the MLE.
You might find the log-likelihood involving log(sigma-squared)
better behaved than the log-likelihood involving sigma-squared itself.
And, of course, you can search in the sigma or sigma-squared scale.
Later on, we will use more sophisticated numerical ways to
find the MLE, and at the same time to calculate the curvature,
and thus, the precision of the estimate. This iterative
method is especially valuable when the parameter is multi-dimensional.
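A minimal sketch of this grid-based approach (in Python rather than R, and with hypothetical bin counts, not Fisher's actual data): since the errors are folded about 0, the probability of an absolute error falling in [a, b) is 2*(Phi(b/sigma) - Phi(a/sigma)), with the last bin open-ended.

```python
import math

def Phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical counts of absolute errors in [0,1), [1,2), [2,3), [3, infinity)
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, math.inf)]
counts = [114, 84, 53, 24]

def log_lik(sigma):
    """Multinomial log-likelihood (up to a constant) with folded-normal
    bin probabilities P(|e| in [a,b)) = 2*(Phi(b/sigma) - Phi(a/sigma))."""
    ll = 0.0
    for (a, b), n in zip(bins, counts):
        upper = 1.0 if math.isinf(b) else Phi(b / sigma)
        ll += n * math.log(2 * (upper - Phi(a / sigma)))
    return ll

# Evaluate on a sigma grid; in practice one would plot these values
# and read off (or refine) the maximum.
grid = [0.5 + 0.001 * k for k in range(4501)]   # sigma from 0.5 to 5.0
sigma_hat = max(grid, key=log_lik)
```

The same scan could be done on the sigma-squared or log(sigma-squared) scale by a one-line change to the grid.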
For 3.3 (Galton's data on the speed of pigeons),
most bins are 100yards/min wide,
but the '-500' bin
extends all the way from 0 to 500 (technically, since we are dealing
with a N(,) distribution, from -Inf to 500), but all of these
racing pigeons, even the older ones, have speeds well above 0.
Even though they do not involve time (the subject of
survival analysis), the data in exercises 3.2 and 3.3 are in fact censored.
In 3.4 (déjà vu)
the 2 datapoints are independent,
so the log-lik is a sum of two independent log(probability)'s. In this
case, you already obtained a closed-form estimator. One of the points of this
question was to emphasize that
for ML estimation, one needs a fully
specified distributional form for each observation. In contrast,
the LS estimator did not require a distributional form for
the errors about the 'line of means'; indeed, the LS
estimator is a purely numerical line-fitting
approach, and doesn't rely on any statistical assumptions. It is
only if we wish to describe the statistical behaviour of the
LS estimator that we need to specify a complete model.
The observations in
3.5 and 3.6
are examples of censored 'survival' data.
One of the points of JH's choice of the 'tumbler' longevity data
is that it shows that there are (at least)
two equivalent ways of
deriving a likelihood.
* One can
regard the frequencies as the realization of a single
multinomial r.v., just as in examples 3.1 and 3.2.
In this approach, all of the 'bin' probabilities
must add to 1, so we can think of it
as an 'unconditional' approach.
* By focusing on week-specific 'failure' rates
or 'hazard' rates --
one can proceed week by week, and treat each week
as a separate (conditional) binomial.
In this approach, each conditional probability
and its complement add to 1 within each week,
but across weeks, the week-specific failure probabilities
are not constrained to add to 1.
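The equivalence of the two constructions can be checked numerically. This sketch (in Python rather than R, with hypothetical weekly death counts, not the actual tumbler data) evaluates both log-likelihoods at the same set of week-specific hazards:

```python
import math

# Hypothetical weekly death counts (not the actual 'tumbler' data):
deaths = [5, 8, 6, 4]               # deaths in weeks 1..4
n_start = 40                        # alive at the start of week 1
survivors = n_start - sum(deaths)   # still alive after week 4

# Arbitrary week-specific hazards at which to evaluate both likelihoods:
h = [0.10, 0.20, 0.25, 0.15]

# (1) Unconditional: one multinomial, with bin probabilities
#     pi_k = h_k * prod_{j<k} (1 - h_j), plus a 'survived all weeks' bin.
log_lik_multinomial = 0.0
surv = 1.0
for d_k, h_k in zip(deaths, h):
    log_lik_multinomial += d_k * math.log(surv * h_k)
    surv *= 1 - h_k
log_lik_multinomial += survivors * math.log(surv)

# (2) Conditional: a separate binomial each week, among those still at risk.
log_lik_binomial = 0.0
at_risk = n_start
for d_k, h_k in zip(deaths, h):
    log_lik_binomial += d_k * math.log(h_k) + (at_risk - d_k) * math.log(1 - h_k)
    at_risk -= d_k
```

The two totals agree exactly (the multinomial coefficient, which does not involve the hazards, is omitted from both), so either construction leads to the same MLEs.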
The data in 3.6 have a very special structure: each
observation is either right-censored (if
the person is pulled out alive)
or left-censored (if pulled out dead).
The same kind of 'current-status' data arise
if we conduct a maternal-and-child survey, and ask
(i) the infant's age and (ii) whether the
infant is still being breast-fed, and wish
to convert the responses into a 'still-breastfeeding'
curve that shows, for each week or month of age,
what proportion of infants are still being breast-fed at that time point.
In the ML approach, each likelihood contribution
is the integral of the pdf from either 0 to
't' or from 't' to Inf.
As we will discuss further in class, it is not easy
to come up with a single sensible pdf for the avalanche
data, since there are at least 3 separate causes of death,
such as trauma, asphyxia, and hypothermia, and
the dataset does not distinguish them. But, to keep
it simple, for the purposes of this chapter, we will
adopt a single 1- or 2-parameter distribution.
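As one concrete sketch (in Python rather than R; the exponential model and the burial-time data below are hypothetical stand-ins, not the real avalanche dataset), the current-status likelihood is built from the two kinds of integral contributions just described:

```python
import math

# Hypothetical (time-buried, found-dead?) pairs -- not the real data.
data = [(0.5, 0), (1.0, 0), (1.5, 1), (2.0, 0), (3.0, 1), (5.0, 1)]

def log_lik(lam):
    """Current-status log-likelihood under an exponential survival model:
    a subject pulled out alive at t contributes S(t) = exp(-lam*t)
    (the integral of the pdf from t to infinity); one pulled out dead
    contributes F(t) = 1 - exp(-lam*t) (the integral from 0 to t)."""
    ll = 0.0
    for t, dead in data:
        ll += math.log(1 - math.exp(-lam * t)) if dead else -lam * t
    return ll

# Grid search for the MLE of the rate parameter lambda:
grid = [0.01 * k for k in range(1, 1001)]   # lambda from 0.01 to 10
lam_hat = max(grid, key=log_lik)
```

Any other 1- or 2-parameter survival distribution could be substituted by changing only the two cumulative-probability expressions.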
3.7 note JH's suggestion
to re-parametrize the proportions so that they always stay within (0,1).
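One standard way to do this (a sketch in Python rather than R, with hypothetical binomial data) is to work on the log-odds scale: searching over an unconstrained beta and mapping it back through the inverse logit guarantees the implied proportion never leaves (0, 1).

```python
import math

def expit(beta):
    """Inverse logit: maps any real beta to a proportion in (0, 1)."""
    return 1 / (1 + math.exp(-beta))

# Hypothetical binomial data: y successes out of n trials.
y, n = 7, 20

def log_lik(beta):
    p = expit(beta)   # p can never escape (0, 1), whatever beta is
    return y * math.log(p) + (n - y) * math.log(1 - p)

# Unconstrained grid search on the beta (log-odds) scale:
grid = [-5 + 0.001 * k for k in range(10001)]
beta_hat = max(grid, key=log_lik)
p_hat = expit(beta_hat)
```

Here the maximizer lands, as it must, at p_hat = y/n, but the same device works when no closed form exists.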
3.8 Dilution series is a very good example
where we do not have any obvious and easy estimator, the way we often do for
LS. This was one of Fisher's very first (and very compelling) uses of ML estimation.
3.9 Pooled Testing Another example
where we do not have an obvious estimator.
3.10 Accuracy in Throwing Darts
As JH says in the preamble to the questions, the article
by Tibshirani et al. is a delightful teaching article,
that brings in several topics.
Part 1 of the question brings us back to the sum
of the squares of 2n independent N(0,1) random variables.
So, knowing what its distribution is, it is easy to calculate its expected value,
and to equate this with the sum
of the squares of the 2n observed errors. This is the method of moments.
Here, with all of the squares rolled into 1 sum or mean,
we have one observation (the mean squared error) and
1 parameter (sigma-squared).
So, there is no 'conflict.'
This is unlike our '2 datapoints and a model' example where we
had 2 'conflicting' observations and 1 parameter, so we had to
take some intermediate value as a 'best' compromise
(which intermediate value depends on the model
and the 'loss' function).
If we choose the LS criterion in the
'1 data point, 1 parameter' situation of the errors in darts,
then we can make the sum of squares between the
observed mean and the fitted sigma-squared equal to zero.
Similarly, since we have just 1 observation from a scaled
chi-squared distribution, it is obvious that the likelihood
(or log-likelihood) is maximized by taking the MLE
of sigma-squared to be the
observed mean squared error.
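A quick simulation (in Python rather than R; the true sigma and sample size are arbitrary choices) illustrates the moment/ML estimator for the darts setting:

```python
import math
import random

random.seed(1)
sigma_true = 2.0
n = 100_000   # number of darts; each has an x-error and a y-error

# Simulate 2n independent N(0, sigma^2) errors (x and y for each dart).
sq_errors = [random.gauss(0, sigma_true) ** 2 for _ in range(2 * n)]

# The sum of 2n squared N(0, sigma^2) errors is sigma^2 times a
# chi-squared(2n) variable, with expected value 2n * sigma^2.
# Equating the observed sum to its expectation (method of moments):
sigma2_hat = sum(sq_errors) / (2 * n)
# With one 'observation' (the mean square) and one parameter, the MLE
# of sigma^2 coincides with this moment estimator.
```

With this many throws the estimate sits very close to the true sigma-squared of 4.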
For the second last part of Q1, we can apply the same 'trial and error'
approach we used with the Clopper-Pearson CI for the
parameter of a binomial.
The last part of Q1 goes back to the
chi-sq distribution with 2 df, and to the grouping of each pair
of squared errors to form an exponential distribution.
So the sum of these n sums is then a gamma distribution with
shape parameter n.
The square root of the chi-sq distribution with 2 df has
a lot of practical applications, and even has its own name:
the Rayleigh distribution.
According to Wikipedia,
the distribution is named after Lord Rayleigh.
In other years, JH asked students to use the EM algorithm
to estimate the parameter sigma-squared. And indeed, Tibshirani's
main purpose was to show the EM algorithm in action.
Indeed, this darts example is a very nice one, because
the scores give you incomplete information on where
the dart landed. For example, a score of 18
might have been a single, or a double-9 or a triple-6.
So several landing-segments lead to the same score.
The EM algorithm begins with guessed-at
values for where the incomplete data come from.
Then, using these, the 'E' step works out what the expected value of
the 'complete-data' log-likelihood would be, and then in the 'M' step,
maximizes this to get a new sigma-squared
estimate. The process continues until the estimates do not change.
It might be easier to explain this in the case of Fisher's
binned-errors example. One could start by putting all the
observations at the mid-point of their bins, and then estimate
sigma-squared. One could then use this to compute what the
expected value of the 'complete-data' log-likelihood would be,
and what that implies about the locations of the complete data.
You would imagine that it moves them inwards, i.e., towards the left boundary
of each bin. The expected values of these are used to
compute a new parameter estimate, and so on and so on.
It turns out that this is not the same as simply moving the
y's to the left: as you will see in the R code, the complete
observations would have contributed to the log-likelihood
through their squares. So in fact we need to know where the
new 'centre of gravity' is, not of the values, but of their squares.
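This EM for the binned-errors setting can be sketched as follows (in Python rather than R, with hypothetical bin counts). Because the complete-data log-likelihood for a zero-mean normal depends on the data only through the sum of squares, the E-step reduces to computing the expected squared error within each bin, using the identity that the integral of z^2 phi(z) is Phi(z) - z*phi(z):

```python
import math

def phi(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# Hypothetical counts of absolute errors in [0,1), [1,2), [2,3), [3, inf)
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, math.inf)]
counts = [114, 84, 53, 24]
N = sum(counts)

def expected_square(a, b, s2):
    """E-step ingredient: E[Y^2 | a <= |Y| < b] for Y ~ N(0, s2),
    the 'centre of gravity' of the squares within the bin."""
    s = math.sqrt(s2)
    alpha, beta = a / s, b / s
    mass = (1.0 if math.isinf(b) else Phi(beta)) - Phi(alpha)
    tail = 0.0 if math.isinf(b) else beta * phi(beta)
    return s2 * (mass - tail + alpha * phi(alpha)) / mass

s2 = 1.0   # starting guess for sigma^2
for _ in range(200):
    # E-step: expected squared error for an observation in each bin;
    # M-step: the new sigma^2 is the count-weighted mean of these.
    s2_new = sum(n * expected_square(a, b, s2)
                 for (a, b), n in zip(bins, counts)) / N
    if abs(s2_new - s2) < 1e-10:
        s2 = s2_new
        break
    s2 = s2_new
```

At convergence s2 is a fixed point of the E-then-M update, which is the MLE for the grouped-data likelihood.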
There is a reason JH brings up this Rayleigh distribution: it describes
the Euclidean distance of the dart from the centre of the target.
And the rings on the dart board divide this radius into 7
regions, whose probability masses are governed entirely
by the parameter sigma-squared.
We thus have a multinomial distribution with these
7 probabilities; the problem is very like Fisher's
one with the binned errors, except that the
widths of the bins here are set by the person who
designed the dart board (see the Tibshirani paper,
or JH's R code, for the spacings).
3.11 EM for 'heaped' data.
A very nice example of binned (censored) data, and of how easy it is nowadays
to fit various models -- once one is able
to write down the generic likelihood.
3.13 Coarsened ('suppressed') count data.
Same 'old', except that most bins are just 1 unit wide,
whereas the bin with coarsened counts spans the counts from 1 to 4 inclusive.
Here, since there are no covariates for
the different observations,
the amount of computation can be reduced
if -- for the observations with exact counts -- one first adds
up the numerators and the PY denominators,
i.e., collapses the exact observations into a single observation.
However it will
be good practice for you if you have each observation
contribute separately to the log-likelihood.
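A sketch with each observation contributing separately (in Python rather than R; the counts, person-years, and Poisson rate model below are hypothetical, not the actual dataset): a suppressed cell contributes the sum of the Poisson probabilities of the counts 1 through 4, while an exact count contributes the ordinary Poisson log-probability.

```python
import math

def pois_pmf(k, mu):
    return math.exp(-mu) * mu ** k / math.factorial(k)

# Hypothetical records: (count, person-years); a count of None stands
# for a suppressed cell, known only to lie between 1 and 4 inclusive.
records = [(0, 120.0), (7, 850.0), (12, 1400.0), (None, 300.0), (None, 95.0)]

def log_lik(lam):
    ll = 0.0
    for y, py in records:
        mu = lam * py
        if y is None:   # coarsened: sum the Poisson mass over counts 1..4
            ll += math.log(sum(pois_pmf(k, mu) for k in range(1, 5)))
        else:           # exact count: ordinary Poisson log-probability
            ll += y * math.log(mu) - mu - math.lgamma(y + 1)
    return ll

# Grid search for the MLE of the rate per person-year:
grid = [0.0001 * k for k in range(1, 2001)]
lam_hat = max(grid, key=log_lik)
```

Collapsing the exact-count records into one (summed counts, summed PY) would change only the constant term of the log-likelihood, not the location of its maximum.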