BIOS601 AGENDA: Thu Oct 01, Tue Oct 06, Thu Oct 08, Tue Oct 13, 2015
[updated Sept 28, 2015]
Agenda for Thu Oct 01, Tue Oct 06, Thu Oct 08, Tue Oct 13, 2015
- Discussion of issues in JH's Notes and assignment on Likelihood estimation
Answers to be handed in for:
(Supplementary) Exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.7, 3.8, 3.9
Remarks on Notes:
These notes were developed to supplement the Clayton and Hills chapter,
which was aimed at epidemiologists, and which does not give the
derivations (the 'wiring' and 'theory') underneath the results (the user's view of the car).
Even if you have, in your previous mathematical-statistics courses,
covered Likelihood, Maximum Likelihood (ML) and
Maximum Likelihood estimators (MLEs), it is unlikely
that they were presented to you in the way JH introduces them
here, through this series of exercises. These exercises are designed to
reinforce the point that a likelihood is (proportional to) the probability,
viewed as a function of theta, of observing the data or outcomes we did observe,
and that the component probabilities in the product (we sum their logs
to obtain an overall log-likelihood) are not always of the simple form
pdf(y). Many of JH's examples involve 'binned' data, where the corresponding
probabilities cannot be accurately approximated by rectangles, i.e., by
pdfs at the data-values (the y's) multiplied by a delta-y. Instead, they
are calculated as integrals. And in some cases, as when we have right- or
left-censored data, the integral is over an open-ended interval.
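For instance (a minimal sketch in R, with made-up numbers rather than data
from the exercises), under a N(mu, sigma) model the contribution of a 'binned'
observation is an integral of the pdf, and that of a right-censored one is a
tail area:

  ## likelihood contributions under a N(mu, sigma) model; values hypothetical
  mu <- 10; sigma <- 2
  ## an observation known only to lie in the bin (9, 11]:
  p.bin    <- pnorm(11, mu, sigma) - pnorm(9, mu, sigma)  # exact integral
  p.approx <- dnorm(10, mu, sigma) * (11 - 9)             # rectangle approximation
  ## an observation right-censored at 12 (all we know is y > 12):
  p.cens   <- 1 - pnorm(12, mu, sigma)                    # open-ended integral
  c(p.bin, p.approx, p.cens)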
A big attraction of the Likelihood approach is its ability
to obtain information on a parameter by
combining information of different types.
The exercises are also designed to reinforce the point that
not all
MLEs have a closed form.
JH will (and you should) try to distinguish between
estimator
(a procedure, formula, etc) and
estimate, the numerical result
of applying this estimator to data. Of course, the
abbreviation MLE stands for both the estimator and the estimate.
Remarks on assigned exercises:
3.1 Daniel Bernoulli's example.
See Notes. Apart from the intent to bring in a bit of history, this is also an example
where someone has worked out the likelihood and maximized it by brute force, but you
can simply use R. From what JH gathers from others, optimize is recommended for 1-D problems, and
optim for higher-dimensional problems.
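For a generic illustration of optimize (a minimal sketch; the log-likelihood
below is a made-up binomial one, not Bernoulli's):

  ## maximize a 1-parameter log-likelihood with optimize();
  ## hypothetical data: 7 'successes' in 10 trials
  logLik <- function(p) 7 * log(p) + 3 * log(1 - p)
  fit <- optimize(logLik, interval = c(0.001, 0.999), maximum = TRUE)
  fit$maximum   # the MLE; the analytic answer is 7/10
  ## optim(start.values, negLogLik) works similarly in higher dimensions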
3.2 Fisher's example of measurement errors.
The measurement errors are grouped into classes (bins).
Usually we use bins of our own choosing, for convenience,
e.g., age bins 15-20, 20-25, etc.,
or annual income brackets $20K-$40K, $40K-$60K, etc., sometimes open-ended.
What's different here is that, because Fisher is dealing with a Normal
distribution symmetric about 0, he puts the errors between -1 and 0 in the same bin as
those between 0 and 1;
likewise he puts the errors between -2 and -1 in the same bin as
those between 1 and 2. He is cutting down on his
numerical work, using absolute errors,
since the expected proportion in a 'left of 0' 1-unit wide bin should be the same
as the proportion in its 'right of 0' counterpart.
In this example, the frequency data are multinomial,
so there is one joint probability,
involving the usual product of probabilities, each one an integral.
Its log will be a sum, though the multinomial counts themselves are not independent.
But, in other cases
(e.g., the dilution experiment, or meta-analysis),
we merge likelihoods from independent datasets or observations.
Usually, when we learn likelihood, each bin is very narrow
(e.g., someone's height or weight is recorded to the nearest cm or kg),
so the probability mass in the bin is closely
approximated by pdf(mid-bin) x the width of the bin, i.e., by
pdf(y) x dy.
But this is not the case here; while you might get a
reasonable approximation to some bin probabilities
using the area in a rectangle, it won't work very well
for others, and it won't work at all for the (effectively)
open-ended bins.
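To make the bin probabilities concrete (a sketch with a hypothetical sigma
and 1-unit-wide absolute-error bins; the actual limits and frequencies are in
the exercise):

  ## probabilities of absolute-error bins under N(0, sigma); sigma hypothetical
  sigma  <- 1.5
  k      <- 0:3                                          # bins (0,1], ..., (3,4]
  p.bins <- 2 * (pnorm((k + 1)/sigma) - pnorm(k/sigma))  # left + right halves
  p.open <- 2 * (1 - pnorm(4/sigma))                     # open-ended: |error| > 4
  sum(c(p.bins, p.open))                                 # the probabilities add to 1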
I expect you will realize that there is no closed-form expression
for the MLE of sigma-squared or sigma, so you can instead
plot the log-likelihood and find the MLE visually.
You might find the log-likelihood involving log(sigma-squared)
better behaved than the log-likelihood involving sigma-squared itself.
And, of course, you can search in the sigma or sigma-squared scale.
Later on, we will use more sophisticated numerical ways to
find the MLE, and at the same time to calculate the curvature,
and thus, the precision of the estimate. This iterative
method is especially valuable when the parameter is multi-dimensional.
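A sketch of the plot-and-search approach (the frequencies below are made up;
substitute Fisher's):

  ## multinomial log-likelihood for binned absolute errors, as a
  ## function of log(sigma); frequencies are hypothetical
  freq <- c(30, 22, 10, 5, 3)        # bins (0,1], (1,2], (2,3], (3,4], (4, Inf)
  logLik <- function(log.sigma) {
    sigma <- exp(log.sigma)
    cuts  <- c(0:4, Inf)
    p     <- 2 * (pnorm(cuts[-1]/sigma) - pnorm(cuts[-6]/sigma))
    sum(freq * log(p))
  }
  grid <- seq(log(0.5), log(5), length.out = 200)
  plot(grid, sapply(grid, logLik), type = "l",
       xlab = "log(sigma)", ylab = "log-likelihood")
  exp(optimize(logLik, c(log(0.5), log(5)), maximum = TRUE)$maximum)  # MLE of sigma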
For 3.3 (Galton's data on the speed of pigeons), most bins are 100 yards/min wide,
but the '-500' bin
extends all the way from 0 to 500 (technically, since we are dealing
with a N(mu, sigma) distribution, from -Inf to 500); in practice, all of these
racing pigeons, even the older ones, have speeds well above 0.
Even though they do not involve time (the subject of
survival analyses),
the data in exercises 3.2 and 3.3 are in fact
'interval-censored'.
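In R, an interval-censored contribution for 3.3 looks like this (a sketch
with hypothetical mu, sigma and counts):

  ## interval-censored log-likelihood terms under N(mu, sigma); values hypothetical
  mu <- 900; sigma <- 150
  ll.bin  <- 4 * log(pnorm(600, mu, sigma) - pnorm(500, mu, sigma)) # 4 birds in (500, 600]
  ll.open <- 2 * log(pnorm(500, mu, sigma))                         # 2 birds in the '-500' bin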
In 3.4 (déjà vu) the 2 datapoints are independent,
so the log-lik is a sum of two independent log(probability)'s. In this
case, you already obtained a closed-form estimator. One of the points of this
question was to emphasize that
for ML estimation, one needs a fully
specified distributional form for each observation. In contrast,
the LS estimator did not require a distributional form for
the errors about the 'line of means'; indeed, the LS
estimator is a purely numerical line-fitting
approach, and doesn't rely on any statistical assumptions. It is
only if we wish to describe the statistical behaviour of the
LS estimator that we need to specify a complete model.
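A small simulated illustration of this point (not part of the exercise): once
we do add a Normal error model, the ML fit of the line reproduces the LS fit
from lm().

  ## ML under a Normal error model reproduces the LS line; simulated data
  set.seed(1)
  x <- 1:20; y <- 2 + 0.5 * x + rnorm(20, sd = 1)
  negLogLik <- function(par) {          # par = (intercept, slope, log sigma)
    mu <- par[1] + par[2] * x
    -sum(dnorm(y, mu, exp(par[3]), log = TRUE))
  }
  optim(c(0, 0, 0), negLogLik)$par[1:2] # ~ the same as coef(lm(y ~ x))
  coef(lm(y ~ x))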
The observations in
3.5 and 3.6
are examples of censored 'survival' data.
One of the points of JH's choice of the 'tumbler' longevity data
is that it shows that there are (at least)
two equivalent ways of
deriving a likelihood (a small sketch after the two bullets illustrates their equivalence).
* One can
regard the frequencies as the realization of a single
multinomial r.v., just as in examples 3.1 and 3.2.
In this approach, all of the 'bin' probabilities
must add to 1, so we can think of it
as an 'unconditional' approach.
* By focusing on week-specific 'failure' rates
or 'hazard' rates,
one can proceed week by week, and treat each week
as a separate (conditional) binomial.
In this approach, each conditional probability
and its complement add to 1 within each week,
but across weeks, the week-specific failure probabilities
are not constrained to add
to 1.
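A small sketch of the equivalence (made-up frequencies, and, purely for
simplicity, a constant weekly hazard h; the exercise's model need not be this
simple):

  ## two routes to the same log-likelihood
  d <- c(5, 3, 2)                  # deaths in weeks 1, 2, 3
  n <- c(20, 15, 12)               # numbers at risk at the start of each week
  s <- 10                          # still alive after week 3
  ll.multinomial <- function(h) {  # 'unconditional': bin probabilities add to 1
    p.die <- (1 - h)^(0:2) * h     # P(die in week k), k = 1, 2, 3
    sum(d * log(p.die)) + s * log((1 - h)^3)
  }
  ll.conditional <- function(h)    # week-by-week (conditional) binomials
    sum(d * log(h) + (n - d) * log(1 - h))
  ll.multinomial(0.2) - ll.conditional(0.2)   # 0, for any h in (0,1)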
The data in 3.6 have a very special structure:
every
observation is either right-censored (if
the person is pulled out alive)
or left-censored (if pulled out dead).
The same kind of 'current-status' data arise
if we conduct a maternal-and-child survey, and ask
(i) the infant's age and (ii) whether the
infant is still being breast-fed, and wish
to convert the responses into a 'still-breast-fed'
curve that shows, for each week or month of age,
what proportion of infants are still being breast-fed at that age.
In the ML approach, each likelihood contribution
is the integral of the pdf from either 0 to
't' or from 't' to Inf.
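A sketch of the resulting log-likelihood (the exponential model and the data
are made up for illustration):

  ## current-status log-likelihood; model and data hypothetical
  t    <- c(0.5, 1.0, 1.5, 2.0, 3.0)  # times at extraction
  dead <- c(0,   0,   1,   1,   1)    # 1 = pulled out dead (left-censored)
  logLik <- function(rate) {
    F <- pexp(t, rate)                # P(death time <= t)
    sum(dead * log(F) + (1 - dead) * log(1 - F))
  }
  optimize(logLik, c(0.01, 10), maximum = TRUE)$maximum  # MLE of the rate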
As we will discuss further in class, it is not easy
to come up with a single sensible pdf for the avalanche
data, since there are at least 3 separate causes of death,
such as trauma, asphyxia, and hypothermia, and
the dataset does not distinguish them. But, to keep
it simple, for the purposes of this chapter, we will
adopt a single 1- or 2-parameter distribution.
In
3.7 note JH's suggestion
to re-parametrize the proportions so that they always stay within
their bounds.
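One common such re-parametrization (a sketch, with made-up data) is the
logit: maximize over an unconstrained beta and recover p = plogis(beta),
which cannot leave (0, 1):

  ## keep a proportion inside (0,1) by searching on the logit scale
  ## hypothetical data: 3 'successes' in 12 trials
  logLik <- function(beta) {
    p <- plogis(beta)   # exp(beta)/(1 + exp(beta)), always in (0,1)
    3 * log(p) + 9 * log(1 - p)
  }
  plogis(optimize(logLik, c(-10, 10), maximum = TRUE)$maximum)  # ~ 3/12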
3.8 The dilution series is a very good example
where we do not have any obvious and easy estimator, the way we often do for
LS. This was one of Fisher's very first (and very compelling) uses of ML estimation.
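In the classical formulation (a sketch; the volumes and counts below are made
up, and the exercise's setup may differ in its details), the number of
organisms in a sample is Poisson, so a plate is sterile with probability
exp(-theta * volume), and the sterile/fertile counts give a binomial
likelihood in theta:

  ## dilution-series log-likelihood; dilutions and counts hypothetical
  vol     <- c(1, 0.5, 0.25, 0.125)  # relative volumes in the dilution series
  n       <- c(5, 5, 5, 5)           # plates per dilution
  sterile <- c(0, 1, 3, 4)           # sterile plates observed
  logLik <- function(theta) {        # theta = organisms per unit volume
    p <- exp(-theta * vol)           # P(sterile) = P(0 organisms) under Poisson
    sum(sterile * log(p) + (n - sterile) * log(1 - p))
  }
  optimize(logLik, c(0.01, 50), maximum = TRUE)$maximum  # MLE of theta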