- Discussion of issues
in JH's
Notes and assignment on
Likelihood estimation

Answers to be handed in for: (Supplementary) Exercises 3.1, 3.2, 3.3, 3.4, 3.5, 3.6(PhD only), 3.7, 3.8, 3.9, 3.10(PhD only), 3.12, 3.13(PhD only).

**Remarks on Notes**:

These notes were developed to supplement the Clayton and Hills chapter, which was aimed at epidemiologists, and which does not give the derivations (the 'wiring' and 'theory') below the results (the user's view of the car).

Even if you have covered likelihood, maximum likelihood (ML) and maximum likelihood estimators (MLEs) in your previous mathematical-statistics courses, it is unlikely that they were presented to you in the way JH introduces them here, through this series of exercises. These exercises are designed to reinforce the point that a likelihood is (proportional to) the probability, viewed as a function of theta, of observing the data or outcomes we did observe, and that the component probabilities in the product (we sum their logs to obtain an overall log-likelihood) are not always of the simple form pdf(y). Many of JH's examples involve 'binned' data, where the corresponding probabilities cannot be accurately approximated by rectangles, i.e., by pdfs at the data values (the y's) multiplied by a delta-y; instead, they are calculated as integrals. And in some cases, as when we have right- or left-censored data, the integral is over an open-ended interval. A big attraction of the likelihood approach is its ability to obtain information on a parameter by combining information of different types. Another is that the process of finding the MLE numerically involves the curvature of the log-likelihood function, which provides a direct way to calculate the estimated (co)variance of the estimator.

The exercises are also designed to reinforce the point that not all MLEs have a closed form.

JH will (and you should) try to distinguish between the estimator (a procedure, formula, etc.) and the estimate, the numerical result of applying this estimator to data. Of course, the abbreviation MLE stands for both the estimator and the estimate.

**Remarks on assigned exercises**.

**3.1 Daniel Bernoulli's example.**

See Notes and JH Presentation on 1st Pillar of Wisdom and Fitting of Early Error Distributions. Apart from the intent to bring in a bit of history, this is also an example where someone has worked out the likelihood and maximized it by brute force, but you can simply use **R**. From what JH gathers from others, **optimize** is recommended for 1-D problems, and **optim** for higher-dimensional problems.

In case it isn't clear from Bernoulli's written description of his 'error function', here is a picture of it (with r taken as 1). The graph also shows the log-likelihood calculations at 9 different values of the parameter. JH has used theta to denote the parameter -- the (unknown) center of the semi-circular distribution.

Note also that in the R code used to produce these calculations and graphs, the pdf is appropriately 'normalized' so that the area under it is 1. The area of a semi-circle with radius 1 is pi/2. So the proper pdf(y) is (2/pi)*sqrt(1 - (y-theta)^2 ) for y within 1 of theta, and zero elsewhere.

As it happens, this constant does not involve theta, and that is why Bernoulli does not bother with it when trying to find what value of theta makes the product of the 3 probabilities as large as possible.
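To make the brute-force approach concrete, here is a minimal sketch of evaluating and maximizing the semicircular log-likelihood over a grid of theta values (in Python rather than R; R's `optimize` plays the same role as the grid search below). The three observations are hypothetical, not Bernoulli's actual values.

```python
import math

def semicircle_pdf(y, theta, r=1.0):
    """Normalized semicircular density: dividing by the half-disc area
    pi*r^2/2 makes the curve integrate to 1."""
    u = y - theta
    if abs(u) >= r:
        return 0.0
    return (2.0 / (math.pi * r**2)) * math.sqrt(r**2 - u**2)

def log_lik(theta, obs):
    """Sum of log densities; -inf if any observation is outside the support."""
    total = 0.0
    for y in obs:
        f = semicircle_pdf(y, theta)
        if f == 0.0:
            return -math.inf
        total += math.log(f)
    return total

# Three hypothetical observations (not Bernoulli's actual values).
obs = [0.2, 0.4, 1.0]

# Evaluate the log-likelihood at many candidate centres and keep the
# 'least negative' one -- the brute-force strategy described above.
grid = [i / 1000 for i in range(0, 1001)]   # theta from 0.0 to 1.0
mle = max(grid, key=lambda t: log_lik(t, obs))
```

Note that any theta leaving an observation outside the semicircle gets log-likelihood -inf, so such values can never win the search.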

Notice also that the density itself is not a probability, and that the probability of observing a value of exactly 0.2 is zero. The probability of a '0.2' is really the probability of observing a value that, WHEN ROUNDED or BINNED, is 0.2. Assuming the bin width is 0.1, this probability is close to the area of the rectangle with height density(0.2) and width 0.1. Again, like the other constant, the 3 base widths drop out when we focus on the part of the log-likelihood that involves theta.

While it would have been tempting to show the densities at the 3 observed values, JH has instead calculated and shown their 3 logs, since our focus is always on the log-likelihood, and never on the likelihood itself. Indeed, Fisher's 1912 article did not even show the likelihood: he started right away with the sum of the logs. Also, it is easier to add their logs than to multiply the densities. The only downside is that the sum of the logs is often negative, and so to find where the log-likelihood is maximum, you have to find where the sum of the logs is 'least negative.'

Lastly, given this diagram and way of proceeding, you can now easily imagine replacing it with the one for Laplace's first error model, or his second one (the same as Gauss's one).

**3.2 Fisher's example of measurement errors.**

The measurement errors are grouped into classes (bins). Usually we use bins of our own choosing, for convenience, e.g., age bins 15-20, 20-25, etc., or annual income brackets $20K-$40K, $40K-$60K, etc., sometimes open-ended.

What's different here is that, because Fisher is dealing with a 'symmetric about 0 Normal distribution', he puts the errors between -1 and 0 in the same bin as those between 0 and 1; likewise he puts the errors between -2 and -1 in the same bin as those between 1 and 2. He is cutting down on his numerical work, using absolute errors, since the expected proportion in a 'left of 0' 1-unit wide bin should be the same as the proportion in its 'right of 0' counterpart.

In this example, the frequency data are multinomial, so there is one joint probability, involving the usual product of probabilities, each one an integral. Its log will be a sum, but with some dependence. But, in other cases (e.g., the dilution experiment, or meta-analysis), we merge likelihoods from independent datasets or observations.

Usually, when we learn likelihood, each bin is very narrow (e.g., someone's height or weight is recorded to the nearest cm or kg), so the probability mass in the bin is closely approximated by pdf(mid-bin) x the width of the bin = pdf(y) x dy. But this is not the case here: while you might get a reasonable approximation to some bin probabilities using the area of a rectangle, it won't work very well for others, and it won't work at all for the (effectively) open-ended bins.
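As a small numerical check of this point, the sketch below compares the exact bin probability under a N(0, sigma^2) model (an integral, computed via the normal CDF) with the rectangle shortcut. The value sigma = 1.2 and the bin limits are illustrative choices, not numbers from Fisher's example.

```python
import math

def phi(z):
    """Standard-normal CDF, via the error function (standard library only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(y, sigma):
    return math.exp(-0.5 * (y / sigma)**2) / (sigma * math.sqrt(2.0 * math.pi))

def exact_bin_prob(a, b, sigma):
    """P(a < Y < b) for Y ~ N(0, sigma^2): the integral of the pdf."""
    return phi(b / sigma) - phi(a / sigma)

def rectangle_approx(a, b, sigma):
    """pdf(mid-bin) times bin width -- the usual 'narrow bin' shortcut."""
    return norm_pdf((a + b) / 2.0, sigma) * (b - a)

sigma = 1.2
narrow = (0.0, 0.1)   # narrow bin: rectangle is an excellent approximation
wide = (1.0, 2.0)     # unit-wide bin: noticeably worse

err_narrow = abs(rectangle_approx(*narrow, sigma) - exact_bin_prob(*narrow, sigma)) / exact_bin_prob(*narrow, sigma)
err_wide = abs(rectangle_approx(*wide, sigma) - exact_bin_prob(*wide, sigma)) / exact_bin_prob(*wide, sigma)

# For the open-ended bin (2, Inf) the rectangle idea breaks down entirely;
# only the integral 1 - Phi(2/sigma) makes sense.
open_tail = 1.0 - phi(2.0 / sigma)
```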

I expect you will realize that there is no closed-form expression for the MLE of sigma-squared or sigma, so you can instead plot the log-likelihood and visually find the MLE. You might find the log-likelihood expressed in terms of log(sigma-squared) better behaved than the one expressed in terms of sigma-squared itself. And, of course, you can search in the sigma or sigma-squared scale.
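A grid-search sketch of this 'plot and look' strategy, with made-up counts of absolute errors in the bins [0,1), [1,2) and [2, Inf) (not Fisher's data):

```python
import math

def phi(z):
    """Standard-normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical counts of absolute errors per bin -- just to show the mechanics.
counts = [60, 30, 10]

def bin_probs(sigma):
    """Multinomial cell probabilities for |Y|, Y ~ N(0, sigma^2):
    each is an integral of the folded-normal density, not a rectangle."""
    p1 = 2.0 * (phi(1.0 / sigma) - 0.5)
    p2 = 2.0 * (phi(2.0 / sigma) - phi(1.0 / sigma))
    p3 = 1.0 - p1 - p2                  # open-ended tail bin (2, Inf)
    return [p1, p2, p3]

def log_lik(sigma):
    return sum(n * math.log(p) for n, p in zip(counts, bin_probs(sigma)))

# No closed form for the MLE here, so scan a grid and keep the best value;
# plotting log_lik over the same grid shows the curvature near the peak.
grid = [0.5 + i * 0.001 for i in range(2501)]   # sigma from 0.5 to 3.0
sigma_hat = max(grid, key=log_lik)
```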

Later on, we will use more sophisticated numerical ways to find the MLE, and at the same time to calculate the curvature, and thus, the precision of the estimate. This iterative method is especially valuable when the parameter is multi-dimensional.

For **3.3 (Galton's data on the speed of pigeons)**, most bins are 100 yards/min wide, but the '-500' bin extends all the way from 0 to 500 (technically, since we are dealing with a Normal distribution, from -Inf to 500); in practice, all of these racing pigeons, even the older ones, have speeds well above 0.

Even though they do not involve time (the subject of survival analyses), the data in exercises 3.2 and 3.3 are in fact 'interval-censored'.

In **3.4** (déjà vu) the 2 datapoints are independent, so the log-likelihood is a sum of two independent log-probabilities. In this case, you already obtained a closed-form estimator. One of the points of this question is to emphasize that for ML estimation, one needs a fully specified distributional form for each observation. In contrast, the LS estimator did not require a distributional form for the errors about the 'line of means'; indeed, LS is a purely numerical line-fitting approach, and doesn't rely on any statistical assumptions. It is only if we wish to describe the statistical behaviour of the LS estimator that we need to specify a complete model.

The observations in **3.5** and 3.6 are examples of censored 'survival' data. One of the points of JH's choice of the 'tumbler' longevity data is that it shows that there are (at least) two equivalent ways of deriving a likelihood.

* One can regard the frequencies as the realization of a single multinomial r.v., just as in examples 3.1 and 3.2. In this approach, all of the 'bin' probabilities must add to 1, so we can think of it as an 'unconditional' approach.

* By focusing on week-specific 'failure' rates or 'hazard' rates -- one can proceed week by week, and treat each week as a separate (conditional) binomial. In this approach, each conditional probability and its complement add to 1 within each week, but across weeks, the week-specific failure probabilities are not constrained to add to 1.
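The equivalence of the two constructions can be checked numerically. The sketch below uses made-up 'tumbler'-style weekly death counts and arbitrary hazard values, not the actual data:

```python
import math

# Hypothetical data: 20 animals followed for 3 weeks, deaths per week,
# with 20 - (5+4+3) = 8 still alive (right-censored) at the end.
deaths = [5, 4, 3]
n_start = 20

def multinomial_loglik(h):
    """Unconditional view: one multinomial over the 4 outcomes
    'died in week 1/2/3' or 'survived all 3 weeks'."""
    p, surv = [], 1.0
    for ht in h:
        p.append(surv * ht)        # P(die in week t) = S(t-1) * hazard_t
        surv *= (1.0 - ht)
    p.append(surv)                  # P(survive all weeks); cells sum to 1
    counts = deaths + [n_start - sum(deaths)]
    return sum(c * math.log(pt) for c, pt in zip(counts, p))

def binomial_loglik(h):
    """Conditional view: week by week, deaths among those still at risk."""
    total, at_risk = 0.0, n_start
    for d, ht in zip(deaths, h):
        total += d * math.log(ht) + (at_risk - d) * math.log(1.0 - ht)
        at_risk -= d
    return total

h = [0.25, 0.27, 0.3]   # arbitrary week-specific hazards
# The two log-likelihoods are algebraically identical, term by term.
```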

The data in 3.6 have a very special structure: every observation is either right-censored (if the person is pulled out alive) or left-censored (if pulled out dead). The same kind of **'current-status'** data arise if we conduct a maternal-and-child survey, ask (i) the infant's age and (ii) whether the infant is still being breast-fed, and wish to convert the responses into a 'still-breastfeeding' curve that shows, for each week or month of age, what proportion of infants are still being breast-fed at that time point.

In the ML approach, each likelihood contribution is the integral of the pdf from either 0 to 't' or from 't' to Inf. As we will discuss further in class, it is not easy to come up with a single sensible pdf for the avalanche data, since there are at least 3 separate causes of death (trauma, asphyxia, and hypothermia), and the dataset does not distinguish them. But, to keep it simple, for the purposes of this chapter, we will adopt a single 1- or 2-parameter distribution.
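A sketch of the current-status likelihood, using a deliberately simple exponential time-to-death model and made-up (time, status) pairs rather than the avalanche data: each 'dead' record contributes log F(t) (left-censored) and each 'alive' record contributes log S(t) (right-censored).

```python
import math

# Hypothetical current-status data: (hours buried, found dead?).
data = [(0.5, False), (1.0, False), (1.5, True), (2.0, True), (3.0, True),
        (0.8, False), (2.5, True)]

def log_lik(lam):
    """Exponential time-to-death model, chosen only for simplicity.
    Dead at t  -> left-censored:  log F(t) = log(1 - exp(-lam*t))
    Alive at t -> right-censored: log S(t) = -lam*t"""
    total = 0.0
    for t, dead in data:
        if dead:
            total += math.log(1.0 - math.exp(-lam * t))
        else:
            total += -lam * t
    return total

# No closed-form MLE; scan a grid of rates.
grid = [0.01 + i * 0.01 for i in range(500)]   # lambda from 0.01 to 5.0
lam_hat = max(grid, key=log_lik)
```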

In **3.7**, note JH's suggestion to re-parametrize the proportions so that they always stay within their bounds.
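One standard way to do this (a logit reparametrization; check the notes for whether it is the exact one JH has in mind) is to search over an unconstrained parameter and map it back to a proportion:

```python
import math

def expit(beta):
    """Map an unconstrained real number to a proportion in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-beta))

def logit(p):
    """Inverse map: proportion back to the unconstrained scale."""
    return math.log(p / (1.0 - p))

# A numerical search can roam over the whole real line in beta, yet the
# implied proportion never escapes its bounds:
for beta in [-30.0, -2.0, 0.0, 2.0, 30.0]:
    p = expit(beta)
    assert 0.0 < p < 1.0
```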

**3.8 Dilution series** is a very good example where we do not have any obvious and easy estimator, the way we often do for LS. This was one of Fisher's very first (and very compelling) uses of ML estimation.
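A hedged sketch of the usual Poisson model for a dilution series: assuming organisms are distributed at random, a plate at 2-fold dilution level k is sterile with probability exp(-rho / 2^k), where rho is the expected count at level 0. The plate counts below are invented for illustration.

```python
import math

plates = 5                                  # plates per dilution level
sterile = {0: 0, 1: 0, 2: 1, 3: 3, 4: 4}    # level k -> sterile plates (made up)

def log_lik(rho):
    """Binomial likelihood at each level, with
    P(sterile at level k) = exp(-rho / 2**k)."""
    total = 0.0
    for k, s in sterile.items():
        p_sterile = math.exp(-rho / 2**k)
        total += s * math.log(p_sterile) + (plates - s) * math.log(1.0 - p_sterile)
    return total

# No obvious closed-form estimator: scan a grid for the MLE of rho.
grid = [0.1 + i * 0.01 for i in range(2000)]   # rho from 0.1 to ~20
rho_hat = max(grid, key=log_lik)
```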

**3.9 Pooled Testing** is another example where we do not have an obvious estimator.
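A sketch under the usual assumptions (pools of equal size k, a pool tests positive iff at least one member is positive, a perfect test); the numbers are hypothetical, not data from the exercise.

```python
import math

# Hypothetical data: 40 pools of k = 10 samples each; 12 pools test positive.
n_pools, k, positives = 40, 10, 12

def log_lik(p):
    """Binomial likelihood for the pool results, where
    P(pool positive) = 1 - (1 - p)^k and p is the individual prevalence."""
    p_pool = 1.0 - (1.0 - p)**k
    return (positives * math.log(p_pool)
            + (n_pools - positives) * math.log(1.0 - p_pool))

# Numerical search...
grid = [i / 10000 for i in range(1, 2000)]    # p from 0.0001 to ~0.2
p_hat = max(grid, key=log_lik)

# ...which, for equal-sized pools, agrees with the closed-form MLE:
p_closed = 1.0 - (1.0 - positives / n_pools)**(1.0 / k)
```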

**3.10 Accuracy in Throwing Darts** As JH says in the preamble to the questions, the article by Tibshirani et al. is a delightful teaching article that brings in several topics.

**Part 1** of the question brings us back to the sum of the squares of 2n independent N(0,1) random variables. So, knowing what its distribution is, it is easy to calculate its expected value, and to equate this with the sum of the squares of the 2n observed errors. This is the method of moments.

Here, with all of the squares rolled into 1 sum or mean, we have one observation (the mean squared error) and 1 parameter (sigma-squared).

So, there is no 'conflict.' This is unlike our '2 datapoints and a model' example where we had 2 'conflicting' observations and 1 parameter, so we had to take some intermediate value as a 'best' compromise (which intermediate value depends on the model and the 'loss' function).

If we choose the LS criterion in the '1 data point, 1 parameter' situation of the errors in darts, then we can make the squared difference between the observed mean and the fitted sigma-squared equal to zero.

Similarly, since we have just 1 observation from a scaled chi-squared distribution, it is obvious that the likelihood (or log-likelihood) is maximized by taking the MLE of sigma-squared to be the observed mean squared error.
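A numerical check of that claim: for a single observation S with S/sigma^2 ~ chi-square(2n), i.e. S ~ Gamma(shape n, scale 2*sigma^2), the MLE of sigma-squared coincides with the method-of-moments value S/(2n). The values of n and S below are made up.

```python
import math

n = 25            # n darts, so 2n = 50 squared errors
S = 60.0          # hypothetical observed sum of squared errors

def log_lik(sig2):
    """Log density of S under the scaled chi-square (gamma) model:
    S ~ Gamma(shape n, scale 2*sig2)."""
    return (-n * math.log(2.0 * sig2) - math.lgamma(n)
            + (n - 1) * math.log(S) - S / (2.0 * sig2))

mme = S / (2 * n)   # method of moments: E[S] = 2n * sigma^2

# Grid search confirms that the log-likelihood peaks at the same value.
grid = [0.2 + i * 0.0001 for i in range(50000)]
mle = max(grid, key=log_lik)
```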

For the second last part of Q1, we can apply the same 'trial and error' approach we used with the Clopper-Pearson CI for the parameter of a binomial.
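A sketch of that trial-and-error inversion for the scaled chi-square case: bisect on sigma-squared until the tail probability hits 0.025 or 0.975. (For even degrees of freedom the chi-square CDF has a closed form, used below; n and S are made-up values.)

```python
import math

n = 25       # 2n = 50 squared errors
S = 60.0     # hypothetical observed sum of squares; S/sigma^2 ~ chi-sq(2n)

def chisq_cdf_even(x, df):
    """Chi-square CDF for EVEN df, via the Erlang/Poisson identity:
    P(X <= x) = 1 - exp(-x/2) * sum_{j < df/2} (x/2)^j / j!"""
    m = df // 2
    term, total = 1.0, 1.0
    for j in range(1, m):
        term *= (x / 2.0) / j
        total += term
    return 1.0 - math.exp(-x / 2.0) * total

def coverage(sig2):
    """P(chi-sq(2n) <= S/sig2): near 1 for small sig2, near 0 for large."""
    return chisq_cdf_even(S / sig2, 2 * n)

def solve(target, lo=0.01, hi=100.0):
    """Trial and error (bisection) for the sig2 whose coverage equals target."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if coverage(mid) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lower, upper = solve(0.975), solve(0.025)   # 95% CI endpoints for sigma^2
```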

The last part of Q1 goes back to the chi-sq distribution with 2 df: each pair of squared errors can be grouped so that its sum has an exponential distribution. The sum of these n pairwise sums then has a gamma distribution with shape parameter n.

The square root of the chi-sq distribution with 2 df has a lot of practical applications, and even has its own name: the Rayleigh Distribution. According to Wikipedia, the distribution is named after Lord Rayleigh.

In other years, JH asked students to use the EM algorithm to estimate the parameter sigma-squared. Indeed, Tibshirani's main purpose was to show the EM algorithm in action, and the darts example is a very nice one, because the scores give you incomplete information on where the dart landed. For example, a score of 18 might have been a single 18, a double 9, or a triple 6: several landing segments lead to the same score.

The EM algorithm begins with a guessed-at value for the parameter governing where the incomplete data come from. Then, using this, the 'E' step works out what the **e**xpected value of the 'complete-data' log-likelihood would be, and the 'M' step **m**aximizes this to get a new sigma-squared estimate. The process continues until the estimates no longer change.

It might be easier to explain this in the case of Fisher's binned-errors example. One could start by putting all the observations at the mid-points of their bins, and then estimate sigma-squared. One could then use this to compute what the expected value of the 'complete-data' log-likelihood would be, and what that implies about the locations of the complete data. You would imagine that it moves them inwards, i.e., towards the left boundary of each bin. The expected values of these are used to compute a new parameter estimate, and so on.

It turns out that this is not the same as simply moving the y's to the left: as you will see in the R code, the complete observations would have contributed to the log-likelihood through their squares. So in fact we need to know where the new 'centre of gravity' is, not of the values, but of their squares.
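A sketch of the E and M steps for this binned-errors setting (mean known to be 0; the bin counts are hypothetical). The E-step uses the truncated-normal identity for the conditional mean of the squares, i.e., the 'centre of gravity' of the squares rather than of the values:

```python
import math

def phi_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def phi_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical binned absolute errors (made-up counts, not Fisher's data).
bins = [(0.0, 1.0), (1.0, 2.0), (2.0, math.inf)]
counts = [60, 30, 10]
N = sum(counts)

def e_y2_given_bin(a, b, sig2):
    """E-step ingredient: E[Y^2 | a < |Y| < b] for Y ~ N(0, sig2),
    from the identity  int_a^b z^2 phi(z) dz
                       = (Phi(b) - Phi(a)) - (b phi(b) - a phi(a))."""
    s = math.sqrt(sig2)
    alpha, beta = a / s, b / s
    mass = phi_cdf(beta) - phi_cdf(alpha)
    top = (0.0 if beta == math.inf else beta * phi_pdf(beta)) - alpha * phi_pdf(alpha)
    return sig2 * (1.0 - top / mass)

sig2 = 4.0                          # deliberately bad starting value
for _ in range(2000):
    # E-step: expected complete-data sum of squares, given current sig2
    ss = sum(n * e_y2_given_bin(a, b, sig2) for (a, b), n in zip(bins, counts))
    # M-step: with the mean known to be 0, the complete-data MLE of
    # sigma-squared is just the mean square
    sig2 = ss / N
```

At convergence the estimate is a fixed point: re-running the E-step at the final sig2 returns that same sig2, which is the stationarity condition of the binned log-likelihood.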

**Part 2** There is a reason JH brings up the Rayleigh distribution: it describes the Euclidean distance of the dart from the centre of the target. And the rings on the dart board divide this radius into 7 regions, whose probability masses are governed entirely by the parameter sigma-squared. We thus have a multinomial distribution with these 7 probabilities; the problem is very much like Fisher's binned-errors one, except that the widths of the bins here are set by the person who designed the dart board (see the Tibshirani paper, or JH's R code, for the spacings).

**3.11 EM for 'heaped' data.**

**3.12 Age-at-menarche data.**

A very nice example of binned (censored) data, and of how easy it is nowadays to fit various models -- once one is able to write down the generic likelihood.
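A sketch of fitting one such model: a logistic curve for P(reached menarche by age a), fitted to made-up current-status counts by a crude 2-D grid search (R's optim would be the natural tool). Each girl's response is left-censored (already reached) or right-censored (not yet).

```python
import math

# Hypothetical data: at each age, how many of 100 girls surveyed have
# already reached menarche (invented counts, not the exercise's data).
ages = [10, 11, 12, 13, 14, 15, 16]
reached = [5, 12, 27, 50, 73, 88, 95]
n_per = 100

def F(a, mu, s):
    """Logistic model for P(menarche by age a), centre mu, scale s."""
    return 1.0 / (1.0 + math.exp(-(a - mu) / s))

def log_lik(mu, s):
    # 'reached' contributes log F(a); 'not yet' contributes log(1 - F(a))
    return sum(y * math.log(F(a, mu, s)) + (n_per - y) * math.log(1.0 - F(a, mu, s))
               for a, y in zip(ages, reached))

# Crude 2-D grid search over (mu, s).
best = max(((mu, s) for mu in [11 + i * 0.05 for i in range(81)]
                    for s in [0.3 + j * 0.05 for j in range(41)]),
           key=lambda ms: log_lik(*ms))
mu_hat, s_hat = best
```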

**3.13 Coarsened ('suppressed') count data.**

Same 'old', except that most bins are just 1 unit wide, whereas the bin with coarsened counts spans the counts from 1 to 4 inclusive.

Here, since there are no covariates distinguishing the different observations, the amount of computation can be reduced if, for the observations with exact counts, one first adds up the numerators and the PY denominators, i.e., collapses the exact observations into a single observation. However, it will be good practice for you to have each observation contribute separately to the log-likelihood.
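A sketch of the coarsened-count likelihood under a Poisson model, with invented (count, person-years) records; the suppressed cell contributes log P(1 <= Y <= 4):

```python
import math

def pois_logpmf(y, mu):
    """Log of the Poisson pmf with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

# Hypothetical records: (deaths, person-years); None marks the suppressed
# cell, known only to contain a count between 1 and 4 inclusive.
records = [(12, 1000.0), (7, 600.0), (0, 150.0), (None, 300.0)]

def log_lik(rate):
    total = 0.0
    for y, py in records:
        mu = rate * py
        if y is None:
            # coarsened contribution: log P(1 <= Y <= 4) = log of a SUM
            # of pmf values, not a single pmf value
            total += math.log(sum(math.exp(pois_logpmf(k, mu)) for k in range(1, 5)))
        else:
            total += pois_logpmf(y, mu)
    return total

grid = [0.0005 + i * 0.00001 for i in range(5000)]   # rate per person-year
rate_hat = max(grid, key=log_lik)
```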