BIOS601 AGENDA: Monday September 25 and Wednesday September 27, 2017

[updated Sept 22, 2017]


Discussion of issues in JH's Notes and assignment on intensity rates: models

Answers to be handed in for: Exercises 0.1, 0.2, 0.6, 0.7, 0.9, 0.10, 0.12, 0.13 + (if in PhD program) 0.3, 0.5, 0.11

Remarks on Notes:

These notes are based on those JH developed for the courses EPIB626: Risks and Hazards and EPIB634, which he taught in the epidemiology graduate program. As such, they emphasize the 'end-product', rather than how the product was arrived at. In BIOS601, we will emphasize both.

The Poisson distribution, one of JH's favourite distributions, is at the core of this very epidemiologic topic. It is of course widely used in Physics (e.g., in counting particles), and the Rolling in the Higgs (Adele Parody) song referred to "a new data-peak at 125 GeV" (cf. Figure on page 4). Given that it is so widely used in biostatistics and epidemiology, it is surprising that some biostatistician/epidemiologist hasn't composed a song about it.

The recent science news from Iceland, "Older dads linked to rise in mental illness" and "Fathers bequeath more mutations as they age: Genome study may explain links between paternal age and conditions such as autism", reported in the Nature article "Rate of de novo mutations and the importance of father's age to disease risk" (Supp. Info), is just the most recent and striking epidemiologic/biologic example of the Poisson distribution. We will revisit it soon.

Section 1

Notation: JH will use mu as the expected value, just like he uses mu for the expected value of a Gaussian r.v.

Keep lambda for the rate per unit of experience. The product of lambda and the amount of experience is mu.
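
To make the notation concrete, here is a minimal sketch of mu = lambda x (amount of experience), followed by the Poisson probabilities at that mu. The rate and the person-years are hypothetical numbers chosen only for illustration, and poisson_pmf is a helper name introduced here, not one from the Notes.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson random variable with expected value mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)

# Hypothetical inputs, for illustration only.
rate_per_person_year = 0.002   # lambda: events per unit of experience
person_years = 1500.0          # amount of experience

# Expected count: mu = lambda * experience.
mu = rate_per_person_year * person_years

# Probabilities of observing 0, 1, ..., 6 events at this mu.
for y in range(7):
    print(y, round(poisson_pmf(y, mu), 4))
```

Doubling the experience at the same lambda doubles mu; lambda itself is unchanged.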

Make sure to look at some of the Examples, and run some of the Simulations, to be found under Resources.

1.1 (When it applies) is important. There are many misunderstandings about this.

Section 2. Inference

We will leave this until we have a common way of approaching point and interval estimation.

But, you will again see the CLT playing a central role.

And notice the use of the log(mu) scale when making inferences, and in Section 2.2.2, the delta method to get an approx. variance for the log of a count -- assuming the count isn't zero [useful in exercise 0.1!]
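
As a rough sketch of the log-scale approach: for a Poisson count y with expected value mu, the delta method gives Var[log y] of approximately (1/mu)^2 x mu = 1/mu, estimated by 1/y. The function name log_scale_ci, the z value of 1.96, and the count of 25 are assumptions made for this illustration.

```python
import math

def log_scale_ci(y, z=1.96):
    """Approximate CI for mu from an observed Poisson count y > 0.

    Works on the log scale, using the delta-method approximation
    SE[log(y)] ~ 1 / sqrt(y), then exponentiates back to the count scale.
    """
    if y <= 0:
        raise ValueError("the log-scale interval needs a non-zero count")
    se_log = 1.0 / math.sqrt(y)
    lower = math.exp(math.log(y) - z * se_log)
    upper = math.exp(math.log(y) + z * se_log)
    return lower, upper

low, high = log_scale_ci(25)   # hypothetical observed count
print(round(low, 1), round(high, 1))
```

The resulting interval is symmetric on the log scale, so it is multiplicatively (not additively) symmetric about the observed count.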

Section 2.1.3 and Fig 4 show again the role of the CLT -- if one accumulates a big enough count (by accumulating enough observation time, or 'volume of experience'), and the contributions are independent, the CLT kicks in. And the same 'rules of thumb' about being at least 5 or 10 at the (this time) lower edge are reminiscent of those for the Normal approximation to the binomial. Here, we work directly with the expected counts, rather than with n and pi separately.
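
One way to see the CLT "kicking in" is to compare the exact Poisson CDF with a Normal(mu, sqrt(mu)) approximation as mu grows. The sketch below is not taken from the Notes: the continuity correction, the range of y values searched, and the particular mu values are all choices made for this illustration.

```python
import math

def poisson_cdf(y, mu):
    """P(Y <= y), accumulating terms iteratively to avoid large factorials."""
    term = math.exp(-mu)
    total = term
    for k in range(1, y + 1):
        term *= mu / k
        total += term
    return total

def normal_cdf(x, mean, sd):
    """CDF of a Normal(mean, sd) distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def max_cdf_gap(mu):
    """Largest |exact Poisson CDF - Normal approximation| over y = 0, 1, ...,
    up to about 6 SDs above mu, using a continuity correction of 0.5."""
    sd = math.sqrt(mu)
    upper = int(mu + 6 * sd) + 1
    return max(
        abs(poisson_cdf(y, mu) - normal_cdf(y + 0.5, mu, sd))
        for y in range(upper + 1)
    )

for mu in (2, 5, 10, 50, 200):
    print(mu, round(max_cdf_gap(mu), 4))
```

The gap shrinks as the expected count grows, whether the larger mu comes from a higher rate or from more experience.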

Section 3. Applications, and Notes

The "How many must I count?" section [3.1] shows an important point about rates (or 'concentrations') made from counts: it's not the amount of experience that creates statistical (in)stability; it's the size of the numerator, i.e., the size of the count. Look at the widths of the CIs in the needle-stick injuries study in Table 1 in section 3.3.
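
The point about the numerator can be sketched numerically. Using the log-scale approximation SE[log(count)] ~ 1/sqrt(count), the ratio of the upper to the lower CI limit for a rate depends only on the count, not on the amount of experience. The function names and all numbers below are hypothetical and are not taken from the needle-stick study.

```python
import math

def rate_with_ci(count, experience, z=1.96):
    """Empirical rate and an approximate CI, built on the log scale.

    Assumes a Poisson count > 0 and uses SE[log(count)] ~ 1/sqrt(count).
    """
    rate = count / experience
    factor = math.exp(z / math.sqrt(count))
    return rate, rate / factor, rate * factor

def upper_to_lower_ratio(count, experience):
    """Relative width of the CI: upper limit divided by lower limit."""
    _, lower, upper = rate_with_ci(count, experience)
    return upper / lower

# Same count, very different amounts of experience: same relative width.
for experience in (100.0, 10_000.0):
    print(experience, round(upper_to_lower_ratio(20, experience), 3))

# Same experience, growing counts: the interval tightens.
for count in (5, 20, 80, 320):
    print(count, round(upper_to_lower_ratio(count, 1_000.0), 3))
```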

The "divisibility" of the experience that underlies a Poisson count is important: the same does not apply to the binomial, where typically the denominator is the number, i.e., the amount, of persons. In Poisson counts, the equivalent is the amount of person-time. Just as in the story of Solomon, who settled the 'child-ownership' dispute between the two women who each claimed to be the mother, persons are not divisible; but the amount of time we observe them is. But, no matter whether it's persons or person-time, the numerator (the count) is not infinitely divisible.

Section 3.6 (CI for an incidence density or rate):

Epidemiologists, and (applied) biostatisticians, are students of rates, not of the Poisson-distributed numerators that serve as inputs to the theoretical and empirical rates. You can think of an empirical rate as a transformed (or scaled) realization of a Poisson r.v. The scaling is so simple that the "delta" method is obvious and immediate.

And think of an incidence density as the epidemiologist's term for a rate or intensity.
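
The scaling can be written out as follows. The symbol PT (the amount of person-time, treated as a known constant) and the labels mu_L and mu_U for the limits of a CI for mu are notation assumed here for illustration.

```latex
\hat{\lambda} = \frac{Y}{PT}, \qquad Y \sim \mathrm{Poisson}(\mu = \lambda \cdot PT)

\mathrm{Var}[\hat{\lambda}] = \frac{\mathrm{Var}[Y]}{PT^{2}} = \frac{\mu}{PT^{2}} = \frac{\lambda}{PT}

\text{CI for } \lambda : \left( \frac{\mu_{L}}{PT}, \; \frac{\mu_{U}}{PT} \right)
```

Because PT is a constant, a CI for the rate is simply the CI for mu with both limits divided by PT.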

Remarks on assigned exercises

0.1 (m-s) Working with logs of counts and logs of rates

The log(rate) is absolutely central to epidemiologic data analysis, and so you need to be quite comfortable working in this scale, and then going back to the rate scale.

0.2 (m-s) The Poisson Family as a 'Closed under Addition' Family

This is a very important (but often overlooked) property. It is what allows epidemiologists to add the expected numbers of new cancers diagnosed in different age groups and compare this with the total number of cancers observed in these age groups. The age-structure of the source population is determined by many factors, and the cancer incidence rates are usually a strong function of age. Thus, their products (the mu's), and therefore the observed counts, in different age groups are likely to be very different from each other. But their sum is still a Poisson r.v. You will notice that in the study of childhood leukemia near nuclear plants in Ontario (section 3.2), the authors aggregated the numbers of cancers in the different age bins, as the age-specific numbers would be tiny and uninterpretable.
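
A numerical check of the closure property, as a sketch: convolving the distributions of independent Poisson counts with different mu's reproduces the Poisson distribution at the sum of the mu's. The three age-specific mu's are hypothetical, and the helper names are introduced only for this example.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson random variable with expected value mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)

def convolve(p, q):
    """Distribution of the sum of two independent counts, for totals
    0..len(p)-1. Each total k only needs terms with indices <= k, so
    truncating both inputs at the same maximum loses nothing."""
    return [sum(p[j] * q[k - j] for j in range(k + 1)) for k in range(len(p))]

# Hypothetical expected counts in three age groups.
age_specific_mus = [0.4, 1.1, 2.5]
max_y = 30

# Build the distribution of the total count by repeated convolution.
dist_of_sum = [poisson_pmf(y, age_specific_mus[0]) for y in range(max_y + 1)]
for mu in age_specific_mus[1:]:
    dist_of_sum = convolve(dist_of_sum, [poisson_pmf(y, mu) for y in range(max_y + 1)])

# Poisson distribution evaluated directly at the summed mu.
total_mu = sum(age_specific_mus)
direct = [poisson_pmf(y, total_mu) for y in range(max_y + 1)]

max_diff = max(abs(a - b) for a, b in zip(dist_of_sum, direct))
print(max_diff)   # differs from zero only by floating-point rounding
```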

0.3 (m-s) Link between Poisson and Exponential Distributions

Many Web articles and textbooks cover this topic. Give a reference for/link to your favourite clearly described derivation. There is no need to repeat all the algebra; instead, briefly describe the derivation in your own WORDS.

0.4 (m-s) Link between tail areas of Poisson and Chi-sq Distributions

0.5 (m-s) Fisher Information

The same random variable provides a different amount of information about one parameter than another. This is merely because we are in a different scale. And you can get there from first principles, or by the Delta Method.
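
The change of scale can be stated in general terms. For a one-to-one, differentiable reparametrization theta = g(mu) (generic notation assumed here, not taken from the exercise), the standard rule for Fisher information and the matching delta-method variance are:

```latex
I_{\theta}(\theta) = I_{\mu}(\mu) \left( \frac{d\mu}{d\theta} \right)^{2}, \qquad \theta = g(\mu)

\mathrm{Var}[g(\hat{\mu})] \approx \left( g'(\mu) \right)^{2} \, \mathrm{Var}[\hat{\mu}]
```

The two statements agree with each other: the information on the new scale is the reciprocal of the approximate variance on that scale.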

0.6 (m-s) the sixth decimal place

New this year. Stigler's father, mentioned in The American Statistician article, won the Nobel Prize in Economics.

0.11 Enough Coins?

The purpose here is to introduce the idea of a mixture of latent (or unrecognized) classes or subgroups. Here we had seven classes, each with its own mu. And the average (and the spread) of the 7 probability distributions is not necessarily that of the probability distribution at the average of the 7 mu's. It is easy to imagine a generalization to a larger set of mu's, affected (in traffic accident epidemiology) by many variables such as weather, unusual local and bigger patterns, circumstances we are not aware of, etc. And just as we have 'extra-binomial' variation, we often have 'extra-Poisson' variation. Sometimes (as in the births example) we know what causes it (and how to remove this 'noise'); sometimes we do not.
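
The extra-Poisson variation can be sketched with an equally weighted mixture of seven Poisson distributions. The seven mu's below are hypothetical and are not the values from the exercise. For a single Poisson the variance equals the mean, but for the mixture the variance is the mean plus the variance of the mu's.

```python
import math

def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson random variable with expected value mu."""
    return math.exp(-mu) * mu**y / math.factorial(y)

# Hypothetical mu's for seven latent classes, equally weighted.
class_mus = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0]
weights = [1.0 / len(class_mus)] * len(class_mus)
max_y = 80   # far enough into the tail that truncation is negligible

# Mixture distribution: weighted average of the seven Poisson distributions.
mixture = [
    sum(w * poisson_pmf(y, mu) for w, mu in zip(weights, class_mus))
    for y in range(max_y + 1)
]

mean = sum(y * p for y, p in enumerate(mixture))
variance = sum((y - mean) ** 2 * p for y, p in enumerate(mixture))
average_mu = sum(w * mu for w, mu in zip(weights, class_mus))

print(round(average_mu, 3), round(mean, 3), round(variance, 3))
```

With these numbers the mixture has the same mean as a single Poisson at the average mu, but a much larger variance, so the two distributions are not the same.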

0.12 and 0.13

Check back later ... JH plans to add some remarks