Discussion of issues in JH's Notes and assignment on models for a (sample) proportion

Answers to be handed in for: Exercises 0.1, 0.2, 0.5, 0.6, 0.7, 0.8, 0.10, 0.11, 0.13, 0.16

**Remarks on Notes**:

These notes are based on those JH developed for the course Principles of Inferential Statistics in Medicine, which he taught to incoming students in the epidemiology graduate program from 1980 to 1993. As such, they emphasize the 'end-product' rather than how the product was arrived at. In bios601, we will emphasize both. [One of the illustrations, via real data, was the frequency of duplicate birthdays in the various-sized classes he taught -- he started collecting data in 1981, when the class size was already 27!]

**Section 1** (Binomial) Model for (Sampling) Variability of a Proportion/Count in a Sample.

Think of a binomial random variable as the sum of n i.i.d. Bernoulli random variables with a common probability, pi, of a 'positive' outcome.
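If you want to convince yourself of this, a small simulation helps (a sketch in Python; the values n = 10 and pi = 0.3 are arbitrary choices, not from the notes): the count of 'positives' among n Bernoulli draws has exactly the Binomial(n, pi) frequencies.

```python
import math
import random

random.seed(1)
n, pi = 10, 0.3

# one binomial realization = the number of 'positives' among n i.i.d. Bernoulli(pi) trials
def one_count():
    return sum(1 for _ in range(n) if random.random() < pi)

sims = [one_count() for _ in range(100_000)]
sim_freq = sims.count(3) / len(sims)

# exact Binomial(n, pi) probability of exactly 3 positives
exact = math.comb(n, 3) * pi**3 * (1 - pi)**(n - 3)
print(round(sim_freq, 3), round(exact, 3))
```

The simulated relative frequency of '3 positives' should agree with the exact binomial probability (about 0.267 here) to within simulation error.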

And in applied work, don't use the terms 'success' and 'failure' that mathematical statisticians do; instead, be practical and speak/write generically of 'positive' versus 'negative', or better still, if the context is appropriate, of the presence/absence of a 'characteristic/trait/state of interest'.

Take note of the various notations.

You will be surprised how often the binomial arises in contexts other than the traditional ones. A common example is the comparison of two rates: by conditioning on the sum of the two numerators, we arrive at a binomial.
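To see the two-rates example concretely, here is a sketch (Python; the numerator means of 4 and 6, and the conditioning value of 10, are made-up figures): treating the two numerators as independent Poisson counts and keeping only the pairs whose sum is 10, the first count behaves like a Binomial(10, 4/(4+6)).

```python
import math
import random

random.seed(2)
lam1, lam2 = 4.0, 6.0   # hypothetical numerator means for the two rates

def poisson(lam):
    # Knuth's multiplication method (fine for small means)
    L, k, p = math.exp(-lam), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

pairs = [(poisson(lam1), poisson(lam2)) for _ in range(200_000)]

# condition on the sum of the two numerators being 10:
# the first count should then be Binomial(10, lam1/(lam1+lam2))
cond = [a for a, b in pairs if a + b == 10]
pi = lam1 / (lam1 + lam2)
exact = math.comb(10, 4) * pi**4 * (1 - pi)**6
sim = sum(1 for a in cond if a == 4) / len(cond)
print(round(sim, 2), round(exact, 2))
```

The conditional relative frequency of 'first count = 4' should match the Binomial(10, 0.4) probability (about 0.25) to within simulation error.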

The 'requirements' for a binomial are not as simple as they might appear, as we will see in some of the assignments. In particular, be clear about the difference between lack of independence (but with a common pi) and lack of a common pi (but with independence), i.e., between I.notI.d and notI.I.d !

JH finds the binomial tree (and generalizations of it) very helpful. If he had time to redo it in R (he made it with some very old software), JH would use an example other than the overused pi = 0.5 -- say, with the frequency of a certain outcome in tosses of thumbtacks, rather than coins, as the parameter of inference.

In going through the examples in 1.1, once you are sure the n is fixed ahead of time, go through the 'Indep. or not? Ident. or not?' checklist.

1.2 (Calculation) is not the big issue it was when JH began teaching.

**Section 2 (Inference)**. We will leave this until we have a common way of approaching point and interval estimation.

You have already dealt with the 'large n' situation in your survey sample of ocean versus land locations.

The "Exact, first-principles, Confidence Interval" construction shows that it is not that easy to treat frequentist confidence intervals separately from p-values: that one needs to be comfortable with p-values BEFORE being introduced to CI's, even though the modern tendency is to downplay p-values and to play up CI's (interval estimates).

The diagram in Figure 2 is worth studying. In addition to 'after the fact' calculations of precision and margins of error, JH uses it regularly in consultations when the question of sample size for a survey arises. We could use it to decide what precision we would get with estimates of the percentage of the world that is under water obtained with various sample sizes. Of course, ahead of time, just as you had to do with the negative binomial calculations, you would have to make a guess as to what the 'true' percentage is. But you can see the +/- 3 percentage points margin of error that the survey agencies use for all their surveys of size 1000 or so.

They are a bit lazy: they calculate the worst margin of error (i.e., the one at 50%) and use that for all situations, even when the percentage is closer to 0% (e.g., % of Canadians who have a PhD, or % of physicians over 60 who are female) or to 100% (e.g., % of Canadians who have a cell phone/computer, or % of time they spend indoors), where assuming 50% leads to an overestimate of the unit variance [ pi(1-pi) ]. Between 30% and 70%, the unit variance does not change a lot: at 0.7 or 0.3 it is 0.7 x 0.3 = 0.21, versus 0.5 x 0.5 = 0.25 at pi = 0.5. It is only when pi gets below, say, 0.2 or above 0.8 that the unit variance falls quickly. So, using the worst-case scenario is a safe, 'conservative' approach, in that the actual margin of error, calculated after the data are in, has to be smaller than the one based on 0.5 x 0.5.
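The arithmetic behind those margins of error is easy to reproduce (a Python sketch; the choice of pi values is illustrative): the half-width of a 95% interval for a survey of n = 1000 shrinks only slowly until pi leaves the 0.3-0.7 range.

```python
import math

# 95% margin of error (half-width, in percentage points) for a survey of n = 1000,
# under various assumed 'true' proportions pi
n = 1000
for pi in (0.5, 0.3, 0.1, 0.05):
    me = 1.96 * math.sqrt(pi * (1 - pi) / n)
    print(f"pi = {pi:4.2f}: +/- {100 * me:.1f} points")
```

The worst case (pi = 0.5) gives the familiar +/- 3 points; at pi = 0.3 it is still about +/- 2.8, but at pi = 0.05 it is down to about +/- 1.4.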

Method C in section 2 is the most logical and sensible one, and it ties in with the logit being the natural ('canonical') link when dealing with binomial regression via generalized linear models.

**Section 3**.

It took JH many years to find satisfactory answers and examples, for those who are baffled by these concepts, as to why the proportion of a population that is sampled is (usually) not as important as the number who are sampled.

Section 3.3 shows how far the media have come in 30 years!

BE VERY CAREFUL with WORDS

It is **meaningless** and incorrect to say "**this** sample is right 95% of the time." Instead, saying that "**95% of samples this size** are right" comes closer to what is claimed when we give a 95% assurance.

**Section 4**.

At this stage, the key item is 4.2, the Normal approximation to the Binomial. A useful way to think of the rules of thumb [ n.pi > 5 and n(1-pi) > 5, or n.pi > 10 and n(1-pi) > 10 ] is to imagine overlaying the normal curve on the (discrete) binomial distribution, so that not too much of one or other of the two tails flows 'out of bounds', i.e., gives a count or proportion < 0, or a count > n or a proportion > 1. The normal approximation works well if the normal curve's 'short tail' can be overlaid comfortably on the corresponding binomial tail.
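To see how the rule of thumb plays out numerically, here is a sketch (Python; the case n = 50, pi = 0.1, which just meets n.pi = 5, is an illustrative choice) comparing an exact binomial tail with its continuity-corrected normal approximation.

```python
import math

def binom_cdf(k, n, pi):
    # exact P(X <= k) for X ~ Binomial(n, pi)
    return sum(math.comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k + 1))

def norm_cdf(x):
    # standard normal CDF via the error function
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# a case that just meets the n*pi > 5 rule of thumb
n, pi, k = 50, 0.1, 2
exact = binom_cdf(k, n, pi)
# continuity-corrected normal approximation to P(X <= k)
approx = norm_cdf((k + 0.5 - n * pi) / math.sqrt(n * pi * (1 - pi)))
print(round(exact, 3), round(approx, 3))
```

Here the exact tail is about 0.112 versus about 0.119 from the approximation -- usable, but not perfect, which is just what the borderline rule of thumb suggests.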

**Section 5 (Sample size, precision/power)**. We will leave this until we cover a common (generic) way of approaching these issues.

**Remarks on assigned exercises**.

**0.1 (m-s) Working with logits and logs of proportions**. The logit is absolutely central to epidemiologic data analysis, and so you need to be quite comfortable working in this scale, and then going back to the related scales.

You also need to become very comfortable with the **'Delta method'** for calculating the (approx.) variance of a transformed random variable. The topic is usually taught, often with not much motivation or intuition, under the heading 'Change of variable'. JH thinks a better name would be 'Change of **SCALE**', since the entity under study remains the same. He gives the example of variability, over some time period, in Montreal temperatures. There is really just one r.v., namely temperature. What scale it is measured on is arbitrary and almost secondary. See more on this topic in the teaching article by Hanley and Teltsch. And, when you come to 'Jacobians' in your other classes, this article will make them more real and intuitive.

The reason JH wrote this piece is that he was tired of trying to remember whether the Jacobian in the new density involved 'dy/dx' or 'dx/dy'. Now he has no trouble. He knows that the SD (variance) of temperatures on the Fahrenheit scale must be 1.8 (1.8^2) times the SD (variance) of these same temperatures on the Celsius scale. Conversely, the C scale is only 5/9 ths as wide as the F scale. The C(old) -> F(new) transformation is a linear 1.8x magnification of the C scale. So, since the temperatures are more spread out on the new scale, then, in order to conserve the full probability mass (i.e., pdf integral = 1), the y-axis for the pdf on this new scale only goes up to 5/9 ths of what it did on the C scale. So,

new.pdf(F) = old.pdf(C equivalent of F) x (5/9) = old.pdf(C equivalent of F) x dC/dF

or, generically,

new.pdf(*) = old.pdf(equivalent of *) x ( d.Old/d.New )
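This rule can be checked numerically (a Python sketch; the Normal distribution on the Celsius scale, with mean 10 and SD 9, is purely illustrative): after multiplying by dC/dF = 5/9, the density on the Fahrenheit scale still integrates to 1.

```python
import math

def pdf_C(c, mu=10.0, sd=9.0):
    # an arbitrary, illustrative Normal density for temperature on the Celsius scale
    return math.exp(-0.5 * ((c - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def pdf_F(f):
    # new.pdf(F) = old.pdf(C equivalent of F) x dC/dF, with C = (F - 32) * 5/9
    c = (f - 32.0) * 5.0 / 9.0
    return pdf_C(c) * (5.0 / 9.0)

# both densities should integrate to ~1 (crude Riemann sums over wide ranges)
step = 0.01
area_C = sum(pdf_C(-80 + i * step) * step for i in range(int(200 / step)))
area_F = sum(pdf_F(-120 + i * step) * step for i in range(int(400 / step)))
print(round(area_C, 4), round(area_F, 4))
```

Dropping the 5/9 factor in `pdf_F` would make the second integral come out near 1.8 instead of 1 -- the probability mass would not be conserved.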

In many applications, such as with the logit (and in just about all of the examples in textbooks), the magnification is not the same at different places on the scale. Moving from pi = 0.5 to pi = 0.6 moves the odds from 1:1 to 6:4 or 1.5:1, and thus the log-odds from 0 to 0.405; moving from pi = 0.8 to pi = 0.9 moves the odds from 8:2 or 4:1 to 9:1, and thus the log-odds from 1.386 to 2.197, a difference of 0.811 -- just about double the 0.405 step closer to the centre.
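That arithmetic can be checked in a couple of lines (Python):

```python
import math

def logit(p):
    # log-odds of a proportion p
    return math.log(p / (1 - p))

# equal 0.1 steps in pi translate into unequal steps on the log-odds scale
d_centre = logit(0.6) - logit(0.5)   # log(1.5) - log(1)
d_edge   = logit(0.9) - logit(0.8)   # log(9) - log(4)
print(round(d_centre, 3), round(d_edge, 3))
```

The same 0.1 step in pi is stretched to 0.405 near the centre and 0.811 near the edge of the (0,1) scale.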

[JH doesn't understand why textbooks don't begin with linear transformations.]

**0.2 (m-s) Greenwood's formula for the SE of an estimated Survival Probability**

When JH took mathematical statistics, one of the applications of the Delta method was to work out the (approx.) variance for the surface area of a table that was nominally W units wide and L units long, with a 'manufacturing' error, or variation, of epsilon on W, and a similar (independent) one on L. He was told that sigma(epsilon) was small relative to W and L (i.e., the coefficients of variation, CV_L = 100 x sigma(epsilon_L) / L and CV_W = 100 x sigma(epsilon_W) / W, were low), so that even if we assumed Normal distributions for the two epsilons, the probability of a table with a negative dimension was negligible. One way to arrive at an approximate variance was to expand the product (W + e_W)(L + e_L) and then ignore the small e_W x e_L component. Another was to use the Delta method for the log transformation to derive the variance of the log of the product, and then transform back (again using the Delta method, this time for the antilog transformation).

You can try that same exercise if you want, but I don't expect you have that much free time right now! (Recently, Amy Liu and I had to deal with a variant of this problem, involving correlated errors, when dealing not with a product but with a quotient of two r.v.'s, in an analysis of the errors in quotients caused by using input values extracted from digitized images.)

Exercise 0.2 involves a classic formula that biostatisticians refer to as "Greenwood's formula" and that is central to epidemiologic and biostatistical data analysis.

I suppose you could represent each component in the product as the 'true' amount plus an epsilon, expand the product, and ignore lower-order terms, but it is probably easier to work with the variance of the log of the product, and then go back to the original scale. By the way, there seems to be a fixation on having the variance on the original (0,1) scale, even though Gaussian-based confidence intervals calculated on this (0,1) scale run the risk of going out of bounds. Maybe we should obtain the variance in the (0,1) scale, then move to the (unbounded) logit scale and calculate the CI there, THEN take the anti-logit to return to the (0,1) scale. There was a question on this in the 2012 Part A exam for PhD students in biostatistics.
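A sketch of that logit-scale recipe (Python; the '8 positives out of 10' data are made up for illustration): the naive Gaussian interval on the (0,1) scale overshoots 1, while the interval built on the logit scale and anti-logit-ed back stays in bounds.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def expit(x):
    return 1.0 / (1.0 + math.exp(-x))

# hypothetical data: 8 'positives' out of n = 10
n, x = 10, 8
p = x / n

# Gaussian CI directly on the (0,1) scale -- can go out of bounds
se_p = math.sqrt(p * (1 - p) / n)
naive = (p - 1.96 * se_p, p + 1.96 * se_p)

# Delta method: var(logit(p_hat)) ~ 1 / (n * p * (1 - p));
# build the CI on the (unbounded) logit scale, then anti-logit back
se_logit = math.sqrt(1.0 / (n * p * (1 - p)))
logit_ci = (expit(logit(p) - 1.96 * se_logit),
            expit(logit(p) + 1.96 * se_logit))
print(naive, logit_ci)
```

The naive upper limit exceeds 1; the logit-based interval is asymmetric around 0.8 but respects the (0,1) bounds.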

**0.3 (m-s) Link between exact tail areas of Binomial and F distributions**

Since this problem is a first cousin of exercise 0.4, JH has instead assigned the cleaner 0.4 one. Moreover, as we saw in the assignment on ruptures over a trip of 7500Km, it is easier to 'see' the link between the Poisson and the Chi-Square tails than between the tails of the Binomial and the F.

**0.4 (m-s) Link between exact tail areas of Poisson and Chi-Square distribution**

Déjà vu from last week, so you can use an 'intuitive' proof if you like. Or, if you wish, how about a 'proof by induction'? [JH doesn't see many proofs done this way any more.] Or any other method you can find -- if from the internet, please credit your source!

**0.5 Clusters of Miscarriages**

Assume that -- even though, within a company, the risk of miscarriage varies from woman to woman -- the pregnancy outcomes for different women in the same company are independent. The main point of this (real) story is what some would call the law of large numbers: if there are enough companies, it will happen in 1 (or more) of them. And, of course, there is also the fact that we tend to notice extremes.
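A back-of-envelope version of that point (Python; the per-company cluster probability of 0.01 is a made-up figure, not from the exercise):

```python
# even if the chance of a 'cluster' in any ONE company is small (0.01 here,
# a hypothetical figure), the chance of seeing a cluster SOMEWHERE grows
# quickly with the number of companies being watched
p_cluster = 0.01
for k in (10, 100, 500):
    p_somewhere = 1 - (1 - p_cluster) ** k
    print(k, round(p_somewhere, 2))
```

With 100 companies the chance of at least one 'cluster' is already about 63%, and with 500 it is near certainty -- even though each individual cluster looks remarkable.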

For more on this issue of co-incidences, and if you want a break from the 'harder' stuff, you can look at an article where JH has collected several stories involving the same law of large numbers, and the same fascination with (benign) co-incidences. Of course, in more serious situations, such as clusters of leukemias and miscarriages and the like, it is not so easy to convince people that it's all 'just' co-incidence. And indeed, in any one instance, it is not easy to distinguish a cluster that was caused by some noxious agent from one that is a merely 'random' one.

The "Births Case 3" in that collection -- about numbers of twins in a school -- is probably the closest in structure to the one on miscarriages. JH was also struck by the role of 'filtering' that goes on in human-interest stories, and the tendency of journalists to stretch the details even more to make the odds even longer and the story all the more remarkable! And the fact that the same number in both states (Lottery case 1) was more easily noticed because the two states were beside each other in the alphabetical ordering means that there might well have been other days when two states that were not near each other in the list had the same number -- but were not noticed.

**0.6 "Prone-ness" to Miscarriages ?**

Here we see the (absence of?) one of the other requirements for a binomial. Part 4 of the question deals with this.

One could carry out a formal chi-squared test of the goodness of fit of the expected numbers under a (common) Binomial. You are not asked to go that far; just do a visual test. But think about how many degrees of freedom the statistic should have: it's not 5 - 1 = 4, because, in addition to the constraint that the 5 frequencies add to 70, there is a further constraint imposed by the fact that the expected (i.e., fitted) frequencies must give an overall 30% miscarriage rate.

If you were going to carry out a formal test, one other issue would be how accurate the Chi-Sq distribution is when some of the expected numbers are low, i.e., < 5.

To be thinking about, particularly in light of Dr Moodie's presentation on random effect models: imagine (simplistically) every woman as having been born with a different probability of having a miscarriage, and that, given that probability, the outcomes of successive pregnancies in that woman are all Bernoulli with that same probability. The different probabilities are called 'random effects'. Question: what would the shape of the resulting distribution of outcomes be like?
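One way to play with this idea is to simulate it (a Python sketch; the Beta(1.2, 2.8) random-effects distribution, chosen only because its mean is 1.2/(1.2+2.8) = 0.3, and the 4 pregnancies per woman, are purely illustrative assumptions):

```python
import random
import statistics

random.seed(3)

# each woman's own probability pi_i is a 'random effect' drawn from a
# hypothetical Beta(1.2, 2.8) (mean 0.3); her 4 pregnancies are then
# i.i.d. Bernoulli(pi_i)
def n_miscarriages():
    pi_i = random.betavariate(1.2, 2.8)
    return sum(1 for _ in range(4) if random.random() < pi_i)

counts = [n_miscarriages() for _ in range(100_000)]
print(round(statistics.mean(counts), 2), round(statistics.pvariance(counts), 2))
# under a COMMON pi = 0.3, the variance would be 4 * 0.3 * 0.7 = 0.84;
# the mixture keeps the mean at 4 * 0.3 = 1.2 but inflates the variance
```

The simulated mean stays near 1.2, but the variance comes out well above 0.84: mixing over woman-specific probabilities produces over-dispersion relative to a common binomial.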

If this example is over-simplistic, think of how much of the year each person spends indoors, and what responses you would get if you selected all (or a sample) of them, and called each of them at 4 randomly selected times (from all of the 60mins/24hrs/7/52/ over the year). Certain people would have lower and some would have higher probabilities of being indoors. What do you think might be the shape of the distribution of person-specific probabilities?

**0.7 Automated Chemistries**

Here can you see the absence of one of the requirements for a binomial?

In part 3, an informal 'eye fit' is sufficient.

BTW: by 'normal' Ingelfinger means 'apparently healthy'.

BTW2: How do you think hospitals, and companies who sell them equipment for testing, establish their 'limits of normal' ?

**0.8 Binomial or Opportunistic? Capitalization on chance... multiple looks at data**

This is very much in the same spirit as the 'law of large numbers' mentioned above.

JH recently came across an amusing example involving astronomers (all mathematicians) and the pope. [From Wikipedia:] Pope Clement VI reigned during the period of the Black Death. This pandemic swept through Europe (as well as Asia and the Middle East) between 1347 and 1350 and is believed to have killed between a third and two-thirds of Europe's population. During the plague, Clement sought the insight of astronomers for an explanation. Johannes de Muris was among the team "of three who drew up a treatise explaining the plague of 1348 by the conjunction of **Saturn**, **Jupiter**, and **Mars** in **1341**".

Clement VI's physicians advised him that surrounding himself with torches would block the plague. However, he soon became skeptical of this recommendation and stayed in Avignon supervising sick care, burials, and the pastoral care of the dying. He never contracted the disease.

How many candidate years, and how many candidate planets (and how many other causes?) did these mathematicians search before finding the 'co-incidence'?

This is a bit like not knowing about all of the discarded (unpublished) instances of P > 0.05, and seeing only the 1 (published!) instance of P < 0.05! The p-value loses its interpretation if there is selective reporting.

**0.9 Can one influence the sex of a baby?**

You are not asked to hand in answers to this question, but (besides its use of the Normal approximation to the binomial -- or you could use the exact binomial tail, with software such as Excel or R) it is a good example of the need to beware of selective reporting. Imagine that each of 100 researchers tried a different way to toss 145 coins (or each of 100 biostatistics students used a different random seed to generate 145 Bernoulli(0.5) realizations), and only the ones who got 'statistically significant' deviations from 0.5 reported their findings. Many people are worried that this type of selective reporting is going on in science, aided by the tendency of journals to publish only 'statistically significant' deviations. If you have time, Google the name John Ioannidis, who has been leading a campaign for honest reporting, having found that many so-called findings are not reproducible. In meta-analysis, this phenomenon has been called the file-drawer problem. The funnel plot is a useful way to see whether the p-values that do get reported are representative.
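The 100-researchers thought experiment is easy to simulate (a Python sketch; the two-sided 'significance' cutoff here uses the normal approximation, an illustrative choice):

```python
import math
import random

random.seed(4)
n, researchers = 145, 100

# call a run of 145 fair-coin tosses 'significant' (two-sided, ~0.05) if the
# count of heads is more than 1.96 SDs away from the null mean n/2
cut = 1.96 * math.sqrt(n * 0.25)

def significant():
    heads = sum(1 for _ in range(n) if random.random() < 0.5)
    return abs(heads - n / 2) > cut

n_sig = sum(significant() for _ in range(researchers))
print(n_sig)
```

Even though every 'researcher' is tossing fair coins, typically a handful of the 100 get 'significant' results -- and if only those few report, the literature looks very different from the truth.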

**0.10 It's the 2nd week of the course: it must be Binomial!**

Fixed n? And i.i.d.? If you like, suggest some examples of your own!

**0.11 Tests of intuition**

You can see why we can get more 'extreme' results in small samples.

**0.12 Test of a proposed mosquito repellent**

**0.13 Triangle Taste test**

This is a particularly good one to teach/learn about sample size and power -- directly using the exact binomial -- with no normal approximation to get in the way.
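A sketch of such an exact-binomial sample-size/power calculation (Python; the n = 30 taste triads and the 'real' correct-guess rate of 0.5 are made-up values, not from the exercise): under 'no discrimination' each triad is a 1-in-3 guess, so we find the exact critical count and then the exact power.

```python
import math

def binom_sf(k, n, pi):
    # exact P(X >= k) for X ~ Binomial(n, pi)
    return sum(math.comb(n, j) * pi**j * (1 - pi)**(n - j) for j in range(k, n + 1))

n = 30   # hypothetical number of taste triads

# smallest critical count keeping the exact one-sided alpha at or below 0.05,
# under the 'pure guessing' null pi = 1/3
k_crit = next(k for k in range(n + 1) if binom_sf(k, n, 1/3) <= 0.05)
alpha = binom_sf(k_crit, n, 1/3)

# exact power if tasters were really correct half the time
power = binom_sf(k_crit, n, 0.5)
print(k_crit, round(alpha, 3), round(power, 3))
```

With 30 triads, one needs 15 or more correct to declare discrimination (exact alpha just over 0.04), and the power against a true correct rate of 0.5 is only about 57% -- the kind of honest, approximation-free answer the exercise is after.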

**0.14 Variability of, and trends in, proportions**

**0.15 A Close Look at Therapeutic Touch**

These are 2 real applications of the binomial.

**0.16 We shouldn't trust statistical calculations to those who can run a statistical or mathematical package, but do not have training in mathematical statistics and statistical inference! (this is a real case, involving doping in sport)**