BIOS601: warm-up for bios601, 2022
[updated Aug 10, 2022]
- The objective is to use
Q1,
Q2,
Q3,
Q5,
Q6(parts i to iv; just paragraph 2 of part v, ie just the PLAN),
Q7(2020 parts ii to vi; 2022 part i),
Q10,
Q11,
Q12, and
Q13
of these exercises
to recall and apply statistical concepts and techniques you
have already encountered in your training up to now, and to get up to speed in R.
Some of these concepts/techniques are 'hidden' in (deliberately)
practical exercises, since JH is a firm believer that in graduate school,
the rhythm should not be 'technique -> example' but rather
'problem -> ? technique'.
Whatever answers to these Qs in red you are able to put together by
5 pm ET on Friday August 26 should be uploaded to MyCourses by then.
Any 'JH-readable' format (email text, pdf, word, scans or photos of handwritten material, etc.) is acceptable,
as long as the files are not prohibitively large.
You can also upload your answers as you get them done, i.e., a few at a time,
as you proceed. That might be even better, as it encourages the 'lifelong learning' attitude that JH
advocates,
and breaks the (all-too-common) 'everything the night before the deadline' habit that cannot be as
good a way to master the material.
Moreover, don't spend an inordinate amount of time on any one Q. The purpose here is to
refresh and practice what you have learned, and to 'stretch' you a bit.
But it's fine too to tell me (JH) what you are not able to do.
Different students have covered very different things in their training up to now.
THIS ASSIGNMENT is NOT MANDATORY, and will not count in your grade.
BUT, as you will soon gather from me, the course is not about 'grades' and undergraduate attitudes,
but about you becoming a curious, independent, eager, self-learning and practical biostatistician,
one who cares as much (or more) about the names of the variables and how they were measured
as about the mathematics, and who doesn't simply treat them as Xs and Ys the way mathematicians do.
Nor is this a course in arithmetic and in getting answers to seven decimal places
(like you might have impressed your relatives when you were 9 years of age)
or in getting the cleverest way to code, or the nicest looking graphs.
And so the sooner you start seeing Biostatistics as a way of life, and something
that you immerse yourself in and are excited about every day, the more successful you will be.
[ JH does not want you to work on the warmup exercises after that date.
Instead, he will be asking you to begin working on the
Qs on measurement -- a topic that is less familiar to you and
that he will address in a pre-recorded lecture, and then in the 1st 2 classes
(on Wednesday August 31 and Wednesday September 07). He expects to have the 2022 version of
the measurement exercises finalized soon and to set a deadline of Friday
September 09 for the Qs he will assign on the measurement topic. (For now,
if you are curious you can look at the
2021 version)
For the first several weeks of the semester (when the material is
mostly déjà vu), the rhythm will be: you begin working on
the assigned Qs the weekend BEFORE, we discuss the Qs
in class on Monday and Wednesday,
and you upload your final answers by Friday.]
The first (general) computing issue is (if need be) to get up to speed
in the use of R. See the R links on the main course page.
If you run into problems, let JH know ASAP.
Remarks on specific questions:
Q1
See the pdf file for a suggested way to select a (.csv) dataset of 200 observations from the several JH has assembled for students in previous years.
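If it helps, here is a minimal R sketch of one way to make such a selection; the file names 'dataset.csv' and 'mysample200.csv' are placeholders, not the actual course files:

  dat <- read.csv("dataset.csv")              # placeholder name: use your chosen file
  set.seed(601)                               # make the selection reproducible
  mysample <- dat[sample(nrow(dat), 200), ]   # 200 rows, sampled without replacement
  write.csv(mysample, "mysample200.csv", row.names = FALSE)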
Q1 (ii), Q2 (ii)
Some of the conceptual and practical statistical issues raised by this assignment include the
distinction between standard deviation and
standard error; the concept of a margin of error;
when it is appropriate to use the Normal (Gaussian) approximation to the binomial distribution;
the (often under-appreciated) centrality of the
Central Limit Theorem (CLT) in
applied statistical work, not just for the sampling distribution of a
sample proportion, but also for that of a
sample mean.
Other points the exercise tries to make: most of sampling theory involves the
calculation of variances. These derivations bring you back to fundamentals, and
it's good to be able to work them out from scratch rather than consult a textbook or the internet
and copy the formulae blindly, without an understanding of what the
formula should look like.
You will notice that we have already started calling the sqrt of the variance of a STATISTIC
the STANDARD ERROR of that statistic. One key point about a standard error is that it refers
to the variability of a STATISTIC, not of individual observations. Some writers reserve the
term for cases where one uses a plug-in estimate in a variance formula. For example, they would call
sigma/sqrt(n) the standard deviation of the sample mean, but they would call
s/sqrt(n) the estimated standard deviation, or standard error for short. Bear in mind that this
terminology is not standardized across the profession.
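A small simulation may help fix the distinction (the Exponential population, its mean of 10, and n = 50 are illustrative choices, not part of the Q):

  set.seed(601)
  n <- 50                                # illustrative sample size
  y <- rexp(n, rate = 1/10)              # one sample from a skewed population (mean 10)
  sd(y)                                  # SD: variability of INDIVIDUAL observations
  sd(y) / sqrt(n)                        # SE of the sample mean: s/sqrt(n)
  # the CLT at work: the sampling distribution of the mean is near-Gaussian
  # even though the individual observations are far from Gaussian
  means <- replicate(10000, mean(rexp(n, rate = 1/10)))
  hist(means, breaks = 50)               # close to Normal, centred near 10
  sd(means)                              # close to sigma/sqrt(n) = 10/sqrt(50)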
Q2 Again, use the already-assembled dataset.
Q3 part ii. It would be good to use the d- (density/mass) function for this r.v.
to
make a graph of the probability mass function for the r.v. n'.
Doing so will make it easier to see where the 2 tails begin (i.e., where the 10th and 90th percentiles are).
The graph will also justify the remark about the 'near-Gaussian-ness' of this
distribution.
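The pattern, sketched in R with dbinom() standing in for whichever d- (and q-) function applies to n' (the size and prob values are placeholders only):

  nprime <- 0:100
  pmf <- dbinom(nprime, size = 100, prob = 0.5)    # placeholder parameters
  plot(nprime, pmf, type = "h", xlab = "n'", ylab = "probability")
  qbinom(c(0.10, 0.90), size = 100, prob = 0.5)    # where the 2 tails begin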
Q4, 2016
One of the points of asking you to think about various sampling options to get at the
'census' or 100% answer is to get you to think of sampling as a measurement tool.
Usually, the fancier and costlier the measuring instrument, the better the measurement.
But we can't always afford the million-dollar answer and have to live with the hundred dollar answer.
I once encountered a high-up Air Canada executive who didn't like the fact that the Canadian long-form
portion of the census involved only a 10% sample, and he wasn't sampled. So he did not trust the
results, since not everyone was surveyed. I asked him whether when a doctor drew a blood sample from
him, he should give 100% of his blood so as to get an accurate concentration value.
The Harper government did away with the 1-in-10 random sample that was a compulsory
component of the Canadian census,
and replaced it with a 1-in-3 sample, so that, despite the large refusal rate, the bigger sample size would
still leave about the same n. What do you think of estimates based on this scheme, which has had only about a 65%
participation rate?
The participation rate in the old 1-in-10 sample was 99.999% and (despite Harper's
claims to the contrary) only a handful complained or refused.
One of the first acts of the Trudeau government was to re-instate the mandatory long-form census.
A lot of psychological or psychometric measurement also involves sampling -- of items
to use in a time-limited questionnaire or exam that
can only contain a sample of the items that might be asked about
(think of the format of the old paper-based GRE exams described in the Measurement Statistics Notes). The measurement model
observed estimate = TRUTH + error
is the same whether the error comes from sampling of items or of persons
or of time. In one case we might call it measurement error, and in another sampling error.
But it could include both!
Is it worth extracting and
entering all of the step counts, in all of their digits, to have an answer with more decimal places than one needs?
And in so doing, we overlook other SEs -- i.e., statistical errors that do not decrease in magnitude
as sqrt(n) increases! You can probably think of some in the case of the StepCounter!
JH likes to say that besides standard error, the abbreviation SE could stand for many other types of error.
It could be SAMPLING error, or STATISTICAL error, or STUPID error. Sadly,
statistical theory is only good at quantifying sampling error, where the sqrt(n) is always
in the denominator of the formula. BUT IT DOES NOT KNOW HOW TO JUDGE OTHER TYPES OF ERROR,
and SOMETIMES THESE CAN BE A LOT LARGER THAN THE SAMPLING ERROR, AND THESE NON-SAMPLING ERRORS
CANNOT BE MADE SMALLER BY INCREASING n.
(Note added 2018):
Do you notice anything 'special' or
'different' about the 2016 data? It should not take any fancy
statistical tests ... just the 'intra-ocular traumatic test'.
INCREASING n will just make the answer MORE PRECISELY WRONG.
Q5
Part of this q. returns to the question posed about Q1 at the very beginning,
namely how to draw a sample from a non-uniform distribution.
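Two standard devices in R, sketched here with illustrative distributions:

  set.seed(601)
  # (a) sample() with unequal probabilities, for a discrete distribution
  sample(1:6, size = 10, replace = TRUE,
         prob = c(0.05, 0.10, 0.15, 0.20, 0.25, 0.25))
  # (b) the inverse-CDF method: push Uniform(0,1) draws through the
  #     quantile (q-) function of the target distribution
  qexp(runif(10), rate = 1/10)           # 10 draws from an Exponential with mean 10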
Q6
[Some of the following remarks address some 'labour- and time-intensive' questions
from other years that are being omitted in 2022.]
Like the others, Q6 is a blend of the theoretical and the practical. Here you are asked to use
R to read in the .csv file, and to produce some summary statistics, and calculate some standard errors.
You probably have not worked out the SE [sqrt(Variance)] for a ratio, but this is a good example
of something often used in applied work. Hint: the log of a ratio is the difference of the logs of its components; the
approx. variance of the log of a positive r.v. is (by the Delta method) the original variance times the
square of the Jacobian or scaling factor, evaluated at/near the centre of the old scale.
Think of the variance of September temperatures in F as the variance of September temperatures in C,
multiplied by the square of the scaling factor... the F scale is 9/5 times larger than the C scale.
If the scaling is not linear (e.g., an elastic band), use the scale factor at the centre.
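Put together, the calculation might look like this in R (the means and SEs are made up, and the numerator and denominator are assumed independent):

  var.log <- function(m, v) v / m^2      # Var[log X] ~ Var[X] * (1/mu)^2 (Delta method)
  m.y <- 120; se.y <- 8                  # numerator mean and its SE (made up)
  m.x <- 100; se.x <- 5                  # denominator mean and its SE (made up)
  v.log.ratio <- var.log(m.y, se.y^2) + var.log(m.x, se.x^2)
  sqrt(v.log.ratio)                      # SE of log(ratio)
  (m.y / m.x) * sqrt(v.log.ratio)        # back-transformed: approx. SE of the ratio itself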
Reference is made to the Finite Population Correction (FPC) for the sampling variance. This would apply in
cases where you sample n (< N) of the N members of the Canadian or U.S. Senate, or some of
the 40 pages of gasoline purchases.
It is written in slightly different versions by different people, but JH tends to think of
its approx. value as (1 - n/N). To get the proper variance (assuming in our case that the target is
just these 2 years, nothing else), the variance computed under an infinite population or
sampling with replacement assumption needs to be multiplied by this less-than-unity factor,
so that in the limit, if we sampled all N (i.e., n = N, and the FPC = 1 - n/N = 0), the variance
for our (census) estimate is 0.
The form of the FPC can be derived for the binary response case using
the ratio of the hypergeometric to Binomial variance: the binomial is for samples from an infinite
population -- or a finite one but sampling with replacement-- whereas the hypergeometric is for samples
from a finite one, but without replacement.
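A quick numerical check of that ratio (the N, n, and p values are illustrative):

  N <- 100; n <- 20; p <- 0.3
  var.hyper <- n * p * (1 - p) * (N - n) / (N - 1)   # sampling without replacement
  var.binom <- n * p * (1 - p)                       # with replacement / infinite N
  var.hyper / var.binom                  # exact factor: (N - n)/(N - 1)
  1 - n / N                              # JH's approximation: 1 - n/N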
Your other choice of method to sample the days from the scanned logs can be based on what
factors you think most affect activity, and can be used in the sampling design. There is no one
best one a priori (indeed the computerized data from 2010-2011 could be used to test out various
designs/estimators, but this is not required for the exercise, which is designed just to get
you thinking about the issues).
When using R (or another random number generator or random number table) keep track
of how exactly you started the sequence (see set.seed in R).
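For example (the seed and numbers of days are arbitrary):

  set.seed(20220826)                     # any integer; record it in your write-up
  sample(1:365, size = 14)               # e.g., 14 of 365 days
  set.seed(20220826)                     # re-setting the same seed ...
  sample(1:365, size = 14)               # ... reproduces exactly the same 14 days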
One of the savings to think about when entering data is discarding digits... what would be the effect
if you only entered thousands of steps (rounded or truncated to an integer number of thousands), or
hundreds of steps? Can you anticipate how much 'damage' is done by such approximations?
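A small simulation sketch (the step-count distribution is invented, just to suggest the order of magnitude of the rounding 'error'):

  set.seed(601)
  steps <- round(rnorm(365, mean = 9000, sd = 3000))  # invented daily counts
  thousands <- round(steps, -3)          # e.g., 12345 -> 12000
  hundreds  <- round(steps, -2)          # e.g., 12345 -> 12300
  c(mean(steps), mean(thousands), mean(hundreds))     # the means barely move
  sd(steps - thousands)                  # rounding 'error': SD ~ 1000/sqrt(12) ~ 289
  sd(steps - hundreds)                   # SD ~ 100/sqrt(12) ~ 29, tiny vs SD(steps) = 3000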
If part (v) is taking too much time, skip it for now. JH is keen
to see your answers re the 2020 add-on in part vii: the first of 2 questions on changes during lockdown.
But even without the lockdown, you will find that it is difficult to see a consistent year-to-year pattern in JH's activity.
A lot of the patterns are very particularistic, since in recent years JH has had a few more degrees of freedom than his younger colleagues.
Indeed, you might say that it is hard to fit a model to JH's activity, unless you add some large 'random effects' -- which
is shorthand for saying 'I have no idea why his activity is SO variable, and would have to ask him personally'.
If you have time to pursue the links given with part (b), they are worth following.
Q7
(If you have time, read through the questions from earlier years, just to see what is involved,
and to think about how you might have dealt with them)
JH changed cars 18 months ago, thus the new 'twist' in 2020.
Notice that we might handle 'uncertainty' differently than we handle the
'errors' or 'random variation' we are better at dealing with.
Some (especially Bayesian statisticians) might be able to combine
both, by treating the uncertainties
(e.g., how much gas was in the tank, and how many Km it had done when we bought it)
just like additional random variables.
Many sensitivity analyses (like covid projections) just take 'sensible'
best and worst case scenarios, but don't
put a probability distribution on the possibilities in between.
You can see that now that there is a longer data record to July 2022, the 'uncertainties'
have less impact, and you might be able to remove them altogether.
Of course the more traditional statistical fluctuations also have less impact as the data series gets longer.
This is one of those cases where the impact of the uncertainty and the sampling errors are both reduced
by a longer series. But if there is uncertainty as to whether
the odometer is well calibrated, or if gas pumps (systematically) give less gas than they indicate, then
increasing the length of the series won't improve matters. It may improve the
precision of the fuel economy estimate, but it may be precisely wrong (we will address these
issues in the Measurement Notes).
With this -- and any other -- question, email JH if anything is unclear, or if you find yourself
spending a long time trying to figure out what is being asked -- it may
well be that the issue is with the wording rather than you!
Remember that the price of having homegrown and updated-yearly questions is that there
is less opportunity to beta-test them the way the many boring questions in math-stat textbooks are.
And even then, some -- like the one that Horton talks about (Q8) -- are not caught for many years.
JH inserted parts (iii) to (vi) here, because he has noticed that even though these insights are important to understand,
they are rarely taught well
in regression courses, and instead get lost in matrix algebra formulae. If students truly understood ('owned') these first principles,
they would not be at the mercy of opaque black-box 'canned' (commercial, packaged) software for 'sample size, power and precision considerations'.
A colleague and I, and I on my own, have tried to get people off these, and to just think/work them out from first principles.
Our efforts are
here
and
here.
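As a taste of what 'first principles' means here: the n needed for a desired margin of error (ME) in a sample mean takes only a few lines of R (the sigma and ME values are placeholders):

  sigma <- 3000                          # anticipated SD of individual observations
  ME    <- 500                           # desired half-width of a 95% CI
  z     <- qnorm(0.975)
  ceiling((z * sigma / ME)^2)            # n = 139: no black box needed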
Q8
This Q was new in 2018, and was prompted by a talk by
Nicholas Horton at SSC that summer. Nick has been complaining
about the not-always-very-imaginative way math-stat continues
to be taught, sometimes just as in the 1950s ! Nick
has a very inspiring article on how the teaching of math-stat
can be modernized, without losing the mathematical rigour.
And, instead of just complaining, he is doing something about it.
JH has often complained about the many exercises in (the otherwise
very good) textbook by Casella
and Berger, where the sole task seems to be to integrate
a pdf over some 2-d region, without any motivation, or care for
how such an issue could arise in practice.
We will encounter a practical example of integrating over such
a region when we come to the 'getting from the Peel to the Vendome
Metro stations' in a few weeks.
Some additional links on Nick's work are here
and
Teaching the Next Generation of Statistics Students to 'Think With Data': Special Issue on Statistics and the Undergraduate Curriculum
and this one on math-stat courses.
I Hear, I Forget. I Do, I Understand: A Modified Moore-Method Mathematical Statistics Course
Sadly, progress in this latter area is slow. But JH is trying to sneak his ideas into bios601.
Q9
This Q was new in 2019, but it is an old problem.
JH heard of it from McGill Math-Stat prof David Wolfson.
He would give the 'for real' assignment of actually
tossing the coin 200 times, and then amaze the students
by identifying which of them really did, and which ones
merely 'made up' the sequence. David retired in 2021; JH interviewed him in 2022, and plans to share the video once it
is ready.
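One classic give-away (not necessarily the device David used) is that invented sequences seldom contain the long runs that 200 genuine tosses almost always produce; a sketch in R:

  longest.run <- function(tosses) max(rle(tosses)$lengths)
  set.seed(601)
  real <- sample(c("H", "T"), 200, replace = TRUE)
  longest.run(real)                      # genuine sequences: typically a run of 6-8
  # reference distribution of the longest run in genuine sequences:
  runs <- replicate(10000,
                    longest.run(sample(c("H", "T"), 200, replace = TRUE)))
  quantile(runs, c(0.05, 0.50, 0.95))    # made-up sequences tend to fall below the 5th %ile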
Q10
This Q was prompted by JH's recent work on tracing the history of what we now know as the
'Poisson' distribution. The link to the article is
here.
Q11
The Weibull distribution is widely (and sometimes unthinkingly) used, but its origins are seldom if ever read
or taught.
The original article might surprise you.
Part (v) may also surprise you. It is not as versatile a distribution as you might have been led to believe,
and it certainly would not fit the data in the Nature paper on the UK (now Alpha) SARS-CoV-2 variant.
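One well-known limitation (offered here as a hint, not as the Q's answer): the Weibull hazard is always monotone -- falling, flat, or rising, never up-then-down -- as a few lines of R will show:

  t <- seq(0.01, 5, length.out = 200)
  haz <- function(t, shape, scale = 1)   # h(t) = f(t) / S(t)
    dweibull(t, shape, scale) / pweibull(t, shape, scale, lower.tail = FALSE)
  matplot(t, cbind(haz(t, 0.5), haz(t, 1), haz(t, 2)),
          type = "l", lty = 1, ylim = c(0, 5),
          xlab = "t", ylab = "hazard")   # shapes 0.5, 1, 2: falling, flat, rising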
Q12
Whereas Q11 does not really fit in this 'sampling' topic, Q12 does:
the 'biased' sampling it addresses is a subtle one, and one that
often traps people. This Q is new in 2021, because JH only recently came across the
Galton article, which is the earliest description of the subtlety that he is aware of.
JH made a diagram on the lengths of hospital stay many years ago.
Earlier this summer, he threw out just about all of his paper files, including his
real data on hospital stays. Many decades ago, before computers, each hospital
printed many copies of the daily hospital 'census' -- a list of 800 patients that ran to many pages --
and a copy was placed in each location (such as radiology, the labs, etc.) so that a worker (or a visitor) could quickly find out what ward
and room each patient was in, who the patient's doctor was, etc., and
get in contact by phone with the relevant people. It became obsolete after 1 day,
and so JH availed himself of such a copy.
Because he no longer has the raw data on lengths of STAYS, he instead simulated them using the
distributions of lengths of English WORDS -- these are real,
as are the abbreviations for the names of the books he took them from (you might be interested to know how!).
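The phenomenon itself is easy to simulate in R (the word-length distribution below is invented, not JH's actual one):

  set.seed(601)
  len <- sample(1:12, 5000, replace = TRUE,             # invented word lengths
                prob = c(3, 8, 14, 15, 14, 12, 10, 8, 6, 4, 3, 3))
  mean(len)                              # the 'per-word' mean length
  # picking a LETTER at random and recording the length of the word it sits in
  # over-samples long words, in proportion to their length:
  biased <- sample(len, 2000, replace = TRUE, prob = len)
  mean(biased)                           # larger: estimates E[L^2]/E[L], not E[L]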
Q13
This is also very real.
Unfortunately, the study was done in a year (Jan-June 2021) when the classes
had to be given by Zoom. So, there are many additional contributions to the epsilons!
After the final survey wave in June, one student commented that many of the changes were
influenced by the loosening of COVID restrictions and the warm weather,
which had greatly impacted his/her mental health. (S)he thought that some of these changes in
emotional response were guided not so much by the course as by these external
environmental factors.
This variance calculation emphasizes the combination of unpredictable variations,
and reminds us of the early origins of mathematical statistics, where
the epsilons were 'errors' in the true sense of the word.
They arose when an astronomer could not reproducibly measure the same 'constant'
(e.g., the position or angle of a stationary star, or the weight of a 'standard' weight,
say 1 gram or 1 kg).
Statistics historian Stephen Stigler has a very good chapter on this era,
calling it "Least Squares and the combination of observations".
The full book, "The History of Statistics: The Measurement of Uncertainty before 1900",
can be consulted here.
(His newer book, on the Seven Pillars of Statistical Wisdom, available as an ebook from McGill,
addresses it also.)
Later on, the epsilons were used to denote genuine variations (of say the weights or heights)
of different people from some statistical 'middle' of the population.
However, the rules for the variance of a combination of epsilons stay the
same, whether they are genuine (measurement) 'errors' or
genuine (biological) variations.
We statisticians should watch our language around non-statisticians.
Would you explain to your (same-sex, but non-statistician) sibling, who is shorter/taller than you,
that this is due to an 'error' in the family?
The main work in the derivation of the variance is the book-keeping,
i.e., keeping track of which epsilons are independent of which other ones,
which are (possibly) correlated pairs of epsilons, and where they appear in the 'numerator'.
You might find it helpful to put the epsilons along the edges of an 86 by 86 matrix,
and generalize what we did in Fig 1 in our
'GEE' article but where each w is a 1/39 or -1/47.
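A sketch of that bookkeeping in R, assuming 39 weights of +1/39 and 47 of -1/47; the unit variances, the value of rho, and the particular correlated pairs are placeholders, not the Q's actual structure:

  w <- c(rep(1/39, 39), rep(-1/47, 47))  # the 86 weights
  Sigma <- diag(86)                      # start: independent, unit-variance epsilons
  rho <- 0.5                             # placeholder correlation
  pairs <- cbind(1:5, 40:44)             # placeholder correlated pairs
  Sigma[pairs] <- rho
  Sigma[pairs[, 2:1]] <- rho             # keep Sigma symmetric
  drop(t(w) %*% Sigma %*% w)             # Var(w' eps): all the bookkeeping in one line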
Q14
This is probably the most important of all of the Qs!!