NOTES ON KKMN CHAPTER 3   jh 1999.05.29

3-0 Where all this is going ... ?
=================================

Think of regression as a way to compare two response (Y) means, where the two groups of units are separated by 1 unit on the X axis. It is as though you could find 2 towns whose drinking-water fluoride levels were X and X + 1 parts per million respectively and measure the dental health of the children in these two communities. What regression allows us to do is arrive (synthetically) at this comparison without having to have two actual towns that are exactly 1 ppm of fluoride apart in their drinking water, i.e. mean Y when Fluoride = X ppm vs. mean Y when Fluoride = X + 1 ppm.

3-2 Notation (Y = response, X = determinant)
============================================

If I were writing this book -- or for that matter an introductory text in statistics -- I would use Y [and Y-bar, etc.], rather than X [and X-bar, ...], for a variable. Moreover, I would show its distribution along the VERTICAL axis, rather than along the horizontal axis used in Figs 3-1 and 3-2. The reason is to be compatible with all of the diagrams in the rest of the book (e.g. Fig. 5-4 p. 44, Fig. 5-5 p. 45, Fig. 5-1 p. 40), where the variation in the response is displayed on the VERTICAL axis and the response is labelled "Y". Furthermore, when we measure the variation in Y -- whether without or with regard to X -- we use the standard deviation of the Y's about Y-bar or about the fitted line (cf. Fig 5-7 p. 47). We don't do the same for X, because the X's may have been deliberately chosen by the investigator, and so their variation is not of general interest.

The simplest example of my preferred X-Y display is the comparison of means in two groups, for example the birthweights of infants of non-smoking and smoking mothers, where the raw data are displayed as two columns of dots, with the 2 groups indicated as two ticks on the X axis. Linear regression simply separates the "smoking" group more finely along the horizontal axis, according to the amount smoked. See an example of this progression from one vertical column of Y's to 2 (to an infinite number of) columns of Y's in the piece "Bridge from 607" -- just above these notes on the web page.

Median
======

It is not quite accurate to imply that the median does not "use in its computation ALL the observations in the sample". It does! Rather, as is correctly implied, it is LESS AFFECTED by extreme observations. It is interesting that the authors recommend use of the median instead of the mean in the presence of extreme Y values (note my switch to Y from X). If so, why not pursue the same logic when it comes to regression? Regression is all about patterns in the X-value-specific Y-MEANS. Thus, maybe we should pursue the patterns in the X-value-specific Y-MEDIANS? This would involve switching from least-squares regression, used for means, to least-absolute-deviations regression, used for medians. One of the big (but with computers no longer compelling) reasons for the pursuit of MEANS (and the use of least squares, etc.) is the fact that the theory for means is much better worked out (as is illustrated by the authors' comment on the Central Limit Theorem). There is no equivalent simple Central Limit Theorem predicting the behavior of the median. If interested in mean vs. median, have a look at the short paper by Hanley and Lippman on where to stand when there are 3 unequally spaced elevators: www.epi.mcgill.ca/hanley/c697/
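Aside (not from the text): a minimal numerical sketch, in Python, of why the median is the "least-absolute-deviations" analogue of the mean; the data values are made up. The mean minimizes the sum of SQUARED deviations, the median minimizes the sum of ABSOLUTE deviations, and one extreme observation drags the mean much further than the median.

    import numpy as np

    y_plain   = np.array([2.0, 3.0, 4.0, 5.0, 6.0])    # a small, well-behaved sample
    y_extreme = np.array([2.0, 3.0, 4.0, 5.0, 60.0])   # same sample, one extreme Y

    for data in (y_plain, y_extreme):
        grid = np.linspace(data.min(), data.max(), 10001)       # candidate "centres" c
        sum_sq  = [np.sum((data - c) ** 2) for c in grid]       # sum of squared deviations
        sum_abs = [np.sum(np.abs(data - c)) for c in grid]      # sum of absolute deviations
        print("mean   =", data.mean(),
              "  minimizer of sum of squares ~", round(grid[np.argmin(sum_sq)], 2))
        print("median =", np.median(data),
              "  minimizer of sum of abs. deviations ~", round(grid[np.argmin(sum_abs)], 2))

With the extreme value included, the mean (and the least-squares "centre") jumps from 4 to 14.8, while the median (and the least-absolute-deviations "centre") stays at 4.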
Variance and Standard Deviation
===============================

Think of the variance as the average squared deviation about the mean. For this purpose don't make too much of the n - 1 rather than n; instead, think of it as a sum of n - 1 independent squares, divided by n - 1, so that it qualifies as an AVERAGE.

As for the variance being in squared units, imagine measuring the variability in fertility. The units for average fertility are number of children per woman. The units for the "variance" of fertility are "number of square children per square woman", while the standard deviation is back in the same sensible units as the mean.

There is another way to think of the standard deviation -- without going through the trick of squaring all deviations, finding the average of those squares, and then taking its square root. Simply take the deviations without regard to their sign and average these (absolute) deviations. This "average absolute deviation" will not usually match the S.D. (the square root of the average squared deviation) exactly, but it will generally be quite close.

The use of S (the SD) in combination with Y-bar gives a fairly succinct (but not always complete) picture of the spread and center of the data. Imagine if the statistics for the numbers of television sets per Canadian home were mean 1, standard deviation 1. NOW imagine the SAME statistics for the numbers of ovaries (or testicles) per Canadian adult! Fortunately, when it comes to STATISTICS rather than INDIVIDUALS, their behaviour is much more Gaussian (thanks to the Central Limit Theorem). This same Gaussian-ness of statistics, if not of observations, will carry over to regression coefficients. After all, a regression coefficient is nothing more -- conceptually speaking -- than a difference of Y-means, scaled to be per unit of X.

Last paragraph of 3-2
=====================

The authors distinguish Y and X according to their role. In their example of Y = blood pressure against X = age, the focus will be on the blood pressure averages and SDs at each age, and on the differences in these across ages.

3-3
===

It would have been better to defer the introduction of the binomial distribution to Chapter 23. If the authors are going to introduce it, they should also emphasize:

(i) its shape: it is skewed if the probability of an event is towards the 0 or 1 end of the (0, 1) scale, but less so if n is large [a good example of the Central Limit Theorem!];

(ii) the mean (expected) number of events in a sample of n is nP, where P is what the authors denote by the Greek letter Pi. The variance of Y (rather than X) is nP(1 - P). This is largest when P is close to 0.5, and smallest when P approaches 0 or 1;

(iii) if nP is not so extreme, then Y will have a close-to-Gaussian distribution. Indeed, it was by this route that Gauss and Bernoulli both worked out the equation for the Gaussian ("Bell") curve.

For more on Gauss, see any elementary text, or the 10 Deutsche Mark banknote -- which also has the curve itself, along with a portrait of Gauss -- or go to www-history.mcs.st-and.ac.uk/~history//Mathematicians/Gauss.html or www-groups.dcs.st-and.ac.uk/~history/PictDisplay/Gauss.html
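A small simulation of points (i)-(iii) -- a sketch in Python, not from the text; n and the P values are arbitrary illustrative choices. The count Y in a sample of n has mean nP and standard deviation sqrt(nP(1 - P)), is skewed when P is near 0 or 1, and is close to symmetric (Gaussian) when nP is not extreme.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 50
    for P in (0.5, 0.15, 0.02):
        y = rng.binomial(n, P, size=200_000)              # many realizations of the count Y
        skew = np.mean(((y - y.mean()) / y.std()) ** 3)   # 0 for a symmetric distribution
        print(f"P={P}:  mean {y.mean():5.2f} (nP = {n*P:5.2f})"
              f"  SD {y.std():4.2f} (sqrt(nP(1-P)) = {np.sqrt(n*P*(1-P)):4.2f})"
              f"  skewness {skew:5.2f}")

With n = 50, the counts for P = 0.5 are symmetric and those for P = 0.15 only mildly skewed, while those for P = 0.02 are clearly skewed to the right. (Note too that P = 0.15 and n = 50 reproduce the "mean of around 7.5, SD of around 2.5" used in the left-handedness example below.)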
Note here the distinction between INDIVIDUAL responses and a statistic based on an AGGREGATE of responses. If we were to record whether individuals of a certain age were left- or right-handed (and didn't allow anything other than a dichotomous answer), we would have a series of individual 0's (say, for right-handers) and 1's (left-handers). The variation of these individual responses would be quite skewed (mostly 0's, maybe 15% 1's), but if we examined the counts (sums) in batches of, say, 50, the variation in these sums would be much closer to Gaussian -- with a mean of around 7.5 maybe, and an SD of around 2.5. Now if one plotted the counts for batches of 50, with each batch made up of persons of a certain age, and the ages ranged from say 5 to 95, one might see a gradient in the pattern. This pattern over age might be more pronounced if, instead of left- vs. right-handedness, we were counting the number of female physicians in each sample of 50 Canadian physicians, or if our samples were of Quebec residents and we were counting how many among them had visited the emergency room at least once in the last year.

3-3-2
=====

As already discussed, there is nothing "normal" about the Gaussian distribution. It does sometimes occur in nature. For example, Gauss himself saw it as an "error distribution", where measurements pile up near the middle, with very few measurements at the extremes. He saw this mathematically as the "law of cancellation of extremes", where each measurement is the end result of a large number of steps, each one of which would (independently of the others) involve a slight positive or negative error. For other candidates for a Gaussian distribution, imagine 50th-generation photocopies of the 1-metre baton that is kept in Paris, with each generation introducing a small over- or under-estimation. Or imagine how long it takes you to get to McGill each day as a summation of a large number of small (and largely independent) time segments. It has also been observed in nature, for example in the yearly fluctuation in the maximum height attained by the river Nile (before man built dams on it!), and it is observed in the distribution of heights and other variables that are under tight physiological control (note that variables like weight have too many elective components pushing the distribution upwards in Western societies).

But far and away the most compelling indication for the use of the Gaussian distribution is to describe (predict) the behaviour of STATISTICS (such as means, proportions, differences in means or proportions, regression coefficients, ...) calculated from AGGREGATED data. In other words, the Gaussian distribution is mainly applicable to "man-made" or "person-made" variables derived from summation and/or subtraction.

Again, in the center of page 18, the sentence would more appropriately have been "This rule also applies to the sample mean Y-bar whenever the underlying variable has a Gaussian distribution, or, EVEN IF IT DOESN'T, whenever the sample size is moderately large." Note that the same would apply to Ybar1 - Ybar2, or to (Ybar1 - Ybar2)/(X1 - X2).
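To see the amended page-18 sentence in action, here is a sketch in Python; the skewed (exponential) distribution and the sample size of 25 are arbitrary illustrative choices. The individual Y's come from a clearly non-Gaussian, right-skewed distribution, yet the sample means Y-bar behave in a near-Gaussian way.

    import numpy as np

    rng = np.random.default_rng(2)
    n = 25                                                # a moderately large sample size
    y = rng.exponential(scale=1.0, size=(100_000, n))     # individual Y's: heavily right-skewed
    ybar = y.mean(axis=1)                                 # 100,000 sample means

    def skewness(v):
        return np.mean(((v - v.mean()) / v.std()) ** 3)   # 0 for a symmetric distribution

    print("skewness of the individual Y's:", round(skewness(y.ravel()), 2))   # about 2
    print("skewness of the Y-bar's       :", round(skewness(ybar), 2))        # about 0.4
    # Gaussian-ness check: roughly 95% of the Y-bar's fall within 2 SD of their mean
    within = np.mean(np.abs(ybar - ybar.mean()) <= 2 * ybar.std())
    print("fraction of Y-bar's within 2 SD of their mean:", round(within, 3))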
3-4 t, chi-square and F
=======================

There is an interesting vignette on the derivation of the equation for the curve of the t-distribution, illustrating how far computing has come in the last 91 years, and how practical and inventive one had to be at the turn of the 20th century!

"Before I had succeeded in solving my problem analytically, I had endeavoured to do so empirically [i.e. by simulation]. The material I used was a ... table containing the height and left middle finger measurements of 3000 criminals.... The measurements were written out on 3000 pieces of cardboard, which were then very thoroughly shuffled and drawn at random... each consecutive set of 4 was taken as a sample... [i.e. n=4 above]... and the mean [and] standard deviation of each sample determined.... This provides us with two sets of... 750 z's on which to test the theoretical results arrived at. The height and left middle finger... table was chosen because the distribution of both was approximately normal..."
W.S. Gosset, 1908 (writing from the Guinness Brewery under the pen-name "Student")

There isn't one t-distribution; there are as many as there are sample sizes (and associated degrees of freedom for estimating sigma, the "population" SD). The authors should have given a few of them in Fig 3-4(a), showing that the smaller the degrees of freedom, the heavier the tails, and that as the degrees of freedom become large (so that S is a reasonably good estimate of sigma), the t-distribution comes closer to the Gaussian distribution.

The example given after equation 3.6 uses a difference of two means, divided by its standard error [we say "standard error" when we substitute an estimate of a parameter into a formula for the variability of a statistic]. It would have been more instructive to give Ybar1 - Ybar2 as the theta-hat, because it would be a preview of the slopes of the form (Ybar1 - Ybar2)/(X1 - X2) we will be dealing with throughout the course. And the pooling of variances here is a preview of the pooling of squared residuals throughout the rest of the course.

For Chapters 5 to 16, the chi-square distribution is not that relevant. It will surface again in Chapters 23 and 24, dealing with counts [where variances don't have to be estimated separately from means]. The F distribution is central in regression, particularly if we test 2 or more variables simultaneously. If we deal with just one variable (i.e. one regression coefficient), we can make do with the t-distribution, since the corresponding F statistic is none other than the square of the corresponding t statistic (mentioned at the end of section 3-4).

3-5
===

Again, it would have been better to use Y rather than X ... in regression we do not treat the X's as random variables; instead we treat them as though the values had been chosen by the investigator. In Example 3-2, it would have helped to think of the responses (changes in health status) as Y's and the two groups as indexed by a binary X variable, with value 1 for group 1 and 0 for group 2. Then the 15.1 minus 12.3 is the difference in Y-bars and the 1 - 0 the difference in X's, giving a slope of (15.1 - 12.3)/(1 - 0). Only the numerator of the slope is random. It would also help to think of the variations of the group 1 observations from Ybar1 as "residuals". These individual residuals are squared and aggregated with the squared residuals within group 2 to form one (average) estimate of the within-group variation.
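A small check of this point -- a Python sketch; the two "groups" below are simulated with arbitrary means and SD, not the Example 3-2 data. Regressing Y on a 0/1 group indicator by least squares gives exactly Ybar1 - Ybar0 as the slope, and Ybar0 as the intercept.

    import numpy as np

    rng = np.random.default_rng(3)
    y1 = rng.normal(15.1, 3.0, size=40)            # group 1, coded X = 1
    y0 = rng.normal(12.3, 3.0, size=40)            # group 2, coded X = 0
    y = np.concatenate([y1, y0])
    x = np.concatenate([np.ones(40), np.zeros(40)])

    slope, intercept = np.polyfit(x, y, 1)          # ordinary least-squares straight line
    print("slope        :", round(slope, 3))
    print("Ybar1 - Ybar0:", round(y1.mean() - y0.mean(), 3))   # identical to the slope
    print("intercept    :", round(intercept, 3))
    print("Ybar0        :", round(y0.mean(), 3))               # identical to the intercept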
Problems (p. 30 - 32)
=====================

Problems worth working through: 3, 5, 6, 8, 10 (you don't have to agree with the authors on part a!), 12, 14, 18, 20. Note that the authors "force" the assumption (in problems 11, 13, and 16) that weight would have a "normal" distribution across subjects. This is highly unlikely. But all is not lost if this unrealistic assumption fails, since if the sample sizes are at all decent -- say even in the double digits -- the Central Limit Theorem would help ensure the closer-to-Gaussian behaviour of the statistics (the means) even though the weights themselves DO NOT have a Gaussian distribution. Note also the highly confusing use of the word "normal", e.g. in problem 17, where samples are drawn from a "normal" population; what the authors want to say is that the variation of the variable of interest is Gaussian.

What is missing from this "Review of Basic Statistics" chapter, especially since the rest of the book is about estimating slopes (i.e. regression coefficients):

3.7 Prelude to the variability of slopes: variance formulae for certain linear combinations of responses
=========================================================================================================

Example: Because of measurement errors, measurements on a subject at time t1 vary around the true mean M1; measurements on the subject at time t2 vary around the true mean M2. Say the measurement errors have the same variance at t1 and t2, and that the observed measurements are m1 and m2. Form an estimate of the change per unit time (the slope) using (m2 - m1)/(t2 - t1). Then

   m1 = M1 + e1
   m2 = M2 + e2

so the estimate of the slope is

   (M2 - M1)/(t2 - t1)  +  (e2 - e1)/(t2 - t1),

i.e. true slope + error in slope (assuming no error in t2 - t1). The variation of (e2 - e1)/(t2 - t1) is 1/(t2 - t1) times the variation of e2 - e1. If e2 and e1 are independent, each with standard deviation S, then e2 - e1 has a standard deviation of sqrt[S-squared plus S-squared], or sqrt[2] times S. So the standard deviation of all the possible slope estimates is

   {1/(t2 - t1)} times sqrt[2] times S.

Notice that the variability comes from the variability of the measurements (S) and that it is inversely related to how far apart the two time points t1 and t2 are. This is the SIMPLEST example of the variability of a linear combination of random variables. Here the two random variables are m1 and m2 and we combine them using the coefficients +1/(t2 - t1) for m2 and -1/(t2 - t1) for m1.

IN GENERAL, if we have n INDEPENDENT random variables Y(1), Y(2), ..., Y(n) and we form the linear combination c(1)Y(1) + c(2)Y(2) + ... + c(n)Y(n), then its standard deviation is the square root of the sum of c(i)-squared times var[Y(i)], with the sum going from i = 1 to i = n. Thus, in a regression we estimate the slope using c(i)'s which give greater weight to (make greater use of) Y observations at the two extremes of the X range and less to those at the centre of the X range.

The text simply gives the formula for the standard error of the slope in equation 5.9, p. 53, without any heuristic explanation. See JH's 607 notes on correlation and regression for comments on the "anatomy" and "reliability" of a slope. Note that in the slope (eqn 5.4, p. 48) the weight given to the i-th Y is a (scaled) version of the deviation of X(i) from Xbar. Thus, the estimate of the slope is a linear combination of the Y's. Likewise, the intercept (equation 5.5, p. 49) is Ybar minus beta1_hat times Xbar. Now Ybar is a linear combination of the Y's, beta1_hat is another linear combination of the Y's, and Xbar is just a constant. Thus beta0_hat is a linear combination of the Y's, so its variance is a sum of terms of the form C(i)-squared times S-squared, where C(i) is the weight associated with Y(i).

It is interesting that Moore and McCabe, in Ch 2 of their introductory text, treat simple linear regression first as an exercise in descriptive statistics, just as they do the summary statistics for a single Y variable. Only much later (Ch 10/11), after they have introduced standard errors, confidence intervals and P-values, do they come back to inference for regression. A similar treatment of regression lines as descriptive summaries would have been appropriate in KKMN Chapter 3.
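To tie section 3.7 together, a small numerical check -- a Python sketch; the X values, S, and the true slope/intercept are arbitrary illustrative choices. The least-squares slope is the linear combination of the Y's with weights c(i) = (X(i) - Xbar) / sum of (X(j) - Xbar)-squared, and the simulated standard deviation of the slope matches sqrt[ sum of c(i)-squared ] times S. With only two X points the formula reduces to the sqrt[2] times S / (t2 - t1) worked out above.

    import numpy as np

    rng = np.random.default_rng(4)
    S = 2.0                                            # SD of the measurement errors
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])       # X values chosen by the investigator
    true_intercept, true_slope = 10.0, 1.5

    c = (x - x.mean()) / np.sum((x - x.mean()) ** 2)   # the weights c(i) applied to the Y's
    predicted_sd = np.sqrt(np.sum(c ** 2)) * S         # sqrt( sum c(i)^2 * S^2 )

    # simulate many data sets; each slope estimate is the linear combination sum c(i)*Y(i)
    slopes = [np.sum(c * (true_intercept + true_slope * x + rng.normal(0.0, S, size=x.size)))
              for _ in range(20_000)]
    print("SD of the simulated slopes:", round(np.std(slopes), 3))
    print("sqrt(sum c(i)^2) times S  :", round(predicted_sd, 3))

    # two-time-point special case: X = (t1, t2) gives sqrt(2) * S / (t2 - t1)
    t1, t2 = 1.0, 4.0
    x2 = np.array([t1, t2])
    c2 = (x2 - x2.mean()) / np.sum((x2 - x2.mean()) ** 2)
    print("general formula           :", round(np.sqrt(np.sum(c2 ** 2)) * S, 3))
    print("sqrt(2) * S / (t2 - t1)   :", round(np.sqrt(2) * S / (t2 - t1), 3))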