NOTES ON KKMN CHAPTER 5                                        jh 1999.05.29

5-1 Preview
===========
Note that in this chapter, and indeed in all the chapters up to 21, it is
implicit that Y is measured on a continuous (or effectively continuous) scale.

5-2
===
"Finding the curve that best fits the data" was a purely MATHEMATICAL problem
long before it became a STATISTICAL problem. The use of LEAST SQUARES in
section 5-5-1 as the fitting criterion does not involve statistical
assumptions or models, but it does involve a PARTICULAR definition of how we
rank the fits of different lines/curves.

If we were to treat the problem as simply finding a line/curve that is somehow
"close" to the data points, then presumably it wouldn't matter if we looked at
the data with the X values on the vertical axis and the Y values on the
horizontal axis. But, in regression, it DOES matter which variable is plotted
on the vertical axis. The authors could have made this clearer if they had
used a non-symmetric phrasing: they speak of approximating the "true
relationship BETWEEN X and Y." It might have been better to speak of how, at a
particular X value, the location and possibly the spread of all of the
possible Y values is related to, "predictable from", or "driven by" the X
value.

Note also that the object of interest is not the relationship in the OBSERVED
Y's, but rather in the UNOBSERVED Y's. That's why the authors speak of the
TRUE relationship; this idea that we are trying to learn about the
mathematical relationship BEHIND the observed data, and that statistical
inference is about the data we did NOT observe, is a key one to keep in mind
throughout the course. If we weren't interested in the "behind the observed
data" situation, and only in the empirical values, then there would be nothing
more to say after one has plotted the data. The only justification for
pursuing a regression model would be if the data were so voluminous that the
line or curve was viewed simply as a data-summarization technique -- a bit
like what is done with data-compression techniques that involve some
(negligible) "loss" when the compressed data are "unpacked".

5-2-1
=====
The authors give the impression that, for any one unit, the "X" and "Y" values
are always observed "simultaneously", as happens in what epidemiologists call
a "cross-sectional" study. Whereas this may often be the case, it is better to
think of the units as having first been SELECTED on the basis of their X
values, and then MEASURED (observed) with respect to their Y values. This
viewpoint serves two useful purposes: (1) it emphasizes that the X values are
not "random" in the same way as the Y's, and (2) it reminds us that, if one
has a choice, one can be efficient about which X's to study.

Imagine that an investigator was interested in the relationship between height
and weight (or, more correctly, the influence of X = height ON Y = weight).
Suppose (s)he could obtain a list of persons' heights from, say, a file of
drivers' licenses. Then it makes more sense to deliberately study, say, 5
persons at each level of height, rather than taking a blind sample that gives
the naturalistic distribution of heights in the source. The heights of the
randomly chosen persons in the study are determined by (a) nature, (b) the
stratified sampling scheme, if used, and (c) the random selection mechanism.
Nevertheless, in the regression analysis, this randomness in the heights is
not used. However, stratified selection is much more efficient (makes for less
variable estimates of the "slope" of weight on height) in this situation than
the use of a blind (unstratified) selection -- see the small sketch below. It
is somewhat ironic that, from a biologic viewpoint, X = height is a variable
that the "owner" has little control over, whereas Y = weight is more
"elective" and somewhat more under the "owner's" control.
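A minimal sketch, in Python, of why the stratified selection is more
efficient. The numbers (50 subjects, heights from 60" to 78", a within-height
SD of 10 lbs) are invented for illustration; the point rests only on the fact
that the standard error of a least-squares slope is sigma / sqrt(sum of
(x - xbar)^2), so the design that spreads the X's out more wins.

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma = 50, 10.0                  # 50 subjects; within-height SD of weight (lbs)

    # (a) stratified: 5 persons at each of 10 height levels, 60" to 78"
    x_strat = np.repeat(np.linspace(60, 78, 10), 5)
    # (b) "blind" sample: heights as they come in the source population
    x_blind = rng.normal(67, 3, size=n)

    def se_slope(x, sigma):
        # theoretical SD of the least-squares slope, for a given set of X's
        return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

    print("SE of slope, stratified selection:", round(se_slope(x_strat, sigma), 3))
    print("SE of slope, blind selection     :", round(se_slope(x_blind, sigma), 3))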
One last point on the fact that we treat the X's in a regression as "knowns":
in actual fact, the heights in the drivers' license database are self-reported
and subject to both random errors of measurement and, even if measured well,
non-random (!!) errors of reporting. Unfortunately, the effects of such
"errors in X" are typically ignored in reports of regression analyses. In the
example of the regression of Y = blood pressure on X = age, there could also
be errors in the reporting and/or recording and/or computerization of age. For
persons who are the same age, the observed variation in BP across these
persons is a composite of several sources: (i) true inter-person variation,
(ii) true intra-person variation (again biological), and (iii) measurement and
recording errors. In many applications, it is not possible, without extra work
or outside information, to separate these three components.

Figure 5-1
==========
Note the // and \\ marks to indicate that the Y and X axes do not display
zero. This is good practice. I doubt if any of the plotting facilities in the
commonly used software packages allow for such marks.

5-2-2 Basic Questions
=====================
Note that at this stage, the authors do not really give the purpose of the
"model" (line/curve/ ...). Is it to try to "get close" to the Y's? Is the line
or curve supposed to be an estimate of the "centre" of the Y data at each X
value, and in what sense do we mean "centre"? Is the object to fit these
particular data well, or to estimate a model for all of the data not shown in
the figure? Why are we fixated only on the "centres" (however defined) and not
on describing how (vertically) VARIABLE the data are about these "centres"?

5-2-3
=====
No comment!! (But lots later!)

5-3
===
It is interesting that a few hundred years ago, scientists would use A and B
where we now use X and Y, and X and Y where we use A and B -- i.e., they used
X and Y for the coefficients and A and B for the variables. In high school, I
learned the equation of a line as y = mx + b, i.e. with m for slope and b for
intercept. Other commonly used letters are a for "intercept" and b for
"slope", i.e., y = a + bx.

Good examples of mathematical straight lines are the relationships between
temperature in Fahrenheit and Celsius, i.e.,

   F = 32 + (9/5)C   or   C = (5/9)(F - 32) = -160/9 + (5/9)F

---------
Q: At what temperature, sometimes found in Canadian Prairie winters, is the
value the same in F and C?
----------

Note that this perfect, purely mathematical situation is the only time that
the slope of the regression of F on C, 9/5, is the EXACT inverse of the slope
of the regression of C on F, 5/9. Whenever, because of imperfect measurements,
or biological variation, or whatever other reasons, the data points do not lie
exactly on the line, the slopes are not the exact reciprocals of each other --
the small check below illustrates this.
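A small check, in Python, that the two least-squares slopes -- F on C and C on
F -- are exact reciprocals only when the points lie exactly on the line; with
any scatter, their product is r-squared, which is less than 1. The "sloppy
thermometer" noise (an SD of 3 degrees F) is invented for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    C = np.arange(-40, 41, 5.0)                      # Celsius values
    F_exact = 32 + (9/5) * C                         # exact conversion
    F_noisy = F_exact + rng.normal(0, 3, C.size)     # imperfectly "measured" F

    def slope(y, x):
        # least-squares slope of y on x
        return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

    for F, label in [(F_exact, "exact"), (F_noisy, "noisy")]:
        b_FonC, b_ConF = slope(F, C), slope(C, F)
        print(label, ": F on C =", round(b_FonC, 3), "; C on F =", round(b_ConF, 3),
              "; product =", round(b_FonC * b_ConF, 3))   # product = r^2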
Note that it is not always helpful to write the equation in terms of the
intercept at X = 0; often, as in the example of C as a function of F, it is
better to start F at 32, and then show how much C moves up from there, rather
than from F = 0. This is not just because C = 0 at F = 32. One might want to
start "in the middle", with say the usual ambient Canadian summer (or Addis
Ababa all year) temperature, say F = 68, and write the relation as

   C = 20 + (5/9)(F - 68)

Likewise, if our concern was with translating body temperatures, we might
start at F = 98.6, giving

   C = 37 + (5/9)(F - 98.6)

The point is that one can start anywhere, so why not start at some relevant
value, like the effective BEGINNING or MIDDLE of the X values. It doesn't have
to be the middle exactly ... any convenient location is fine. If you read the
article by Mosteller (on the web page), note how he writes of the "intercept
at xbar".

Good examples of this are some equations in the Montreal Gazette in 1995 (see
article under Chapter 11 below) giving "IDEAL" body weight as a function of
height:

   IDEAL WEIGHT = 100 lbs + 5 lbs for every 1" over 5 feet, if female
                = 106 lbs + 6 lbs for every 1" over 5 feet, if male
   i.e.
   IW(lbs) = 100 lbs + 5(H" - 60"), if female
   IW(lbs) = 106 lbs + 6(H" - 60"), if male

We could equally have written these as

   IW = -200 lbs + 5(Height in inches), if female
   IW = -254 lbs + 6(Height in inches), if male

but they wouldn't be as useful in this form!

------------
Exercise: Convert these equations to kilograms and centimetres.
------------

The exercise will probably show you that the equations above are not
technically accurate, since the units do not match all the way across. Keeping
track of the correct units makes it clear what the units of the slope are. The
left hand side is in lbs. The "intercept" of 100 on the right must also be in
lbs. The height is in inches; to make sure that the product of the 5 and the
(H" - 60") is also in lbs, we need to say that the slope is not the UNITLESS
5, but rather 5 lbs per 1". Then the product of the lbs/inch and the inches
yields lbs, matching the intercept and the left hand side of the equation.

Incidentally, the data in Figure 5-1 remind me of the "rule of thumb"

   Blood Pressure | my age = 100 + my age

Technically speaking, this should be

   BP (mm) | my age in years = 100 mm + (1 mm/yr) x age in years

Note that mathematically this is the same as 125 mm + (1 mm/yr)(age - 25).
A tiny sketch of this "start anywhere" re-writing follows.
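A tiny sketch, in Python, of the re-writing: the same line b0 + b1*X can be
re-expressed, without changing it in any way, as (value at X0) + b1*(X - X0)
for ANY convenient reference value X0. The two examples re-use the equations
above.

    def value_at(b0, b1, x0):
        # the height of the line b0 + b1*X at X = x0,
        # i.e. the "intercept" when the line is written relative to X = x0
        return b0 + b1 * x0

    # "ideal weight" (female): IW = -200 + 5*H is the same line as 100 + 5*(H - 60)
    print(value_at(-200, 5, 60))    # 100
    # blood-pressure rule:     BP = 100 + 1*age is the same as  125 + 1*(age - 25)
    print(value_at(100, 1, 25))     # 125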
Going back to the use of the Greek letters beta_0 and beta_1: if one were just
speaking mathematically about equations of straight lines, one would not use
Greek letters for the slope and intercept. The reason the Greek letters are
used here is to denote statistical PARAMETER values, just like we use the
Greek letters mu, pi and sigma (think of beta's as differences in mu's,
divided by differences in X's!!). In real life, we will never be able to
observe these parameter values. They are technically "UNKNOWABLE". Instead,
since all of our datasets are FINITE, we will only be able to derive ESTIMATES
of the parameters, using the STATISTICS calculated from the observed data.
Moore and McCabe remind us to associate "Parameter" with "Population" (or
Universe, or IN THE ABSTRACT) and "Statistic" with "Sample". In the KKMN text,
a parameter estimate is denoted by the symbol for the parameter (beta_1 for
example) with a hat ("chapeau") over it.

Some epidemiologists find statisticians' use of Greek letters (and hats)
pretentious, and use instead capital (upper case) letters for parameters and
the corresponding lower case letters for the statistics (estimates of the
parameters). Thus, for example, where a statistician might use the Greek
letter pi for the theoretical proportion and pi_hat for an estimate of it,
these authors use P and p. Likewise, they use "OR" for the theoretical odds
ratio and lower case "or" for its empirical value. (Theoretical statisticians
denote the theoretical odds ratio by the Greek symbol psi.) This same, quite
appealing, scheme of using upper case for the theoretical and lower case for
the empirical (sample) version carries over nicely for regression coefficients
-- with B0, B1, B2, ... for the theoretical (unknowable) coefficients and b0,
b1, b2, ... for their empirical counterparts. [Some statistical texts
compromise, using Greek betas for parameters, and lower case b's for estimates
of them!!]

Quite apart from any pretentiousness, and the math-anxiety engendered by fancy
Greek symbols and hats, there is another practical reason for the simpler
"B0, B1, B2" / "b0, b1, b2" usage -- they can be written in plain ASCII
text!! Moreover, the estimates in the output from computer packages (i.e. the
estimates of the parameters) never come with hats on them. The only exception
is on page 50, where the authors have typed beta_0_hat and beta_1_hat (in
Greek and with hats) onto the output produced by SAS. Note there that in the
regular output each estimate is called just that -- "parameter estimate".

5-4 Assumptions for Straight-Line "Model"
=========================================
Para 2: The authors cannot shake their obsession with trying to predict
individual Y's. If we measured Y and X for every individual in a universe, we
could get perfectly precise estimates of the mean Y for persons with each
value of X. I wouldn't call these "approximations"; rather, I would say we
estimated the different (X-specific) MEAN responses very well. But the fact
that there are a large number of individuals at a given value of X doesn't
make them any less (or more!) variable as to their Y values -- i.e. they
remain INDIVIDUALS. We are a long way from being able to predict (or explain)
why -- even if they had the same values on the 10 most important predictors --
babies differ from one another with respect to birth weight.

Para 3: The authors focus on the parameters of the STRAIGHT LINE, i.e. of the
X-specific CENTRES of the Y distributions, and ignore for now what else one
needs in order to "predict" where individuals will be around these centres --
some X-specific measure of VARIATION about the X-specific mean Y.

The 5 "assumptions" or conditions "needed" to make inferences concerning the
"true" or "theoretical" line are overly stringent. If one truly is concerned
with where the line is, one can in certain situations make valid inferences
with somewhat less demanding assumptions. This is especially the case with
respect to "normality" (alias "Gaussian-ness"). If the "independence" and
"homoscedasticity" assumptions are not fulfilled, the main casualty is that
confidence intervals for the parameters of the true (theoretical) line may be
somewhat inaccurate. (Homoscedasticity would be important if one were
constructing growth curves, where the variation in height at the younger end
of the age scale is less than at the older end.) Inaccuracies in standard
errors and confidence intervals for the parameters of the line (and thus for
the MEAN Y at a given X value) can generally be "fixed" without having to
throw out what may be a perfectly good straight line assumption just to try to
satisfy the other requirements.

It is worth examining the logic behind steps 1-6 on page 41.
I would not fuss about "normality" at step 3, especially if in steps 4/5 I
might decide that a straight line was inadequate and I was going to try a more
complex model. I would leave "normality" and "homoscedasticity" to the end,
and even then I would put them subservient to the fit of the line or curve.
(Incidentally, it is not clear how, in step 5, one can "repeat step 3", i.e.
fit a straight line.) The steps are more accurately described in Figure 5-2
than in the text above it.

5-4-1 (assumptions)
===================

1 - "Existence"
===============
I am not sure I really understand why this condition wouldn't always be
satisfied. To me, this "assumption" is really a DESCRIPTION of the regression
situation itself (X-specific Y means, which we hope to find a pattern for).
Incidentally, one shouldn't insist that such Y|X distributions have to exist
for ALL possible X's in the range of X. For example, if Y was birthweight, and
X was birth order, one wouldn't insist on investigating the mean and standard
deviation of Y when X = 1.5 or X = 2.3.

If each mean(Y|X) is a "dot", and even if each mean is based on a very large
number of observations, the idea of a regression is that we do NOT "join the
dots" (as was done in Figure 5-4), but rather that we find a smooth line or
curve (a function of X) that is a parsimonious approximation to the sequence
of MEANS. I mentioned in an earlier chapter that the "dots" could have been
some other measure of the "centres" of the X-specific Y distributions. If it
weren't for the intractability of working with them, medians would have been a
useful alternative. Indeed, the published data on Canadian
gestational-age-specific birth weights used medians rather than means -- and
10th and 90th percentiles rather than standard deviations. In that situation,
the data were so extensive, and the pattern of centres so smooth a function of
age, that there was no need for any further "smoothing" by regression. And in
any case, the focus is always gestational-age-specific!

2 - Independence
================
The invalid statistical conclusions are in respect to interval estimates
(confidence intervals) rather than to point estimates.

3 - Linearity
=============
In one sense, of all 5, this is surely the most important. If one cannot well
approximate (estimate) the "centres", then what hope does one have of going
further and describing the range of variation of the individual Y's? On the
other hand, we don't want the best to be the enemy of the good [have I got
this the right way round?]. There is a great danger of overcomplicating the
fitted models, and even of being led astray by "chasing" every single twist
and turn. The complex functions fitted to average alcohol consumption versus
age [ref] in a population, and to WBC as a function of time in an individual,
are two examples of "computers over reason". Likewise, if the data are
extensive, the pattern of means may be an "adequately accurate" straight line
function of X, but a formal statistical test may indicate that a straight line
"does not fit." The departures from linearity may be extreme in p-value terms
because of the large n on which the test was based, but of no practical
importance in the big scheme of things.

In addition to equations 5.2 and 5.3, it might be good to make the regression
even more explicit:

   Y|X      = mu(Y|X) + individual variation
   mu(Y|X)  = B0 + B1.X        [I am using "." for multiplication here]

Think of 5.2 as the "systematic" part, onto which one adds "individuality".
The small simulation below makes this two-part structure concrete.
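A minimal simulation, in Python. The parameter values 98.7, 0.97 and 17.3 are
borrowed from the SBP-on-age example that comes later in these notes; the ages
themselves are invented. Each Y is built as the systematic part mu(Y|X) =
B0 + B1.X plus that individual's own "individuality" E; the least-squares fit
recovers the line of means, but individuals still scatter about it.

    import numpy as np

    rng = np.random.default_rng(3)
    B0, B1, sigma = 98.7, 0.97, 17.3           # "true" intercept, slope, and SD of E
    age = rng.uniform(20, 70, size=200)        # 200 hypothetical subjects
    mu  = B0 + B1 * age                        # systematic part: the line of means
    E   = rng.normal(0, sigma, size=age.size)  # individual variation about the line
    sbp = mu + E                               # what we actually observe

    b1, b0 = np.polyfit(age, sbp, 1)           # the fit recovers the LINE well ...
    print("fitted line:", round(b0, 1), "+", round(b1, 2), "x age")
    resid_sd = np.std(sbp - (b0 + b1 * age), ddof=2)
    print("... but individuals still scatter; residual SD =", round(resid_sd, 1))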
As discussed earlier, the authors are being a bit unrealistic in giving the
impression (2nd para, p 45) that the large variation in Y = birth weight
across infants born at, say, X = 37 weeks represents "errors" from the average
birthweight of 37-week-olds. "Deviations" is a somewhat less "loaded" term.
Even then, people could be forgiven for imagining that statisticians, with
their preoccupation with errors and deviants, deserve the accolade of "dull"
or morbid scientists more than economists do.

4 - Homoscedasticity [equal variances]
======================================
Statisticians have a habit of thinking and writing about variation in the
squared scale. They use the word "variance" in a technical sense, for the
square of the standard deviation. There are good theoretical reasons for
working this way, the most important being that the variation -- whether in
the natural scale or in the square -- of an aggregate statistic usually
involves the sums of the variances associated with the individual components.
One cannot combine standard deviations directly; one must combine their
squares. Then, to get back to the natural scale, one must take the square
root. For the purposes here, the homoscedasticity assumption can equally be
written

   sigma(Y|X) = sigma, for all X.

Homoscedasticity, or rather the far more common heteroscedasticity, is not a
first-order worry, UNLESS the concern is with establishing X-specific
percentiles for individual values of Y (as in growth charts). Mercifully, this
textbook is a lot less fussed about heteroscedasticity than others (e.g. Neter
et al). Moreover, one of the "fixes" (transformations -- see p. 252) can ruin
an otherwise perfectly reasonable linear relationship. Instead, in such
instances one can preserve the linearity and give lesser weights to
observations that are more variable (p. 250).

5 - Normality (Gaussian-ness)
=============================
The authors' plea to give considerable leeway before switching from an
otherwise reasonably-fitting model should be heeded. The one situation where
the assumption is CRITICAL is when the fitting goes beyond the usual focus on
CENTRES to the estimation of PERCENTILES of the distribution of INDIVIDUAL Y
values. Again, a good example is the construction of growth charts, where not
only are the individual height variations wider at the older than the younger
end, but the variations may not even be symmetric -- let alone Gaussian!

In situations where the focus IS on the CENTRES (i.e. on the line or curve),
and the sample sizes are reasonably large, the Gaussian-ness or
non-Gaussian-ness of what the authors call the E's becomes a non-issue. This
is for the same reasons that the Z- or t-distribution is a reasonable
reference distribution for STATISTICS like a mean, or a difference of means.
The Central Limit Theorem, APPLIED TO THE E's, ensures that even if the E's
are not Gaussian, aggregate statistics calculated from them have
closer-to-Gaussian variation. I emphasized in an earlier chapter that
estimated slopes (regression coefficients) are linear combinations of Y's;
each Y in turn is the sum of a constant (but unknowable) mu(Y|X) and an E. So
the random component of the slope is a linear combination of E's.

5-4-2 Summary & Comments
========================
The first paragraph makes an important point, namely that the Gaussian-ness
(and homoscedasticity) are in terms of the E's, not the Y's. All too often,
students (and others who should know better) test or inspect for "normality"
in the OVERALL dataset (i.e. collapsed over levels of X) rather than
X-SPECIFICALLY. Recall my comments about the near Gaussian-ness of
GENDER-SPECIFIC adult heights, but the clear non-Gaussian-ness of the heights
of undifferentiated adults. The small simulation below makes this point with
heights.
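A small simulation, in Python; the height figures (163 cm and 177 cm, SD 6 cm)
are ballpark values chosen only for illustration. Within each gender the
heights are (made to be) Gaussian, so their excess kurtosis is near 0; the
pooled, undifferentiated heights are a mixture, and are not Gaussian.

    import numpy as np

    rng = np.random.default_rng(4)
    women  = rng.normal(163, 6, 1000)        # gender-specific heights: Gaussian
    men    = rng.normal(177, 6, 1000)
    pooled = np.concatenate([women, men])    # "undifferentiated adults": a mixture

    def excess_kurtosis(x):
        # roughly 0 for a Gaussian sample; clearly negative for this flat-topped mixture
        z = (x - x.mean()) / x.std()
        return np.mean(z**4) - 3

    for label, x in [("women ", women), ("men   ", men), ("pooled", pooled)]:
        print(label, round(excess_kurtosis(x), 2))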
The second paragraph makes some VERY important distinctions. Please try not to
mix them up. It's too bad that the authors, even though they acknowledge the
confusion the term "normal" can cause, do not themselves try to avoid the
term. I'm not sure where the term "normal" distribution originated, but my
guess is that it goes back to Quetelet and Galton and the idea of "l'homme
moyen". Lastly, on another historical note, some would argue with giving all
the credit for the equation of the "Bell Curve" to Carl Gauss. For portraits
of some of those who "discovered" the equation, see
www-groups.dcs.st-and.ac.uk/~history/PictDisplay/Gauss.html
Stigler's book gives some of the history of the equation and of the large
subsequent role of Quetelet and Galton.

5-5 Determining the best-fitting Line
=====================================
Eye-fits tend to overestimate the slope; see the Mosteller article on the web
page (I will also try to demonstrate this). The reason has to do mainly with
the criterion our eye (brain) uses. Whereas the least squares method minimizes
the average squared VERTICAL deviation of the Y's from the line, our eye uses
instead the PERPENDICULAR distances of the points from the line. Mosteller
refers to this as the major axis, or principal component. Thus, our eye would
tend to give the same line whether we asked for a fit of Y on X, or of X on Y.
The least squares line of Y on X does not have a slope which is the reciprocal
of the slope of the least squares line of X on Y. The eye-fit, using the
perpendicular deviations, is usually in between the two least squares lines.

5-5-1 Least Squares Estimator
=============================
Note that we are minimizing the sum of the squared deviations of Y from the
line. Note also that this is a purely mathematical criterion, leading to a
purely mathematical solution.

5-5-2 Minimum Variance Estimator
================================
Note that here the focus is on getting good estimates of beta_0 and beta_1 per
se, rather than on getting a line that is close to the data.

5-5-3 Least Squares Solution
============================
The method, and the proof, date back to just after the French Revolution. In
one of the most famous applications, those charged with deciding how large the
circumference of the earth was (upon which the length of the metre was based)
had to reconcile the fact that 21 observations, involving 3 unknowns, didn't
"add up", and so must have contained errors. Rather than solving the equations
3 at a time, and averaging the 7 sets of answers, Legendre arrived at the
elegant and less arbitrary Methode des moindres quarres.
www-groups.dcs.st-and.ac.uk/~history/PictDisplay/Legendre.html

A correction regarding computer packages (p. 49): SAS, SPSS, SYSTAT, MINITAB
and GLIM are available for Mac computers.

Once one has calculated the slope beta_1_hat, or b1, via equation 5.4, it is
easy to see how one obtains the fitted intercept via 5.5. One uses the fact
that the regression line passes through the point (Xbar, Ybar). Then, to find
the intercept, one simply "follows the line" until one reaches X = 0. If Xbar
is positive, one travels to the left, by a horizontal distance of Xbar. Since
the slope ("rise"/"run") is b1, the vertical drop (or rise) from Ybar is b1
times Xbar, leading to equation 5.5. A numerical sketch follows.
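A sketch, in Python, of equations 5.4 and 5.5 on a small set of invented
(age, SBP) pairs,

   b1 = sum (x - xbar)(y - ybar) / sum (x - xbar)^2      [eq. 5.4]
   b0 = ybar - b1 * xbar                                 [eq. 5.5]

together with a check that the fitted line does pass through (Xbar, Ybar).

    import numpy as np

    x = np.array([25., 35., 45., 55., 65.])          # hypothetical ages
    y = np.array([118., 130., 152., 148., 165.])     # hypothetical SBP's

    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)   # eq. 5.4
    b0 = ybar - b1 * xbar                                          # eq. 5.5
    print("b1 =", round(b1, 3), "  b0 =", round(b0, 2))
    print("fitted value at xbar:", round(b0 + b1 * xbar, 2), " = ybar =", ybar)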
Equation 5.7 "CENTERED" VERSION
===============================
This re-expression is VERY IMPORTANT. I have alluded to it earlier when
discussing Fahrenheit as a function of Celsius (and vice-versa). It is
particularly important if in fact the data are far from X = 0, i.e. if they
are, say, yearly Y's for the years X = 1970 to X = 1999, or Y = number of
hurricanes to strike the U.S. each decade (X) since the year 1900 AD. [data,
SAS program and documentation on the www page for course 626]

If we put in the data as

      X                  Y
    190  (1900-1909)     6
    191  (1910-1919)     8
    192  (1920-1929)     5
    193  (1930-1939)     8
    194  (1940-1949)     8
    195  (1950-1959)     9
    196  (1960-1969)     6
    197  (1970-1979)     4
    198  (1980-1989)     6
    199  (1990-1999)     2

then the fitted equation is (using "Ave" for "average")

   Ave Y {No. hurricanes/decade} = 76.9 - 0.3636 X

This is not very helpful, since it first requires one to substitute values for
1900 (X = 190 decades since 0 AD) and 1990 (X = 199 decades) in order to know
roughly what numbers per decade we are talking about. The intercept (estimate
76.9 of Y for the decade starting at 0 AD) is not of any interest. Moreover,
it is of more than doubtful precision, given the large extrapolation error in
projecting back that far from a (relatively) short series.

Indeed, quite apart from the statistical dangers of back-projection, this
example is interesting for its illustration of the numerical errors caused by
rounding. The fitted equation is 76.9 - 0.3636 X. If we substitute X = 190,
and use 4 decimal places in the slope, i.e. -0.3636, we get a fitted Y of 7.8
for X = 190 and 4.5 for X = 199. If we use just two decimal places for the
slope, i.e. b = -0.36, we get 8.5 for X = 190 and 5.3 for X = 199. Small
errors in the slope cause big differences when it is used to "project" the
line forward from the 1st decade of the first millennium.

How about STARTING at the year 1900 (decade 190)? If we put in the data as

      X                  Y
      0  (1900-1909)     6
      1  (1910-1919)     8
      2  (1920-1929)     5
      3  (1930-1939)     8
      4  (1940-1949)     8
      5  (1950-1959)     9
      6  (1960-1969)     6
      7  (1970-1979)     4
      8  (1980-1989)     6
      9  (1990-1999)     2

then

   Ave Y {No. hurricanes/decade} = 7.8 - 0.3636 x (decades since 1900)

You see that it makes more sense to set our "origin" at 1900. And, even if you
carry fewer decimals, say

   Ave Y {No. hurricanes/decade} = 7.8 - 0.4 x (decades since 1900)

you will not create big errors: e.g. 7.8 - 0.4 x 9 gives 4.2 for the last
decade, vs. 4.5 if you carry out the calculation with b = -0.3636. Remember
this example for when we come to discuss the structure of the formula (pp.
53-54) for the precision of the estimated intercept! This example reminds us
that the "origin" is arbitrary and that -- contrary to the impression given by
the text -- ANY sensible starting point, NOT JUST THE MEAN, works.

On a data quality issue: some of you may have already objected that the Y for
the last decade in the series may not be correct, since the decade isn't quite
over. In fact, the data only go up to 1995 (source: USA TODAY, August 1995).
For the text of the article on hurricanes, see
http://www.epi.mcgill.ca/hanley/c626/

There is one technical statistical reason to "center" the data around X = Xbar
rather than, say, X = Xmin. This is covered later, on pages 245-248 of the
text. A short sketch of the two fits, and of the rounding issue, follows.
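A sketch, in Python, of the hurricane example: the same 10 counts fitted with
the origin at 0 AD and with the origin at 1900. The slope is identical; only
the intercept (and its usefulness) changes, and rounding the slope hurts far
more when the fitted value has to be "projected" 190-odd decades away from the
origin.

    import numpy as np

    counts = np.array([6, 8, 5, 8, 8, 9, 6, 4, 6, 2], dtype=float)
    x_ad   = np.arange(190, 200, dtype=float)    # decades since 0 AD
    x_1900 = np.arange(0, 10, dtype=float)       # decades since 1900

    for x, label in [(x_ad, "origin 0 AD "), (x_1900, "origin 1900 ")]:
        b1, b0 = np.polyfit(x, counts, 1)
        print(label, ": slope =", round(b1, 4), " intercept =", round(b0, 1))

    # the rounding issue: project from the 0 AD intercept to the 1990s (X = 199)
    b0_ad = np.polyfit(x_ad, counts, 1)[1]
    for b1_rounded in (-0.3636, -0.36):
        print("slope carried as", b1_rounded, "-> fitted Y for the 1990s:",
              round(b0_ad + b1_rounded * 199, 1))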
Output 5-1 (data in Table 5-1)
==============================
The SAS commands to produce all of the output shown (as well as a lot more not
shown!!) are:

   PROC REG ALL;  MODEL SBP = AGE;

Most times, you can omit the "ALL" -- and also save some trees! If you wish,
you can use selected options. You can reach the interactive INSIGHT facility
in SAS via the Globals menu:

   Globals -> Analyze -> Interactive data analysis

See the INSIGHT Primer (in Acrobat Reader .pdf format). You can turn parts of
the output on or off. Even when one doesn't ask for extra items, most
printouts have more detail than the user requires. KKMN annotate the important
ones here. However, in the interest of completeness, I will go through them
all for this first example.

The INTERCEP shown in the first row of the output isn't really a "variable" in
the usual sense of that word. The program didn't actually follow formulae 5.4
and 5.5 to get the slope of 0.97 ... and intercept of 98.71 ... shown in the
last two rows of the table. PROC REG can handle multiple (k) X's
simultaneously, but in such cases the formulae for the various beta_hat's
cannot be written out explicitly in closed form. Instead the software uses
matrix methods, with the matrices in question having as many columns as there
are coefficients (k + 1). The first column is set to X0 = 1 for every
observation; then the regression equation can be written as

   E(Y|X1, X2, ..., Xk) = B0.X0 + B1.X1 + ... + Bk.Xk

In our example, then, we have Y = BP, X0 = 1, and X1 = Age. SAS labels the
"X0" as "INTERCEP" when giving descriptive statistics at the beginning, and it
labels "b0" as "INTERCEP" when giving the estimated coefficients.

Note that the fitted regression goes through the point X = AGE = 45.13 (XBAR),
Y = SBP = 142.53 (YBAR). To me, the data "start" in the MIDDLE of Fig 5-1.

From the Analysis of Variance table, concentrate first on the corrected total
sum of squares of 14787. This is nothing more than the sum of the squared
deviations of each Y from Ybar = 142 (I'm truncating some of the extra
decimals shown in the printout). The sum of the 30 such squared deviations
FROM (note) YBAR is 14787. Divide this by the usual 29 degrees of freedom
(only 29 of the 30 deviations are "independent") and you get 14787/29 = 509.9,
the S-squared(Y) in the descriptive statistics. We might prefer to think of
the 30 BP's as having an SD equal to the square root of 509.9, or 22.58.

At this stage, the only other statistic to note is the Mean Square for Error
of 299.7, and its square root (Root Mean Square Error, abbreviated to Root
MSE) of 17.3. This says that whereas the "global" variation in SBP in Fig 5-1
can be measured by an SD of 22.58, the "age-specific" variation in SBP is
17.3, i.e. about 23% less than the "non-age-specific" variation. Put another
way, this says that a little less than 77% of the gross SD remains
"unexplained". For technical statistical reasons, reductions in VARIANCE, and
the percent of VARIANCE that "remains", are more commonly used. So here, it
would be more usual to report that 76.5% x 76.5%, i.e. roughly 57%, of the
variance remains (the exact figure involves a small degrees-of-freedom
adjustment) and that 100% - 57% = 43% is "explained". Needless to say, the
reduction in variance looks bigger than the reduction in standard deviation.

But bear in mind how you are going to say this in the context of, say, income
or SES "explaining" a certain percentage of the variance in fertility: i.e.
the overall variance in fertility is maybe 0.67 square children per square
woman, and only 0.5 if we consider within-SES-group variation in fertility.
Some 25% of the variance is explained, but the reduction in the standard
deviation is only 1 - sqrt[0.75] = 13%. I did these calculations assuming half
the women were in one SES category, with the fractions having 0, 1 and 2
children being 1/4, 1/2 and 1/4, whereas in the other half of the women the
fractions having 1, 2 and 3 children were 1/4, 1/2 and 1/4. You might want to
check my arithmetic!!
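A short check, in Python, using the numbers quoted above from Output 5-1
(total SS = 14787 on 29 df, MSE = 299.77, n = 30), of how a modest-looking
reduction in SD corresponds to a bigger-looking reduction in variance. The
last line uses SSE = 28 x MSE, which reproduces the printout's R-squared; the
small difference from simply squaring the SD ratio is the degrees-of-freedom
adjustment (29 vs 28) mentioned above.

    import numpy as np

    total_SS, n, MSE = 14787.0, 30, 299.77
    var_overall = total_SS / (n - 1)              # 509.9; its square root is 22.58
    sd_overall, rmse = np.sqrt(var_overall), np.sqrt(MSE)

    print("overall SD =", round(sd_overall, 2), "; Root MSE =", round(rmse, 2))
    print("SD reduced by      ", round(100 * (1 - rmse / sd_overall), 1), "%")
    print("variance reduced by", round(100 * (1 - MSE / var_overall), 1), "%")
    print("R-squared = 1 - SSE/total SS =",
          round(100 * (1 - (n - 2) * MSE / total_SS), 1), "%")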
The estimates of the "intercept" and slope are 98.7 and 0.97. It goes without
saying (but I'll say it anyway!!) that it is not safe to call the 98.7 our
best estimate of the SBP of newborns. Nor, for that matter, should one say
from these data that "SBP increases as age increases". Mind you, the equation
fits well with the "100 plus your age" rule I heard once. When you add in the
Standard Errors (or, if you like, the Statistical Uncertainty) of the
estimates, the "100 plus your age" is quite a good round approximation to the
"(98.7 +/- 20) plus (0.97 +/- 0.42) times your age" one might report from the
regression analysis, using say +/- 2 standard errors for each coefficient. [In
fact, the slope and intercept estimates are not independent of each other: if
one is an overestimate, there is a greater than 50% chance that the other is
an underestimate -- but more on that later!]

You can think of the 98.7 as an estimated mean SBP for persons aged 0! Better
still, rewrite the equation as

   Ave(SBP | age) = 142.5 + 0.97 x (age - 45)

Clearly there is no point in testing the 142.5 against 0 (the null hypothesis
is false, as long as the subjects are alive!). Think of the 0.97 (or, better
still, 9.7) as the estimated difference in the average SBP of two populations
1 (10) year(s) apart in age. You can see this even better if in Fig 5-8 you
take ages 20 years apart, since there is a vertical distance of 20 mm (1 tick
mark) for this horizontal difference of 2 tick marks.

5-6 SSE & the estimator of the (common) X-specific variation of E
=================================================================
Note that the Greek sigma squared refers to the X-specific variance of the E's
(and NOT to the overall variance of the Y's). It makes sense, then, that this
sigma-squared is estimated using the deviations of each Y from its estimated
X-specific mean. It is the same logic as when, in course 607, we estimate a
"regular" variance using squared deviations from a single mean, or when, in
connection with a t-test on the means of two groups, we estimate a common
variance by pooling the within-group deviations. The difference here is that
EACH deviation is from a different mean, given by the fitted line. Since we
assume E has the same "amplitude" no matter what the X, we "collect" or "pool"
the deviations (residuals) from the line.

The SSE in the example -- about 8393, the sum of the 30 squared residuals (NOT
the 14787, which is the sum of squared deviations from YBAR) -- has only 28
independent components, so dividing this SSE (sum of squared errors) by 28
gives an "average" squared deviation (error) of 299.76. That is why the column
label is "Mean Square". Putting this label together with the row label
(Error), we get Mean Square Error (MSE), or average squared error. The
printout doesn't explicitly label the 299.76 as the MSE, but it does label the
square root of 299.76, namely 17.3, as the "Root MSE". Using the label MSE
would save some steps later: S-squared[Y|X] = MSE, and its square root
S[Y|X] = RMSE, are used extensively in the formulas for inferences concerning
the slope and the regression line (next four sections). A small worked sketch
of the bookkeeping follows.
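A sketch, in Python, re-using the five invented (age, SBP) pairs from the
earlier sketch: pool the squared residuals from the fitted line to get the
SSE, divide by n - 2 to get the MSE, and take its square root to get the Root
MSE (the estimate of sigma).

    import numpy as np

    x = np.array([25., 35., 45., 55., 65.])
    y = np.array([118., 130., 152., 148., 165.])
    n = x.size

    b1, b0 = np.polyfit(x, y, 1)
    residuals = y - (b0 + b1 * x)          # each Y minus ITS OWN fitted mean
    SSE = np.sum(residuals**2)
    MSE = SSE / (n - 2)                    # only n - 2 of the residuals are independent
    print("SSE =", round(SSE, 1), " MSE =", round(MSE, 1),
          " Root MSE =", round(np.sqrt(MSE), 2))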
Why, in this example, do we divide the SSE by n - 2 to get the (average or)
mean squared error, MSE? In 607, you learned to divide the sum of squared
deviations by n - 1 to get an unbiased estimate of sigma squared. With n = 1
observation, there is no opportunity to assess variance; with n = 2, you have
2 deviations, D1 = Y1 - (Y1 + Y2)/2 and D2 = Y2 - (Y1 + Y2)/2, BUT since D1
and D2 are mirror images of each other and add to zero, there is really only 1
INDEPENDENT assessment of variation. With n = 3, you have 2 independent
deviations, etc. But in those examples, each Y was an estimate of the SAME
mean mu, and sigma-squared was the average squared deviation about mu. Ybar is
our best estimate of this single mu, and Yi - YBAR provides an estimate of
sigma.

But now, with linear regression, each Y varies about a DIFFERENT mu. The
fitted line is our best estimate of the different mu's, and Yi - LINE provides
an estimate of sigma. When there is but one mu, it takes only one linear
combination of the Y's, i.e. (1/n)Y1 + (1/n)Y2 + ..., to estimate it. The
remaining n - 1 combinations of the Y's can be used to estimate sigma. When
there is a "line of mu's", it takes two combinations of the Y's to estimate
the line of mu's: one combination goes to estimating the slope, the other the
intercept. That leaves n - 2 independent pieces of information that can be
used to estimate the (common) sigma.

To make these ideas concrete, fill in the missing values in the following 6
situations:

situation 1
***********
    i    Y    mu_hat   E_hat   E_hat squared
   --   --   ------   ------   -------------
    1    3      5        -2           4
    2    ?      5         ?           ?
   # of INDEPENDENT E_hat's: 1      ESTIMATE of sigma-squared: ??

situation 2
***********
    i    Y    mu_hat   E_hat   E_hat squared
   --   --   ------   ------   -------------
    1    9      7         2           4
    2    4      7        -3           9
    3    ?      7         ?           ?
   # of INDEPENDENT E_hat's: 2      ESTIMATE of sigma-squared: ??

situation 3
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    3    5      5              0
    2    7   13     13              0
   NUMBER of INDEPENDENT E_hat's: 0
   FITTED LINE: mu_hat = -1 + 2X = 9 + 2(X - 5)

situation 4
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    2    9      7              2
    2    3    ?     10              ?
    3    5    ?     16              ?
   NUMBER of INDEPENDENT E_hat's: 1
   FITTED LINE: mu_hat = 1 + 3X

situation 5
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    1    3      2             +1
    2    2    4      6             -2
    3    3    ?     10              ?
    4    6    ?     22              ?
   NUMBER of INDEPENDENT E_hat's: 2
   FITTED LINE: mu_hat = -2 + 4X = 10 + 4(X - 3)

situation 6
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    0    1      2             -1
    2    2    ?      8              3
    3    4    ?     14              ?
    4    6    ?     20              ?
    5    7    ?     23              ?
   FITTED LINE: mu_hat = 2 + 3X
   (Don't spend TOO LONG on this one!)

5-7 Inferences re slope & intercept
===================================
I have already referred to the fact that the slope and intercept estimates are
linear combinations of the Y's. Thus, if the Y's have Gaussian variation, so
then will the parameter estimates. But even if the Y's are not Gaussian, the
parameter estimates, being linear combinations, will have closer-to-Gaussian
distributions, and for all practical purposes Gaussian distributions when n is
large (usually 30 or more; 50, or even 100 or more, if the distribution of the
E's is VERY highly skewed).

The denominators of equations 5.9 and 5.10 are the standard errors of the
slope and intercept estimates. In medical publications, in computer printouts,
and in some modern texts, they are referred to directly as SE's. The "S"
notation, e.g. S_subscript_beta1_hat, is very cumbersome.
Instead, one can write SE(slope estimate), etc. The n - 2 degrees of freedom
come from the fact that the MSE is calculated using n - 2 "independent"
residuals, and the square root of this (the RMSE) is substituted for sigma in
the formula for the standard deviation of the estimator. KKMN give the
(theoretical) standard deviation of the slope estimator as

   sigma / (S_X times the square root of n - 1)

Don't make a big deal of the n - 1 here! It is always better to think of
standard errors of statistics as having the square root of the sample size in
their denominators. See my 607 notes on the correlation / regression chapter
of Moore and McCabe (M&M Chapter 9) for a heuristic approach to understanding
the structure of the standard error of the slope [I write about factors that
affect the "reliability" of the slope].

5-8-2 The intercept
===================
I "second" the "In any case, the intercept (zero or not) is rarely of
interest". Given that, I urge you, whenever possible, to rewrite the
regression in the "centered" form

   mu_hat(Y|X) = Ybar + beta1_hat (X - Xbar)

5-9 Inferences concerning the line
==================================
I like the way the authors write the equation for the theoretical (but
unobservable) line,

   mu(Y|X) = beta0 + beta1 X

It is a pity that they don't continue in this vein and write equation 5-13 as

   mu_hat(Y|X) +/- "t times its SE"   (i.e. +/- t.SE)

5-10 A new value of Y at X0
===========================
The limits for this new Y (for an INDIVIDUAL) are often confused with the
confidence limits for mu(Y|X0). The book by Neter has good exercises which
help distinguish the two concepts [my best example is what to say to the judge
regarding the alcohol and eye movement data, or what to tell parents, on the
basis of some predictive model, as to when their infant might first start
sleeping through the night]. See the exercises on Chapter 5. The sketch at the
end of these notes contrasts the two intervals numerically.

In the problems (starting at p. 60), the examples and/or wording are not
always that compelling as to whether prediction for the individual, or
estimation of the mean of all individuals at that X value, is the more
appropriate task. For example, in problem I(f), p. 62, the object is the MEAN
response, so the wording would be better if it referred to the mean response
for 8-day-old chicks (it doesn't make much sense to ask about the MEAN
response for ONE chick!). Problem 6 (pp. 69-70) is a good example of how the
focus might well be on the individual, but the question posed is about the
MEAN duration of sleep in children of a certain age; after all, what good are
confidence limits on the mean when parents are trying to tell their child that
(s)he is "out of line"? There is the same tendency in the medical literature
to present confidence intervals for the mean Y at a given X when the focus is
on individual patients. Can you find some examples? (Hint: look at the
presentations of data on method-substitution studies, e.g. pulse oximetry
versus blood levels, or bilirubin by noninvasive versus invasive methods.)
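A final sketch, in Python, re-using the same five invented (age, SBP) pairs,
of the distinction in section 5-10: the confidence interval for the MEAN of Y
at X0 = 50 versus the (much wider) prediction interval for a single NEW
individual's Y at that same X0.

    import numpy as np
    from scipy.stats import t

    x = np.array([25., 35., 45., 55., 65.])
    y = np.array([118., 130., 152., 148., 165.])
    n, x0 = x.size, 50.0

    b1, b0 = np.polyfit(x, y, 1)
    rmse = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
    Sxx  = np.sum((x - x.mean())**2)
    tcrit = t.ppf(0.975, n - 2)                 # t multiplier with n - 2 df

    fit = b0 + b1 * x0
    se_mean = rmse * np.sqrt(1/n + (x0 - x.mean())**2 / Sxx)      # for mu(Y|X0)
    se_new  = rmse * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / Sxx)  # for a new individual
    print("95% CI for the mean at X0:",
          round(fit - tcrit*se_mean, 1), "to", round(fit + tcrit*se_mean, 1))
    print("95% prediction interval  :",
          round(fit - tcrit*se_new, 1), "to", round(fit + tcrit*se_new, 1))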