Notes on KKMN Chapter 8

See:
- Moore and McCabe (IPS), Chapter 9.2 (2nd Ed) or Chapter 11 (3rd Ed)
- my notes from 607 (on the web page)
- the book Statistical Methods for Comparative Studies by Anderson S, Auquier S, et al. It gives a very good overview (non-mathematical) of how important multiple regression is in adjusting for confounding.
- the article Appropriate Uses of Multivariate Analysis (Hanley, Annual Review of Public Health, 1983) in www/epi.mcgill.ca/~web2/courses/c697/datasets.html ...

It is important, before getting into the technicalities in Chapters 8 onwards, to understand WHY we use multiple regression.

8-1 PREVIEW
===========

1. The "best" model: that might not be the ultimate aim.

2. Visualization. It becomes less of a problem if one focuses on 2-D plots and on the residuals.

3. Meaning/Interpretation. Even in lower-dimensional models, one needs a balance between pure expediency, mathematical fits, approximations, and biological reality. I think it was the statistician George Box who said something like "All models are wrong, but some are more helpful than others".

4. Computation. In fact, if one had a lot of patience, one could fit a multiple regression as a series of simple regressions. (See the Weight[age, height] example in my notes on multiple regression in 607, M & M Chapter 2/9.) From it, can you see how this could be extended to 3 or more "X" variables? A short SAS sketch of this idea appears after the digitized-image example in Section 8-3 below.

8-2 MULTIPLE REGRESSION MODELS
==============================

As the authors make clear, ultimately the "size" of the model is the number of terms, and thus coefficients, in the equation, not the number of "basic" variables. Miettinen, in the chapter on regression analysis in his textbook Theoretical Epidemiology, makes a distinction between subject characteristics (determinants) and their translation into STATISTICAL VARIATES, the actual terms in the regression model. A single determinant, smoking, might lead to 4 or 5 terms in the regression model. Obviously, one cannot use a very high order polynomial in a single basic variable to "make a silk purse from a sow's ear".

Example 8.1
===========

It worries me a bit that the authors start out with models with up to 5 terms (the last equation at the top of p. 113, involving X1, X2, their squares and their product) for a little dataset of n = 12 observations. As we will discuss in later chapters, there is a big danger of "overfitting" in such situations, and what is "the best predictive model" for these 12 observations may be very poor in other individuals of the same type as studied here.

8-3 GRAPHICAL LOOK AT THE PROBLEM
=================================

It is going a bit outside of regression models, but another way to think of the case of k = 2 is to think of a digitized image, such as you might obtain if you scanned a photograph into a computer file, using say 256 levels of gray. Then each pixel has a colour (Y) between 0 (say white) and 1 (black), or between 00000000 (0) and 11111111 (255) if using the 8-bit gray scale. The "address" of a particular pixel is in terms of its distance from, say, the bottom left corner (H pixels horizontally, V pixels vertically). If the pattern of colours was sufficiently smooth, one might be able to "fit" (predict) the colour Y of each pixel using a function of H and V. Admittedly one might need a very large number of terms, involving H and V, in the model. In practice, regression functions of H and V are not the most efficient way to "capture" or approximate the signal in an image, but the idea is still the same -- the prediction of Y (here colour) from H and V, with as little "error" as possible.
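Here is the sketch promised in point 4 of the Preview: fitting a multiple regression as a series of SIMPLE regressions. Regress weight on age and keep the residuals; regress height on age and keep the residuals; then regress the first set of residuals on the second. I am assuming the Table 8-1 data have been read into a SAS dataset called KIDS with variables WGT, HGT and AGE -- the dataset and variable names are my own, so adjust them to whatever you use.

   proc reg data=kids noprint;            /* step 1: weight, with age "removed"   */
      model wgt = age;
      output out=res_w r=rw;              /* rw = residual of WGT on AGE          */
   run;

   proc reg data=kids noprint;            /* step 2: height, with age "removed"   */
      model hgt = age;
      output out=res_h r=rh;              /* rh = residual of HGT on AGE          */
   run;

   data both;                             /* put the two sets of residuals        */
      merge res_w(keep=rw) res_h(keep=rh);/* side by side (same observation order)*/
   run;

   proc reg data=both;                    /* step 3: a SIMPLE regression of       */
      model rw = rh;                      /* residual on residual                 */
   run;

The slope from step 3 equals the coefficient of HGT in the two-variable model WGT = HGT AGE (the standard error is slightly off, because this route does not keep track of the degrees of freedom already used up in steps 1 and 2).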
Another way to think of Figure 8-2, one that will help when it comes to the "stability" of a fitted model, is to think of the surface as a trampoline (or flat hammock!) that is being held in place by a series of ropes, with each rope secured to a data point. Imagine the effect of removing one data point (i.e. cutting one rope)! If the data points ("support" for the surface) have co-ordinates along the (X1, X2) diagonal, i.e. mostly along a "South-west to North-east" corridor, there will be considerably more instability. We will use this idea again when we come to "collinearity" in Chapter 12, section 5.

How much can we learn from X1 and X2 separately? (p. 114)
=========================================================

It is IMPORTANT at this stage to UNDERSTAND what is fundamentally different between the following examples:

(i) the 1986 Canadian birthweight data discussed in the notes I had made for Section 9.2 (multiple regression) in 607 [weights as a function of X1 = gestational age and X2 = gender]

(iia) the weights of Alberta children aged 11-16 as a function of X1 = age and X2 = height [in the same notes]

(iib) example 8.1 in the KKMN text, again with weight as a function of age and height.

In the birthweight data, one can estimate the gender difference pretty well just by subtracting the average female weight from the average male weight. One doesn't need to do so "age-specifically" (i.e. taking the difference within each gestational age separately, then averaging, in some way, the age-specific differences). Likewise one can look at the effect of gestational age using the two genders combined and get an answer that is very similar to those which one would get if one looked at the age effects gender-specifically (i.e. within each gender).

However, the same indifference of the answers to how one estimates beta_1 is not true for the "weight as a function of age and height" data. The slope of weight on height is 3.363 lbs/inch if one uses ALL of the data, WITHOUT taking age into account. But the slope is only 2.725 lbs/inch if one takes age into account. You see the same pattern in the textbook example: 1.07 lbs/inch if one ignores age (Model 1, page 121) but only 0.72 lbs/inch if one takes age into account (Model 4, page 122).

The reason that the slope estimate changes in examples (iia) and (iib), but does not in (i), has to do with the profile of the other X variable among those involved in the contrasts implied by the "slope". Effectively -- or conceptually -- the slope of 3.36 lbs/inch was obtained by contrasting the weights of those of a certain height with the weights of those 1 inch taller. But those who are 1 inch taller also tend to be older, so it is not a "pure" contrast, whereas the 2.725 lbs/inch is (if our model is correct!). The same applies in the textbook example: taller children tend to be older, so again the crude slope of weight on height is not "pure". And the same applies if we turn Example 8.1 round the other way, and look at the slope associated with age when we do not, or do, take height into account: 3.64 lbs/year if we ignore height and 2.05 lbs/year if we take it into account.

But we see no such patterns with the birthweight data ... we can estimate the "pure" gender difference even when ignoring gestational age, and the "pure" age effect when ignoring gender, i.e. pooling all the data together. The reason this works here is that when we compare the weights of ALL of the infants of a certain gestational age with the weights of ALL of the infants who were born 1 week later, the male:female mix is about the same each week. Likewise, if we compare the weights of ALL boys with the weights of ALL girls (regardless of gestational age), the boys have a gestational-age profile very like that of the girls -- i.e. it is not that boys are born later!
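If you want to see this "crude versus adjusted" contrast for yourself, two MODEL statements in one PROC REG will do it. Again I am assuming a dataset KIDS with variables WGT, HGT and AGE (my names, not SAS's); with the KKMN data the height coefficient should drop from about 1.07 to about 0.72 when AGE is added.

   proc reg data=kids;
      model wgt = hgt;        /* Model 1: the "crude" slope of weight on height      */
      model wgt = hgt age;    /* Model 4: the height slope after age is "taken       */
   run;                       /*   into account" -- contrast the two HGT coefficients */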
8-4 ASSUMPTIONS OF MULTIPLE REGRESSION
======================================

These are essentially the same as for simple linear regression in Chapter 5, so I won't belabour the points I made there.

As for 2 (Independence), I cannot remember if I gave my usual example of how one can read a paper, from a laboratory which you know has only 7 persons, in which there are a total of 20 observations, and in which the methods state that the "subjects" were the laboratory personnel who volunteered for the study!

3 (Linearity). As the authors partly explain, the "linearity" is at two levels.

First, STRICTLY SPEAKING, linearity refers to the beta's, not to the X's: the mean is a linear combination of the beta's. (See the footnote; or see Miettinen p. 217, or my notes on simple linear regression for Moore and McCabe's Chapter 9.) Linearity in the beta's allows us to fit them by least squares in one pass through the data, and to get standard errors easily, etc.

Second, most users think of linearity in terms of whether -- even if there is just one X -- the conditional mean mu(Y|X) is a straight-line function of X. Even with all the other X's in the model, one can still think of "freezing" them at certain values, concentrating only on the Y variation with respect to the X that one is allowing to vary, and asking whether, in this restricted situation, mu(Y | X in question, all other X's held constant) is a linear function of the X in question.

The authors do a good job of explaining that the polynomial regression model

   mu(Y|X) = beta_0 + beta_1.X + beta_2.X^2

describes a CURVE in X, but is technically a multiple LINEAR regression model, just as if the model had been

   beta_0 + beta_1.Z1 + beta_2.Z2

Indeed it might have been good to have used different letters for the "basic" variables than for the variables actually used in the equation. You will find that in SAS, or any other statistical package, if you want to use higher order terms you will have to "create" them and give them new names -- the regression package will not know how you created them! (A short SAS sketch of this appears just after Section 8-4-2 below.)

4 (Homoscedasticity)
====================

I wish the authors had not used equation 8.3, but rather the "equivalent" version at the top of the next page. Even though 8.3 is technically correct, most students remember the "Y" part of "Y|X1, X2, ..., Xk" but forget the critical "|X1, X2, ..., Xk" part. IF YOU ARE GOING TO FUSS ABOUT THE CONSTANCY (AND GAUSSIAN-NESS) OF THE VARIATION, THINK OF THE VARIATION OF THE E'S, NOT THE Y'S.

5 (Normality)
=============

Again, ask yourself if the Central Limit Theorem is likely to "kick in" when using the t-tests, CI's, etc. concerning the betas. And ask yourself if you want to get the mu's approximately correct before worrying about the sigma's.

8-4-2 (Summary and Comments)
============================

The assumption of a Gaussian distribution is "needed" here to the same extent that it is needed for the 2-sample t-test --- AND HOW BADLY IS IT "NEEDED" THERE?
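As promised above, here is what "creating the terms yourself" looks like in SAS. The dataset and variable names (KIDS, WGT, HGT, AGE) are again my own assumptions; the point is only that AGESQ and HGT_AGE are, as far as PROC REG is concerned, just two more variables.

   data kids2;
      set kids;
      agesq   = age*age;      /* the AGE^2 term of Example 8-2                     */
      hgt_age = hgt*age;      /* a product term, for effect modification later on  */
   run;

   proc reg data=kids2;
      model wgt = hgt age agesq;   /* a curve in AGE, but still LINEAR in the betas */
   run;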
8-5-3
=====

1. To make this concrete, suppose one had a dataset with 4 observations:

   i   X1 (Gender)   X2 (Gest-Age)   Y (birthweight)
   -   -----------   -------------   ---------------
   1        0            36 (0)        2950 grams
   2        0            36 (0)        3050 grams
   3        0            40 (4)        3500 grams
   4        1            40 (4)        3600 grams
   --------------------------------------------------
   (X1 is an indicator variable for males, and X2 is in weeks)

The fitted equation is

   mu(Y|X1, X2) = -1500 + 100.X1 + 125.X2

If instead we "start" X2 at 36 weeks, so we have X2* = X2 - 36, then

   mu(Y|X1, X2*) = 3000 + 100.X1 + 125.X2*

In the latter formulation, we get

   beta_0_hat =  1/2 . Y1 + 1/2 . Y2 + 0 . Y3 + 0 . Y4 = 3000
   beta_1_hat =    0 . Y1 +   0 . Y2 - 1 . Y3 + 1 . Y4 = 100
   beta_2_hat = -1/8 . Y1 - 1/8 . Y2 + 1/4 . Y3 + 0 . Y4 = 125

You can "see" the logic behind the 3 combinations: beta_0_hat is the average of the two females at 36 weeks; beta_1_hat is the difference between the male and the female at 40 weeks; beta_2_hat is the difference between the female at 40 weeks and the average of the 2 females at week 36 -- i.e. Y3 - (1/2)(Y1 + Y2), all divided by 4 because of the 4-week difference in gestational age.

8-5-3
=====

2. Multiple Correlation Coefficient
===================================

This would have been another criterion by which to fit the regression equation: choose the coefficients that maximize the correlation of the predicted and observed values! The multiple correlation coefficient is a useful scalar (1-dimensional) value that gives the big picture without getting into the specifics of how much each X variable contributes to the Y-hat.

The comment about the mean of the predicted Y values equalling the mean of the observed Y values is reminiscent of the fact that in simple linear regression the line goes through (Xbar, Ybar), and that AVE{Yhat} = AVE{Y}. The 2 estimating equations for the simple linear regression case are

   Sum{ (Y - Yhat) } = 0   and   Sum{ X(Y - Yhat) } = 0

In other words, the residuals must add to zero, and the X-weighted sum of the residuals must also be zero. For multiple regression, with variables X1, X2, ..., Xk, one simply extends these to k + 1 estimating equations:

   Sum{    (Y - Yhat) } = 0
   Sum{ X1.(Y - Yhat) } = 0
   Sum{ X2.(Y - Yhat) } = 0
     .
     .
   Sum{ Xk.(Y - Yhat) } = 0

When one substitutes b0, b1, ..., bk and the X's into Yhat, one gets what are called the "Normal Equations" (I don't remember why they are called "Normal"):

   Sum{ Y    } = n.b0       + b1.Sum{X1}    + ... + bk.Sum{Xk}
   Sum{ X1.Y } = b0.Sum{X1} + b1.Sum{X1^2}  + ... + bk.Sum{Xk.X1}
   Sum{ X2.Y } = b0.Sum{X2} + b1.Sum{X1.X2} + ... + bk.Sum{Xk.X2}
     .
     .
   Sum{ Xk.Y } = b0.Sum{Xk} + b1.Sum{X1.Xk} + ... + bk.Sum{Xk^2}

8-5-3 Multivariate Normal Distribution
======================================

I wouldn't worry about this. In any case, it is difficult to imagine k "X" variables and 1 "Y" variable that are jointly multivariate Gaussian.
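You can check the arithmetic above -- the fitted coefficients, the estimating equations and the normal equations -- with a few lines of PROC IML, using the 4-observation dataset from item 1. This is only a sketch; SOLVE is simply doing what the normal equations describe.

   proc iml;
      /* columns of X: intercept, X1 (male indicator), X2* = gestational age - 36 */
      X = { 1 0 0,
            1 0 0,
            1 0 4,
            1 1 4 };
      Y = { 2950, 3050, 3500, 3600 };
      b = solve(X`*X, X`*Y);      /* the normal equations: (X'X)b = X'Y           */
      print b;                    /* should print 3000, 100, 125                  */
      resid = Y - X*b;
      check = X`*resid;           /* each estimating equation sums to zero        */
      print check;                /*   (up to rounding)                           */
   quit;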
Example 8-2 (Fitted equation with Height, Age, and Age^2)
=========================================================

I'm a big proponent of, whenever possible, drawing the fitted equation through the data points. (Statisticians joke about how engineers tend to plot the line first and put the data points in afterwards!) Here, with 1 Y and 3 X's (2 of the X's derived from Age), it is not that easy. But one can choose some specific heights, and graph the fitted average weight vs. Age relation for these selected heights. For example, take heights of 45, 50, 55, and 60 inches. Then the fitted equations are

   3.438 + 0.724(45) + 2.777Age - 0.042Age^2   i.e.   36.0 + 2.777Age - 0.042Age^2
   3.438 + 0.724(50) + 2.777Age - 0.042Age^2   i.e.   39.6 + 2.777Age - 0.042Age^2
   3.438 + 0.724(55) + 2.777Age - 0.042Age^2   i.e.   43.3 + 2.777Age - 0.042Age^2
   3.438 + 0.724(60) + 2.777Age - 0.042Age^2   i.e.   46.9 + 2.777Age - 0.042Age^2

Now at ages 6, 9 and 12, the 2.777Age - 0.042Age^2 values are 15.2, 21.6 and 27.3, so the fitted values for weight (in lbs) are

   Height 60"    62.1     68.5     74.2
   Height 55"    58.5     64.9     70.6
   Height 50"    54.8     61.2     66.9
   Height 45"    51.2     57.6     63.3
                 ======================
                 Age 6    Age 9    Age 12

Mind you, SAS INSIGHT (and SAS GLM and REG, if you ask) produce these fitted values directly. But you will need to be able to do this by hand when we come to equations with product terms, to accommodate effect modification. For a depiction of the fitted equation, see the www page. See if you can distinguish it from the simpler one for Model 4.

8-6 ANOVA for MULTIPLE REGRESSION
=================================

In one sense, the extension to k (> 1) variables is trivial. The partitioning of the total variation into the "regression" and "residual" sums of squares is identical to the "1 X" situation. The degrees of freedom partition in the same way. The only apparent difference (and it isn't really, if one keeps track of how many degrees of freedom are taken up in fitting the model) comes with the F test for testing whether the fitted equation is better than the "null" model

   mu(Y|X1, X2) = B0 + 0.X1 + 0.X2 = B0.

In another sense, though, the fact that k > 1 means that there isn't just one question one can ask. Is a model with X1 and X2 better than one with X1? Is a model with X1 and X2 better than one with X2? Is a model with X1, X2, X3 better than one with X3? etc. These questions will be dealt with in the next chapter.

Re. Equation 8.7 (R-Squared)
============================

Again, I would try to put the sum of squares due to the regression in positive terms, e.g. SUM{ (Yhat - Ybar)^2 }, the sum of the squared deviations of the fitted values from the overall mean (the "amplitudes" of the fitted points).

8-7 Numerical Examples
======================

The authors apologize that they ONLY fitted 6 models, and that many more are possible. But I find that some of the ones they fit, especially going to Models 3 and 5 before, say, models involving powers of height, are not that inspired, and are simply ringing the changes on all the combinations.

From first principles, and just thinking about it, I would argue that weight is primarily a function of height, but that it can't just be linear in height itself. If anything, if to a first approximation we think of the person as a "cylinder", wouldn't weight be proportional to volume, which in turn is related to height (and "diameter")? The fact that height alone does a reasonable job is probably for the same reason that, for example, the perimeter (P) of a rectangle is a reasonably good predictor of the AREA of the rectangle -- within a certain range of rectangle sizes.

Take the following example (modelled after one on p. 199 of the excellent introductory text "Statistics" by Freedman, Pisani et al., 2nd edition, Norton, New York, 1991). Here are the dimensions of several rectangles:

   rectangle    L=length   W=width   P=perimeter   A=area
    number        (in)       (in)       (in)       (sq in)
   --------------------------------------------------------
       1           11        8.5         39          93.5
       2            3        5.0         16          15.0
       3            4        4.0         16          16.0
       4            8        8.0         32          64.0
       5           11       14.0         50         154.0
       6            7        5.0         24          35.0
       7            9        5.0         28          45.0

Fitting the area as a linear function of length and width gives

   A-hat = -51.4 + 4.7L + 10.8W     R-Square = 0.9923 ; Adj R-Sq = 0.9884

In these examples, the simple-minded (but incorrect) model is a good approximation to the true model mu[log Area] = 1.log[L] + 1.log[W].
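The rectangle example is small enough to try directly. The DATA step below types in the 7 rectangles from the table; the first MODEL statement checks how well the perimeter alone predicts the area, the second should reproduce the fitted equation and R-square quoted above, and the third fits the "true" multiplicative model on the log scale (both slopes come out at 1, and the fit is exact, since Area = L x W exactly).

   data rect;
      input L W A;                 /* length, width, area, from the table above */
      P    = 2*(L + W);            /* perimeter                                 */
      logA = log(A);  logL = log(L);  logW = log(W);
      datalines;
   11 8.5  93.5
    3 5    15
    4 4    16
    8 8    64
   11 14  154
    7 5    35
    9 5    45
   ;
   run;

   proc reg data=rect;
      model A = P;                 /* perimeter as a stand-in for area           */
      model A = L W;               /* should give A-hat = -51.4 + 4.7L + 10.8W   */
      model logA = logL logW;      /* the "true" model: log A = log L + log W    */
   run;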
The "residuals" in the fitted model are not "errors" of nature, but rather are induced by our imperfect model. Remember Box's "all models are wrong, but ..."

I don't know if the 12 observations in Table 8-1 are "real" or "made-up", but if they are real, I would go with a model such as

   Weight/(Height^3) = B0,   or   = B0 + B1.Age,  etc.

or one might use the square, rather than the cube, of height. (A short SAS sketch of this appears at the end of these notes.) Note that even if we are strictly mathematical and non-biological about it, the models with more terms in them don't do appreciably better than the one with just height and age. And don't be impressed by the increasing R-squares. The R-square MUST increase as terms are added -- but, as we will see, the model may do less well at predicting new observations. That is the point of the "adjusted" R-square (Adj R-Sq).

And finally, speaking of models, recall Galton's calculation of the relationship between Y = child's (adult) height in inches, X1 = average of the heights of the child's parents, and X2 = the gender of the child (0 if female, 1 if male). Galton wanted to come up with ONE correlation, rather than 2 gender-specific correlations. If we were doing this analysis today, using what WE now call multiple REGRESSION (we took the word from Galton, but changed its meaning!), we might get an equation like

   mu(Y|X1, X2) = 64 + 0.6(X1 - 66) + 4.X2

Or, if we "partialled out" the influence of gender on height, we might be left with a single partial correlation of 0.65 between mid-parent and child's heights.

But notice what Galton did! If he had ignored gender, and just computed the correlation in all 928 observations pooled together, he would undoubtedly have gotten a lower correlation than the 0.67 he observed. The reason for this is shown in my "correlations obscured" diagram in the 607 notes, with the mid-parent height on the vertical axis and the child height on the horizontal: the left cloud of data is for females, and the right cloud is for males, so that the merged cloud shows less correlation than the gender-specific correlations. Instead, "all female heights were multiplied by 1.08 before tabulation". In effect, this is the mathematical equivalent of "turning females into males". In fact, it is a more elegant way than what we do by fitting

   64 + 0.6(ParentHeight - 66) + 4 if male.

On the basis of this model, we mathematically turn females into males by adding 4 inches to their heights. But Galton's "proportional scaling" is more elegant, and more biologically appealing, than ours:

   female height   "male equivalent"              "male equivalent"
                   Galton's model (x 1.08)        our model (+ 4")
   -----------------------------------------------------------------
        55"               59.4"                         59"
        60"               64.8"                         64"
        65"               70.2"                         69"
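Finally, the SAS sketch promised above for the "cylinder" idea. The dataset and variable names are again my assumptions (KIDS, with WGT in lbs, HGT in inches, AGE in years); the point is just to create the scaled response and see how much work is left for AGE to do.

   data kids3;
      set kids;
      w_h3 = wgt / (hgt**3);    /* Weight/(Height^3): roughly constant if the child  */
      w_h2 = wgt / (hgt**2);    /*   were a cylinder of fixed diameter-to-height     */
   run;                         /*   ratio; the squared-height alternative as well   */

   proc reg data=kids3;
      model w_h3 = age;         /* Weight/(Height^3) = B0 + B1.Age                   */
      model w_h2 = age;         /* Weight/(Height^2) = B0 + B1.Age                   */
   run;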