Notes on KKMN Chapter 9                                         1999.05.26

Preamble
========

The ideas in this Chapter are unique to Multiple Regression -- there is no
analogy with Simple Linear Regression. The topic is introduced in a very
clear way right on the first page of the chapter, and the statement "each
test can be interpreted as a comparison of two models" on p. 137 is key in
not getting lost in the details. I would skip section 9-3-4 at first
reading.

Chapter 7 (Multiple Regression - II) of the Neter text is good on this
topic too. It includes what -- to me at least -- is a very clear diagram
for explaining what it calls the "Extra Sums of Squares". I have included
this diagram elsewhere on the web page.

9-1 Preview
===========

Of the three types of tests, it is my experience that "2. Test for
addition of a single variable" is the most commonly used and the most
important. It is also the only one of the three that can be performed
using the familiar t-test -- the others require the F-test.

Note that when KKMN write about a "single" VARIABLE, they are really
referring to a "single" TERM in the equation. If a categorical variable
with 3 or more levels, such as blood-type or socio-economic status, is
represented by several indicator terms ("dummy variables"), then
situation 3 ("test for addition of a group of variables") applies. Thus,
one cannot get by just by learning how to deal with situation 2!

The unifying feature is the "larger model vs. smaller model" or "full
model vs. reduced model" idea introduced at the top of page 137. Indeed,
all three tests can be put in this common framework:

1. (B1 = B2 = B3 = ... = Bk = 0)
   ie. B0 + B1.X1 + B2.X2 + B3.X3   vs.   B0

2. (B2 = 0)
   ie. B0 + B1.X1 + B2.X2 + B3.X3   vs.   B0 + B1.X1 + B3.X3

3. (B1 = B3 = 0)
   ie. B0 + B1.X1 + B2.X2 + B3.X3   vs.   B0 + B2.X2

In 1, all beta's (B's) are being constrained; in 2, only one is
constrained; in 3, two parameters are constrained. (A small computational
sketch of these full vs. reduced comparisons appears at the end of this
section.)

One important point to note here is that the "smaller" model must be a
special case (subset) of the larger one. Thus, for example, the chapter
does not deal with a test of the two models

   B0 + B1.X1 + B3.X3   vs.   B0 + B2.X2 + B4.X4

Note also that the beta's are always context dependent, so that the B1 in
the model B0 + B1.X1 + B2.X2 does not have the same meaning as the B1 in
the model B0 + B1.X1.
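To make the "full vs. reduced" idea concrete, here is a minimal sketch --
not part of KKMN or of these notes' SAS-based examples -- written in
Python's statsmodels with simulated data; the variable names (y, x1, x2,
x3) are purely illustrative. Each of the three tests is simply a
comparison of a reduced model against the same full model.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # simulated, purely illustrative data (x3 contributes nothing to y)
    rng = np.random.default_rng(1)
    n = 100
    df = pd.DataFrame({"x1": rng.normal(size=n),
                       "x2": rng.normal(size=n),
                       "x3": rng.normal(size=n)})
    df["y"] = 2 + 1.5 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

    full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

    # 1. overall test (all betas = 0): reduced model has the intercept only
    print(anova_lm(smf.ols("y ~ 1", data=df).fit(), full))

    # 2. one term (B2 = 0): reduced model drops x2 only
    print(anova_lm(smf.ols("y ~ x1 + x3", data=df).fit(), full))

    # 3. a group of terms (B1 = B3 = 0): reduced model keeps x2 only
    print(anova_lm(smf.ols("y ~ x2", data=df).fit(), full))

In each case the program reports the difference in residual SS between
the two models, the degrees of freedom involved, and the resulting F and
p-value -- the same pieces KKMN assemble by hand.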
9-2 Test for Overall Regression
===============================

I consider this the least useful of the 3: we are seldom in the situation
where we have no predictors; the "global" nature of the alternative
hypothesis, namely that AT LEAST one of the beta's is nonzero, is not that
helpful; and indeed, if one's focus is on the contribution of one
particular variable, doing a global test is a poor way to assess it.
Indeed, I would venture to say that most of the interesting tests have to
do with a "clean" comparison involving just 1 variable.

Before going to the mechanical part, one word about semantics. KKMN state
the null hypothesis as "All k X's considered together do not explain a
SIGNIFICANT amount of the variation in Y". This use of the word
"significant" mixes metaphors, so to speak. We teach in 607 that
statistical hypotheses concern PARAMETER values. Thus, it is correct to
postulate that "all k betas equal 0" or "at least one beta is nonzero".
Such statements have nothing to do with data! But when we introduce the
word "significant", we switch over to the world of statistics (ie. numbers
calculated from empirical data, or samples). And STATISTICAL
"significance" simply refers to a statistic exceeding some threshold.

Thus, if one wants to be technically correct, the authors' "general"
statement of H0 is not accurate. Better that they stay with "all beta's
equal to zero", or

   sigma-squared (Y | X1, ..., Xk) = sigma-squared (Y),

or "X1, ..., Xk do not explain ANY of the variation in Y" (they may appear
to in a particular finite sample, but with n = infinity, the R-squared is
zero!).

Equation 9.1:  F = MS(Regression) / MS(Residual)
================================================

This F value is the very first tabulated item in most regression
printouts. And, as I've said, it is often the least interesting! Note that
the accompanying p-value refers to an "omni-directional" alternative
hypothesis. [In the "chi-square" statistic, LARGE values are evidence
against the null; large negative and positive differences square to give a
large value in the upper tail, i.e. a 2-SIDED hypothesis is judged using
just the 1 (upper) TAIL of the reference distribution.] Here we have the
same thing, but now with k betas at once: any departure from zero, on the
negative or positive side, of any one beta, will create a LARGER expected
F value.

Equation 9-2:  F in terms of R-squared
======================================

   F = [ R-squared / k ]  /  [ (1 - R-squared) / (n - k - 1) ]

I don't see why one would bother with this representation, unless one
likes to think of the numerator as the "R-squared per variable" and the
denominator as the "unexplained variance per remaining degrees of
freedom".

"In interpreting the results of this test ..."
==============================================

Note the use of the word "significantly" here again. Just like in 607,
where you learned to distinguish "statistical significance" from
real-world significance, you should get in the habit of making the same
distinctions here. It is easy to slip into a "shorthand" among ourselves,
but this wording may give the wrong impression to the lay public. The word
"significant", or "statistically significant", could be profitably
replaced by "non-zero", i.e. we conclude that X1, X2 and X3 are "better
than nothing" in predicting Y. This wording prompts the natural question
"HOW MUCH better?".

9-3 Partial F test
==================

Given that the computer can be used to get us exactly what we need for the
test, and that we don't have to go and reconstruct from elsewhere the
pieces we need, this presentation is a bit long-winded. The procedure is
neatly summarized in Equation 9.4:

                                SS (X* | X1, ..., Xp)
   F (X* | X1, ..., Xp)  =  --------------------------------
                             MS Residual (X1, ..., Xp, X*)

The numerator is the "extra sum of squares" due to the addition of X*.
(We will see later that it is actually a mean square as well: it has a
hidden divisor of "1".) The denominator is the mean square residual
(i.e. the "average" squared residual) in the larger model.

One shouldn't have to obtain the MS residual "indirectly" (line 9 of
p. 141). One can fit the larger model, with the variable of interest
entered last. That way, one gets the mean square residual directly, and
the extra sum of squares will be the SS in the last line above it [in both
the "type I SS" and "type III SS" versions of the partitioned ANOVA
table]. In fact, one won't have to do any calculations: the F ratio, and
the associated p-value, will be shown on the output.
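As a concrete (and again purely illustrative, simulated) sketch in the
same statsmodels setup -- not from KKMN, where the worked examples are in
SAS -- here is the partial F test for a single added variable X*, and its
connection to the t-test of section 9-3-3 below: the squared t statistic
for the added variable equals the partial F.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # simulated, illustrative data; x_star plays the role of X*
    rng = np.random.default_rng(2)
    n = 80
    df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
    df["x_star"] = 0.6 * df["x1"] + rng.normal(size=n)
    df["y"] = (1 + df["x1"] - 0.5 * df["x2"]
               + 0.8 * df["x_star"] + rng.normal(size=n))

    reduced = smf.ols("y ~ x1 + x2", data=df).fit()
    full = smf.ols("y ~ x1 + x2 + x_star", data=df).fit()

    # F(X* | X1, X2) = extra SS (with its hidden divisor of 1)
    #                  over the MS residual of the larger model
    extra_ss = reduced.ssr - full.ssr
    print((extra_ss / 1) / full.mse_resid)

    # the program gives the same F directly ...
    print(anova_lm(reduced, full))

    # ... and the t statistic for x_star in the larger model, squared,
    # is that same F
    print(full.tvalues["x_star"] ** 2)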
Type I and Type III SS
======================

The text doesn't go into them in any detail until section 9-5-1 (p. 146),
but it is important at this point to distinguish between the "type I" and
"type III" SS.

Type I SS is the "cleaner" one, in that it is an "orthogonal" partition of
the regression SS, i.e. the components add up exactly to the regression
SS. The order corresponds EXACTLY to the order in which YOU type or
"click" the variables into the regression model. So (usually) the type I
SS partition will be different for the two models below. The corresponding
partitions of the SS regression are

   Weight = Height Age        Weight = Age Height
   -------------------        -------------------
   SS (Height)                SS (Age)
   SS (Age | Height)          SS (Height | Age)

The diagram from Neter explains this well.

In contrast, the type III SS are not orthogonal partitions: i.e. they do
not usually add up to the regression SS (a useful way to check which type
you have when the partition is not clearly labelled); they may add up to
more or to less than the regression SS. The type III SS are independent of
the order in which you type or click the variables into the computer
program. The reason is that they consider a what-if scenario -- namely,
what if the variable in question were entered LAST in the list? So we will
get the same SS numbers whether we ask for the model Weight = Height Age
or Weight = Age Height. For the type III SS we will get

   Weight = Height Age        Weight = Age Height
   -------------------        -------------------
   SS (Height | Age)          SS (Age | Height)
   SS (Age | Height)          SS (Height | Age)

9-3-3 The t-test alternative to the partial F test
==================================================

This is in fact what most people use. The advantage is that the test is in
the "beta_hat divided by its standard error" scale, and that one sees
whether the sign of the beta_hat is negative or positive. The analogy with
607 is in using an F-test (with 1 and n1 + n2 - 2 df) rather than a t-test
to compare two sample means ybar_1 and ybar_0. Both give the same p-value
if you pool the squared (within-group) residuals. Note that the t-tests on
the individual beta_hats are equivalent to F-tests based on type III
(i.e. variable-added-last) SS.

9-3-4
=====

Leave until later.

9-4 Multiple Partial F test
===========================

This is one situation where you have to plan the setup carefully. Remember
that you may need to do this test if you represent a single categorical
variable by 2 or more indicator terms. Some programs, such as SAS PROC
REG, do not create indicator variables (one has to create them first);
others do. For example, in PROC GLM one can specify that a variable is a
"CLASS", i.e. categorical, variable. In INSIGHT, one can specify that a
variable is nominal (rather than interval).

Again, one doesn't need to get caught up in all the equivalent ways of
arriving at the numerator and denominator of the F test. The denominator
is, again, the Mean Square Residual from the larger model. The numerator
is again an extra regression SS, now due to the addition of the extra
variables, but DIVIDED by the extra degrees of freedom used to fit the
larger model. So the numerator is also an "average", or MEAN, square,
i.e.

                     MS (extra)
   F  =  -----------------------------------
          MS (residual from larger model)

as is implied by Equation 9.6. Don't bother with the other numerical
equivalents. And in fact, try, whenever possible, to have the computer do
all (or as many as possible) of the calculations for you. One way is to
set up the equation so that you type or click in the extra variables LAST,
and ask for the type I SS (that way, you don't have to fit the larger and
reduced models separately).

Say you want to test whether age and gender "add" anything to height in
predicting weight. Get the type I SS

   SS (Height)
   SS (Age | Height)
   SS (Gender | Height, Age)

The SUM of the second and third components is the "extra SS" you are
looking for. Divide it by 2 to get the Mean Square (extra) for the
numerator of the F test. The denominator is the MS (residual) from the
model with height, age and gender. (A small computational sketch of this
example follows.)
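Here is that calculation as a minimal sketch -- simulated, illustrative
data once more, in Python's statsmodels rather than the SAS procedures
named above. The two "extra" terms are age and gender, so the extra SS is
divided by 2.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # simulated, illustrative data
    rng = np.random.default_rng(3)
    n = 120
    df = pd.DataFrame({"height": rng.normal(170, 10, size=n),
                       "age": rng.integers(20, 70, size=n),
                       "gender": rng.choice(["F", "M"], size=n)})
    df["weight"] = (-40 + 0.6 * df["height"] + 0.1 * df["age"]
                    + 5.0 * (df["gender"] == "M")
                    + rng.normal(0, 6, size=n))

    reduced = smf.ols("weight ~ height", data=df).fit()
    # the formula interface turns the character variable gender into an
    # indicator term automatically (much like CLASS in PROC GLM)
    full = smf.ols("weight ~ height + age + gender", data=df).fit()

    # MS(extra) over MS(residual from the larger model); 2 extra df here
    ms_extra = (reduced.ssr - full.ssr) / 2
    print(ms_extra / full.mse_resid)

    # the program does the same comparison directly ...
    print(anova_lm(reduced, full))

    # ... and the type I (sequential) table supplies the pieces
    # SS(height), SS(age | height), SS(gender | height, age)
    print(anova_lm(full, typ=1))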
9-5 Strategies
==============

I hate to see these called "strategies", as though statistics were
"playing dice with the devil". The choice of type I (variables added in
order) or type III (variables added last) SS should depend on the purpose
and logic of the analysis. For example, is it that X1 and X2 are easily
obtained pieces of information, and X3 is an expensive new data item? Or
is one simply trying to get a simple predictive model -- regardless of any
"order" in the X's?

Centering (footnote 3, page 145)
================================

By considering product terms even at this early stage, the authors have
created some numerical problems for themselves, and are using centering as
a way to minimize them. The idea of centering the X variables is always a
good one -- even in simpler situations. See my earlier notes on hurricanes
this century, and the inaccuracies of "projecting ahead" from a far-away
intercept. (A small sketch of centering before forming a product term is
given at the very end of these notes.)

The important message from this section is the 2 non-equivalent ways of
displaying the regression SS: one (type I) where they are all properly
accounted for, and one (type III) where there is a certain amount of
"double-counting" or "under-counting".

9-6 Tests involving the intercept
=================================

Models without an intercept are tricky. The concept of R-squared does not
carry over easily. And there is usually little lost (except 1 df !!) if
one fits an intercept when one didn't need to. So my advice -- for now --
is to ignore "no intercept" models. We might come back to them for certain
analyses involving conditional logistic regression, Poisson regression,
and Cox regression for survival data.
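Finally, the centering sketch promised under footnote 3 above -- again a
purely illustrative, simulated example in the same statsmodels setup. It
shows that centering the X's before forming the product term moves the
intercept and "main effect" coefficients to a region where the data
actually live, while leaving the test for the product term itself
unchanged.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # simulated, illustrative data with X's far from zero
    rng = np.random.default_rng(4)
    n = 100
    df = pd.DataFrame({"x1": rng.normal(50, 5, size=n),
                       "x2": rng.normal(100, 10, size=n)})
    df["y"] = (3 + 0.2 * df["x1"] + 0.1 * df["x2"]
               + 0.01 * df["x1"] * df["x2"] + rng.normal(size=n))

    # uncentered: the intercept is a "projection" to x1 = x2 = 0, far from
    # the data, and x1, x2 and x1*x2 are highly correlated
    raw = smf.ols("y ~ x1 * x2", data=df).fit()

    # centered: same fitted values, but the intercept and "main effects"
    # now refer to the means of x1 and x2
    df["x1c"] = df["x1"] - df["x1"].mean()
    df["x2c"] = df["x2"] - df["x2"].mean()
    centered = smf.ols("y ~ x1c * x2c", data=df).fit()

    # the test for the product term is unaffected by the centering
    print(raw.pvalues["x1:x2"], centered.pvalues["x1c:x2c"])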