Notes on KKMN Chapter 7 (The Analysis-of-Variance Table)

7-1 PREVIEW (WHY AN ANOVA TABLE?)
=================================

This use of the ANOVA table for regression goes beyond the "classical" ANOVA, where we think of the units of observation classified (or cross-classified) by the levels of one (or more) factors. The first thing that researchers in the (usually experimental) classical situation do is to compute and display the mean Y and SD(Y) within each "cell" in the 1- or multi-way "grid". To me, this grid of means SUMMARIZES the results of the study. See for example the rightmost columns of Table 17-7 (page 445) or the 9 cell means in Table 19-6 (page 524).

From these within-cell means and SDs, one then forms the ANOVA table (e.g. Table 17-8, p. 445, and Table 19-7 and printout, pp. 525-526). The ANOVA tables represent a SECOND LEVEL OF SUMMARIZATION, since the variation in the means (e.g. the 4 means in Table 17-7) is rolled into a SINGLE number (the Mean Square, 83.29). In the 2-way layout of means in the example from page 524, the variation among these 9 means is summarized into 3 mean squares: one for the 3 row means, one for the 3 column means, and one for the "non-additivity" of the row and column "effects".

If a factor has more than 2 levels, it is impossible to recreate the differences in the corresponding means if all one is given is the Mean Square for the factor. Strictly speaking, even if the factor has just two levels, one cannot recreate the two means from the mean square. One can tell how much larger one of the means is than the other, but one cannot tell which of the 2 means is the larger one!

One can think of regression as Y means indexed by one (or more) QUANTITATIVE X variables [the X's in classical ANOVA are categorical]. For the reasons just discussed, I prefer, whenever possible, to think of and measure the differences in means as slopes, rather than as squares.
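The point that a mean square cannot tell you which mean is larger can be illustrated with a toy calculation. This is a minimal pure-Python sketch with invented numbers (not from KKMN): two 2-group data sets whose group means are swapped give exactly the same between-group mean square.

```python
def between_group_ms(groups):
    """Between-group mean square for equal-size groups (classical one-way ANOVA)."""
    all_vals = [y for g in groups for y in g]
    grand_mean = sum(all_vals) / len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / (len(groups) - 1)

a = [[8, 12], [13, 15]]   # group means 10 and 14
b = [[13, 15], [8, 12]]   # same two means, swapped: 14 and 10
print(between_group_ms(a), between_group_ms(b))  # identical: the MS keeps the size
                                                 # of the gap but not its direction
```

The mean square records only the squared deviations of the group means from the grand mean, so the sign (which group is higher) is gone.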
When we are in a single-quantitative-X situation, one doesn't gain anything by working with the mean square for the model. KKMN explain, in the middle of page 106, that the (two-sided) t-test based on the slope and the F test based on the mean square give exactly the same result. The t is actually more informative, since the SIGN of the slope is lost in the mean square. Likewise, when we are in a situation where we have already used 1 or more X's in a model, and we now want to assess the effect of adding one additional quantitative X, we don't gain anything by going the "square" route.

So, from a statistical TESTING viewpoint, the situation where one MUST work with squares (the ANOVA table) is when one wishes to test whether 2 or more new X's TOGETHER add significantly to whatever model we already have [this model we already have might have several other X's, or it might be the "null" model with no X's at all].

Our focus may not be testing, but trying to quantify HOW MUCH of a contribution a variable makes. In such situations, we may not be comfortable with simply using the (net) slope associated with this new variable, especially if we are not very familiar with the scale on which the Y and/or the X in question is measured. Here one can take a scale-independent route, using partial and multiple partial correlations (cf. Chapter 10). Unfortunately, most researchers think in terms of how much of the remaining VARIANCE, not "explained" by the variables already in the equation, is explained by 1 (or simultaneously several) new X variable(s). For variance, one must use the "squares" approach.

Technically speaking, the sentence "the basic information in an ANOVA table consists of several estimates of variance" is inaccurate.
For example, with the correct simple regression model, the MSE is indeed an estimate of the error variance, while the MS for the regression (model) is an estimate of a combination of the error variance and the square of the true slope, beta_1. (The authors explain this later, in the 1st full paragraph of p. 106.)

7-2 "FUNDAMENTAL EQUATION" 7.1
==============================

This is indeed fundamental. All textbooks explain, or at least motivate, equation 7.1 by a diagram such as Figure 7-1. But many, such as KKMN, don't prove this relation, preferring instead to use words such as "it turns out that" the equation is true. It is not that difficult to prove from first principles. For the curious, here is one way.

  Y(i) - Ybar = [Y(i) - Yhat(i)] + [Yhat(i) - Ybar]   (this is what Fig 7-1 shows)
              = e(i) + D(i)

Here, I am using D(i) for the "systematic" part of Y(i) - Ybar, and e(i) for the residual part. Now, square each side and drop the subscript (i) to keep it simple (I will use e^2 for e-squared):

  (Y - Ybar)^2 = e^2 + D^2 + 2De ,

so

  Sum{(Y - Ybar)^2} = Sum{e^2} + Sum{D^2} + 2 Sum{De}.

Equation 7-1 has only the first two of the three sums, implying that the third one, Sum{De}, is zero. Now,

  D = Yhat - Ybar = [Ybar + b1(X - Xbar)] - Ybar = b1(X - Xbar) = b1.X - b1.Xbar

So, since b1 and Xbar are the same for each observation,

  Sum{De} = b1.Sum{Xe} - b1.Xbar.Sum{e}

The two sums on the right-hand side are zero, since the two estimating equations for b0 and b1 are Sum{e} = 0 and Sum{Xe} = 0. See the earlier exercise, in Chapter 5, on figuring out the missing Y's or e's from the n - 2 independent residuals. Again, it's a question of constraints.

THE BASIS FOR THE F TEST [1st Paragraph, p. 106]
================================================

For those familiar with the classical ANOVA, this is similar to the basis for the F test in the classical situation.
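The basis for the F test can also be previewed by simulation. The sketch below uses invented numbers (a mean of 120 and an SD of 5 standing in for "30 blood pressures"; pure Python, not the book's data): regress Y's that have NO relation with X on a meaningless "phone-number digit" X, over and over with the same X's, and the average SS(regression) comes out near sigma^2.

```python
import random

random.seed(1)                        # reproducible sketch
n, mu, sigma = 30, 120.0, 5.0         # invented: 30 "blood pressures", sd 5
x = [random.randrange(10) for _ in range(n)]   # a meaningless digit as X
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

ssr_values = []
for _ in range(10000):                # repeated samples, same X's each time
    y = [mu + random.gauss(0, sigma) for _ in range(n)]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    ssr_values.append(b1 ** 2 * sxx)  # Sum{(Yhat - Ybar)^2} = b1^2 * Sum{(X - Xbar)^2}

avg_ssr = sum(ssr_values) / len(ssr_values)
print(avg_ssr)                        # should be close to sigma^2 = 25
```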
IF YOU WISH, YOU CAN SKIP TO THE NEXT HEADING
=============================================

One way to think of the theoretical basis for it is to think of the results one might get if one regressed the 30 blood pressures in Table 5-1, not on the persons' ages, but on, say, X = the second-last digit of their phone numbers or social security numbers, or X = which day of the month they were born on. The slope, b1, would not be EXACTLY zero. If one did this in different samples of 30 having the same X values {X1, X2, ..., X30}, how much would b1 fluctuate around zero? And how big would the mean square regression term be?

If b1 is nonzero, this regression mean square will be positive. The mean square has the form Sum{(Yhat - Ybar)^2}, where Yhat - Ybar is the "amplitude" of each fitted value. Now a particular Yhat - Ybar is simply b1(X - Xbar); over repeated samples of size n, with the X's the same each time, the amplitude has an expected value of zero, and its square has an expected value of (X - Xbar)^2 times the variance of b1. The third equation on page 53 gives, without proof, the SD of b1. Squaring this standard deviation, we get

  variance of b1 = sigma^2 / Sum{(X - Xbar)^2},

so the expected value of the square of the amplitude is

  (X - Xbar)^2 sigma^2 / Sum{(X - Xbar)^2}.

Thus, the expected value of the sum of the squared amplitudes is

  Sum{(X - Xbar)^2} sigma^2 / Sum{(X - Xbar)^2}.

Cancelling the top and bottom, we get that -- if there is no underlying relation, so that beta_1 is zero -- the expected value of the regression sum of squares is just sigma^2.

ANOTHER WAY OF THINKING ABOUT r^2
=================================

At the beginning of section 7-2, KKMN write it as (SSY - SSE)/SSY. But later, on page 105, they give SSY - SSE the name "Sum of squares due to regression", or what they will, in later chapters, abbreviate to "SS(regression)".
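Equation 7.1 itself, and the Sum{De} = 0 step in the proof given earlier, need not be taken on faith; a quick numeric check on arbitrary made-up data (not KKMN's) confirms both, and hence that the subtraction SSY - SSE and the direct sum Sum{(Yhat - Ybar)^2} are the same number.

```python
# Made-up data; least-squares slope and intercept computed by hand.
x = [2, 4, 5, 7, 9]
y = [3.0, 4.5, 4.9, 7.2, 8.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

D = [yh - ybar for yh in yhat]              # systematic parts, Yhat - Ybar
e = [yi - yh for yi, yh in zip(y, yhat)]    # residuals, Y - Yhat
ssy = sum((yi - ybar) ** 2 for yi in y)

cross = sum(d * ei for d, ei in zip(D, e))  # Sum{D e}
print(cross)                                # essentially zero
print(ssy, sum(d * d for d in D) + sum(ei * ei for ei in e))  # equation 7.1: equal
```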
As they illustrate by equation 7.1, one doesn't have to think of the SS(regression) as having to be obtained by subtraction (like when we used to say non-A-non-B hepatitis!). To use their words, SS(R) "turns out" to be Sum{(Yhat - Ybar)^2}. Think of it as the sum of the squared amplitudes of the fitted points (the Yhat's). Likewise, SSY = Sum{(Y - Ybar)^2}, i.e. the sum of the squared "amplitudes" of the observed Y values. So,

         SS(amplitudes of the Yhat's)
  r^2 =  ----------------------------
         SS(amplitudes of the Y's)

         average squared Yhat amplitude
      =  ------------------------------
         average squared Y amplitude

Thus, if the line is a perfect fit, the 2 sets of amplitudes will be identical, and the ratio will be 1. On the other hand, if the line is very "shallow" relative to the observed data, the fitted points (Yhat's) will have a small amplitude relative to the amplitude of the observed data, and so the ratio (the r^2) will be small. For some readers, this is a more positive way of putting r^2, i.e. focussing on what X DOES explain, rather than on saying 1 - r^2 is the proportion of variance in Y that X DOES NOT explain.

ALTERNATIVE REPRESENTATION OF ANOVA TABLE (p. 107)
==================================================

This alternative is seldom helpful, unless one is genuinely interested in the AVERAGE level, per se, of Y.

FITTING THE NULL MODEL AS A POINT OF DEPARTURE
==============================================

Most software packages can fit the "null" regression model E(Y|X) = beta_0. In this case, the beta_0 estimate is nothing more than Ybar! However, when you add X's, the packages, by default, assume that the interest is not in Ybar, or Y levels per se, but in the DEVIATIONS from Ybar. That's why they immediately take Ybar out of the ANOVA table, and start to account for (partition) the variation of the n - 1 independent deviations ("residuals") from Ybar. In SAS, one can fit the null model in PROC REG or PROC GLM by writing "MODEL Y =;".
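The amplitude picture of r^2 above can be made concrete with a small pure-Python sketch (invented data; the helper name is mine, not the book's): points exactly on a line give a ratio of exactly 1, while near-pure scatter makes the fitted amplitudes tiny relative to the observed ones. In the null model, every Yhat is just Ybar, so all the fitted amplitudes -- and r^2 -- are 0.

```python
def r2_from_amplitudes(x, y):
    """r^2 as (sum of squared Yhat amplitudes) / (sum of squared Y amplitudes)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]
    return sum((yh - ybar) ** 2 for yh in yhat) / sum((yi - ybar) ** 2 for yi in y)

x = [1, 2, 3, 4, 5]
perfect = [2.0, 4.0, 6.0, 8.0, 10.0]   # points exactly on a line
noisy = [6.0, 2.0, 9.0, 3.0, 7.0]      # almost pure scatter
print(r2_from_amplitudes(x, perfect))  # 1.0
print(r2_from_amplitudes(x, noisy))    # small: the fitted line is "shallow"
```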
In INSIGHT, in the "FIT Y|X" dialogue box, just specify Y and click "OK". As mentioned, the intercept is none other than Ybar; and its standard error, and the t-test of whether mu(Y) = 0, are the same as those you learned for the 1-sample t test in the 607 course!

PROBLEMS
========

1. (a) If one hadn't been given the two sufficient pieces of information supplied, and wanted to do the work using only what appears on p. 61 and p. 62, on a calculator that did "one-variable statistics", one way would be to (i) get SSY using the Y's on p. 61; (ii) get SS(Yhat) using the Yhat's on p. 62; (iii) get SS(error) by subtraction. Another, even shorter, way would be to apply the R-square of 0.7442 to the SSY, to get (ii).

11. (a) The authors make this too easy! They could have blanked out the 0.36618 and had the reader figure it out from information given elsewhere in the output. Indeed, they could have removed the R-square too, and one could still reconstruct the table. How?

They could also have blanked out the intercept of -0.4624 and asked the reader to reconstruct it. Show how. (Hint: use the 2.045.) Rewrite the fitted equation using the intercept corresponding to X = 40.

This book, by American authors, does not indicate whether the temperatures in exercise 11, p. 109, are in "degrees American" (F) or "degrees International" (C). Assume they are in F, and convert the equation so that X (temperature) is in degrees C.

----

11. Suppose there were but 1 cell line cultured at each temperature. One researcher might be tempted to just "join the 4 dots" in order to interpolate the expected growth for, say, X = 70. Give some reasons why this might not be the best way, and why fitting a single straight line through all 4 data points might be better. In which situation might an interpolation using only the data from the 2 neighbouring temperatures (60 and 80) make more sense? What is the tradeoff between the two approaches?
Another researcher, who sees the polynomial-fitting option in the plot output by the FIT (Y|X) option in INSIGHT, uses it to fit a 3rd-degree polynomial to interpolate. As you will see if you try it, it can fit the 4 points perfectly. What are the drawbacks of using as flexible a polynomial as the data allow? Hint: look at the phrase, starting on the 1st line of p. 109, "removing a small sample ... to estimate ...".
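To see the exact-fit behaviour concretely, here is a sketch with invented growth numbers (the Lagrange form of the interpolating polynomial, as a pure-Python stand-in for INSIGHT's option): the cubic reproduces all 4 points perfectly, leaving no residual degrees of freedom to estimate error, and its interpolated value is driven entirely by whatever noise is in those 4 points.

```python
def lagrange(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [40, 60, 80, 100]            # the 4 culture temperatures
ys = [1.0, 2.2, 2.9, 8.0]         # made-up growth measurements

# The cubic hits every data point exactly ...
print([round(lagrange(xs, ys, xi), 6) for xi in xs])
# ... but its interpolation at X = 70 differs from "join the 4 dots"
print(lagrange(xs, ys, 70), (ys[1] + ys[2]) / 2)  # ~2.306 vs 2.55
```

Because the fit is perfect, SSE = 0 and there is no MSE: the data that would have been "removed ... to estimate" error have all been spent on coefficients.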