Notes on KKMN Chapter 7 (The Analysis-of-Variance Table)

7-1 PREVIEW (WHY AN ANOVA TABLE?)
=================================

This use of the ANOVA table for regression goes beyond the "classical" ANOVA, where we think of the units of observation classified (or cross-classified) by the levels of one (or more) factors. The first thing that researchers in the (usually experimental) classical situation do is to compute and display the mean Y and SD(Y) within each "cell" in the 1- or multi-way "grid". To me, this grid of means SUMMARIZES the results of the study. See for example the rightmost columns of Table 17-7 (page 445) or the 9 cell means in Table 19-6 (page 524).

From these within-cell means and SDs, one then forms the ANOVA table (e.g. Table 17-8, p. 445, and Table 19-7 and printout, pp. 525-526). The ANOVA tables represent a SECOND LEVEL OF SUMMARIZATION, since the variation in the means (e.g. the 4 means in Table 17-7) is rolled into a SINGLE number (the Mean Square, 83.29). In the 2-way layout of means in the example from page 524, the variation among these 9 means is summarized into 3 mean squares: one for the 3 row means, one for the 3 column means, and one for the "non-additivity" of the row and column "effects".

If a factor has more than 2 levels, it is impossible to recreate the differences in the corresponding means if all one is given is the Mean Square for the factor. Strictly speaking, even if the factor has just two levels, one cannot recreate the two means from the mean square. One can tell how much larger one of the means is than the other, but one cannot tell which of the 2 means is the larger one!

One can think of regression as Y means indexed by one (or more) QUANTITATIVE X variables [the X's in classical ANOVA are categorical]. For the reasons just discussed, I prefer, whenever possible, to think of and measure the differences in means as slopes, rather than as squares.
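The point that a mean square cannot tell you which mean is larger can be illustrated with a toy calculation. This is a minimal pure-Python sketch with invented numbers (not from KKMN): two 2-group data sets whose group means are swapped give exactly the same between-group mean square.

```python
def between_group_ms(groups):
    """Between-group mean square for equal-size groups (classical one-way ANOVA)."""
    all_vals = [y for g in groups for y in g]
    grand_mean = sum(all_vals) / len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    return ss_between / (len(groups) - 1)

a = [[8, 12], [13, 15]]   # group means 10 and 14
b = [[13, 15], [8, 12]]   # same two means, swapped: 14 and 10
print(between_group_ms(a), between_group_ms(b))  # identical: the MS keeps the size
                                                 # of the gap but not its direction
```

The mean square records only the squared deviations of the group means from the grand mean, so the sign (which group is higher) is gone.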
When we are in a single-quantitative-X situation, one doesn't gain anything by working with the mean square for the model. KKMN explain, in the middle of page 106, that the (two-sided) t-test based on the slope and the F test based on the mean square give exactly the same result. The t is actually more informative, since the SIGN of the slope is lost in the mean square. Likewise, when we are in a situation where we have already used 1 or more X's in a model, and we now want to assess the effect of adding one additional quantitative X, we don't gain anything by going the "square" route.

So, from a statistical TESTING viewpoint, the situation where one MUST work with squares (the ANOVA table) is when one wishes to test whether 2 or more new X's TOGETHER add significantly to whatever model we already have [this model we already have might have several other X's, or it might be the "null" model with no X's at all].

Our focus may not be testing, but trying to quantify HOW MUCH of a contribution a variable makes. In such situations, we may not be comfortable with simply using the (net) slope associated with this new variable, especially if we are not very familiar with the scale on which the Y and/or the X in question is measured. Here one can take a scale-independent route, using partial and multiple partial correlations (cf. Chapter 10). Unfortunately, most researchers think in terms of how much of the remaining VARIANCE, not "explained" by the variables already in the equation, is explained by 1 (or simultaneously several) new X variable(s). For variance, one must use the "squares" approach.

Technically speaking, the sentence "the basic information in an ANOVA table consists of several estimates of variance" is inaccurate.
For example, with the correct simple regression model, the MSE is indeed an estimate of the error variance, while the MS for the regression (model) is an estimate of a combination of the error variance and the square of the true slope, beta_1. (The authors explain this later, in the 1st full paragraph of p. 106.)

7-2 "FUNDAMENTAL EQUATION" 7.1
==============================

This is indeed fundamental. All textbooks explain, or at least motivate, equation 7.1 by a diagram such as Figure 7-1. But many, such as KKMN, don't prove this relation, preferring instead to use words such as "it turns out that" the equation is true. It is not that difficult to prove from first principles. For the curious, here is one way.

  Y(i) - Ybar = [Y(i) - Yhat(i)] + [Yhat(i) - Ybar]   (this is what Fig 7-1 shows)
              = e(i) + D(i)

Here, I am using D(i) for the "systematic" part of Y(i) - Ybar, and e(i) for the residual part. Now, square each side and drop the subscript (i) to keep it simple (I will use e^2 for e-squared):

  (Y - Ybar)^2 = e^2 + D^2 + 2De ,

so

  Sum{(Y - Ybar)^2} = Sum{e^2} + Sum{D^2} + 2 Sum{De}.

Equation 7-1 has only the first two of the three sums, implying that the third one, Sum{De}, is zero. Now,

  D = Yhat - Ybar = [Ybar + b1(X - Xbar)] - Ybar = b1(X - Xbar) = b1.X - b1.Xbar

So, since b1 and Xbar are the same for each observation,

  Sum{De} = b1.Sum{Xe} - b1.Xbar.Sum{e}

The two sums on the right-hand side are zero, since the two estimating equations for b0 and b1 are Sum{e} = 0 and Sum{Xe} = 0. See the earlier exercise, in Chapter 5, on figuring out the missing Y's or e's from the n - 2 independent residuals. Again, it's a question of constraints.

THE BASIS FOR THE F TEST [1st Paragraph, p. 106]
================================================

For those familiar with the classical ANOVA, this is similar to the basis for the F test in the classical situation.
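The basis for the F test can also be previewed by simulation. The sketch below uses invented numbers (a mean of 120 and an SD of 5 standing in for "30 blood pressures"; pure Python, not the book's data): regress Y's that have NO relation with X on a meaningless "phone-number digit" X, over and over with the same X's, and the average SS(regression) comes out near sigma^2.

```python
import random

random.seed(1)                        # reproducible sketch
n, mu, sigma = 30, 120.0, 5.0         # invented: 30 "blood pressures", sd 5
x = [random.randrange(10) for _ in range(n)]   # a meaningless digit as X
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

ssr_values = []
for _ in range(10000):                # repeated samples, same X's each time
    y = [mu + random.gauss(0, sigma) for _ in range(n)]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    ssr_values.append(b1 ** 2 * sxx)  # Sum{(Yhat - Ybar)^2} = b1^2 * Sum{(X - Xbar)^2}

avg_ssr = sum(ssr_values) / len(ssr_values)
print(avg_ssr)                        # should be close to sigma^2 = 25
```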
IF YOU WISH, YOU CAN SKIP TO THE NEXT HEADING
=============================================

One way to think of the theoretical basis for it is to think of the results one might get if one regressed the 30 blood pressures in Table 5-1, not on the persons' ages, but on, say, X = the second-last digit of their phone numbers or social security numbers, or X = which day of the month they were born on. The slope, b1, would not be EXACTLY zero. If one did this in different samples of 30 having the same X values {X1, X2, ..., X30}, how much would b1 fluctuate around zero? And how big would the mean square regression term be?

If b1 is nonzero, this regression mean square will be positive. The mean square has the form Sum{(Yhat - Ybar)^2}, where Yhat - Ybar is the "amplitude" of each fitted value. Now a particular Yhat - Ybar is simply b1(X - Xbar); over repeated samples of size n, with the X's the same each time, the amplitude has an expected value of zero, and its square has an expected value of (X - Xbar)^2 times the variance of b1. The third equation on page 53 gives, without proof, the SD of b1. Squaring this standard deviation, we get

  variance of b1 = sigma^2 / Sum{(X - Xbar)^2},

so the expected value of the square of the amplitude is

  (X - Xbar)^2 sigma^2 / Sum{(X - Xbar)^2}.

Thus, the expected value of the sum of the squared amplitudes is

  Sum{(X - Xbar)^2} sigma^2 / Sum{(X - Xbar)^2}.

Cancelling the top and bottom, we get that -- if there is no underlying relation, so that beta_1 is zero -- the expected value of the regression sum of squares is just sigma^2.

ANOTHER WAY OF THINKING ABOUT r^2
=================================

At the beginning of section 7-2, KKMN write it as (SSY - SSE)/SSY. But later, on page 105, they give SSY - SSE the name "Sum of squares due to regression", or what they will, in later chapters, abbreviate to "SS(regression)".
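Equation 7.1 itself, and the Sum{De} = 0 step in the proof given earlier, need not be taken on faith; a quick numeric check on arbitrary made-up data (not KKMN's) confirms both, and hence that the subtraction SSY - SSE and the direct sum Sum{(Yhat - Ybar)^2} are the same number.

```python
# Made-up data; least-squares slope and intercept computed by hand.
x = [2, 4, 5, 7, 9]
y = [3.0, 4.5, 4.9, 7.2, 8.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

D = [yh - ybar for yh in yhat]              # systematic parts, Yhat - Ybar
e = [yi - yh for yi, yh in zip(y, yhat)]    # residuals, Y - Yhat
ssy = sum((yi - ybar) ** 2 for yi in y)

cross = sum(d * ei for d, ei in zip(D, e))  # Sum{D e}
print(cross)                                # essentially zero
print(ssy, sum(d * d for d in D) + sum(ei * ei for ei in e))  # equation 7.1: equal
```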
As they illustrate by equation 7.1, one doesn't have to think of the SS(regression) as having to be obtained by subtraction (like when we used to say non-A-non-B hepatitis!). To use their words, SS(R) "turns out" to be Sum{(Yhat - Ybar)^2}. Think of it as the sum of the squared amplitudes of the fitted points (the Yhat's). Likewise, SSY = Sum{(Y - Ybar)^2}, i.e. the sum of the squared "amplitudes" of the observed Y values. So,

         SS(amplitudes of the Yhat's)
  r^2 =  ----------------------------
         SS(amplitudes of the Y's)

         average squared Yhat amplitude
      =  ------------------------------
         average squared Y amplitude

Thus, if the line is a perfect fit, the 2 sets of amplitudes will be identical, and the ratio will be 1. On the other hand, if the line is very "shallow" relative to the observed data, the fitted points (Yhat's) will have a small amplitude relative to the amplitude of the observed data, and so the ratio (the r^2) will be small. For some readers, this is a more positive way of putting r^2, i.e. focussing on what X DOES explain, rather than on saying 1 - r^2 is the proportion of variance in Y that X DOES NOT explain.

ALTERNATIVE REPRESENTATION OF ANOVA TABLE (p. 107)
==================================================

This alternative is seldom helpful, unless one is genuinely interested in the AVERAGE level, per se, of Y.

FITTING THE NULL MODEL AS A POINT OF DEPARTURE
==============================================

Most software packages can fit the "null" regression model E(Y|X) = beta_0. In this case, the beta_0 estimate is nothing more than Ybar! However, when you add X's, the packages, by default, assume that the interest is not in Ybar, or Y levels per se, but in the DEVIATIONS from Ybar. That's why they immediately take Ybar out of the ANOVA table, and start to account for (partition) the variation of the n - 1 independent deviations ("residuals") from Ybar. In SAS, one can fit the null model in PROC REG or PROC GLM by writing "MODEL Y =;".
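The amplitude picture of r^2 above can be made concrete with a small pure-Python sketch (invented data; the helper name is mine, not the book's): points exactly on a line give a ratio of exactly 1, while near-pure scatter makes the fitted amplitudes tiny relative to the observed ones. In the null model, every Yhat is just Ybar, so all the fitted amplitudes -- and r^2 -- are 0.

```python
def r2_from_amplitudes(x, y):
    """r^2 as (sum of squared Yhat amplitudes) / (sum of squared Y amplitudes)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar
    yhat = [b0 + b1 * xi for xi in x]
    return sum((yh - ybar) ** 2 for yh in yhat) / sum((yi - ybar) ** 2 for yi in y)

x = [1, 2, 3, 4, 5]
perfect = [2.0, 4.0, 6.0, 8.0, 10.0]   # points exactly on a line
noisy = [6.0, 2.0, 9.0, 3.0, 7.0]      # almost pure scatter
print(r2_from_amplitudes(x, perfect))  # 1.0
print(r2_from_amplitudes(x, noisy))    # small: the fitted line is "shallow"
```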
In INSIGHT, in the "FIT Y|X" dialogue box, just specify Y and click "OK". As mentioned, the intercept is none other than Ybar; and its standard error, and the t-test of whether mu(Y) = 0, are the same as those you learned for the 1-sample t test in the 607 course!

PROBLEMS
========

1. (a) If one hadn't been given the two sufficient pieces of information supplied, and wanted to do the work using only what appears on p. 61 and p. 62, on a calculator that did "one-variable statistics", one way would be to (i) get SSY using the Y's on p. 61; (ii) get SS(Yhat) using the Yhat's on p. 62; (iii) get SS(error) by subtraction. Another, even shorter, way would be to apply the R-square of 0.7442 to the SSY, to get (ii).

11. (a) The authors make this too easy! They could have blanked out the 0.36618 and had the reader figure it out from information given elsewhere in the output. Indeed, they could have removed the R-square too, and one could still reconstruct the table. How?

They could also have blanked out the intercept of -0.4624 and asked the reader to reconstruct it. Show how. (Hint: use the 2.045.) Rewrite the fitted equation using the intercept corresponding to X = 40.

This book, by American authors, does not indicate whether the temperatures in exercise 11, p. 109, are in "degrees American" (F) or "degrees International" (C). Assume they are in F, and convert the equation so that X (temperature) is in degrees C.

----

11. Suppose there were but 1 cell line cultured at each temperature. One researcher might be tempted to just "join the 4 dots" in order to interpolate the expected growth for, say, X = 70. Give some reasons why this might not be the best way, and why fitting a single straight line through all 4 data points might be better. In which situation might an interpolation using only the data from the 2 neighbouring temperatures (60 and 80) make more sense? What is the tradeoff between the two approaches?
Another researcher, who sees the polynomial-fitting option in the plot output by the FIT (Y|X) option in INSIGHT, uses it to fit a 3rd-degree polynomial to interpolate. As you will see if you try it, it can fit the 4 points perfectly. What are the drawbacks of using as flexible a polynomial as the data allow? Hint: look at the phrase, starting on the 1st line of p. 109, "removing a small sample ... to estimate ...".
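To see the exact-fit behaviour concretely, here is a sketch with invented growth numbers (the Lagrange form of the interpolating polynomial, as a pure-Python stand-in for INSIGHT's option): the cubic reproduces all 4 points perfectly, leaving no residual degrees of freedom to estimate error, and its interpolated value is driven entirely by whatever noise is in those 4 points.

```python
def lagrange(xs, ys, x):
    """Evaluate the unique degree-(n-1) polynomial through (xs, ys) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

xs = [40, 60, 80, 100]            # the 4 culture temperatures
ys = [1.0, 2.2, 2.9, 8.0]         # made-up growth measurements

# The cubic hits every data point exactly ...
print([round(lagrange(xs, ys, xi), 6) for xi in xs])
# ... but its interpolation at X = 70 differs from "join the 4 dots"
print(lagrange(xs, ys, 70), (ys[1] + ys[2]) / 2)  # ~2.306 vs 2.55
```

Because the fit is perfect, SSE = 0 and there is no MSE: the data that would have been "removed ... to estimate" error have all been spent on coefficients.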