Notes on KKMN Chapter 9                                         1999.05.26

Preamble
========

The ideas in this Chapter are unique to Multiple Regression -- there is no
analogy with Simple Linear Regression. The topic is introduced in a very
clear way right on the first page of the chapter, and the statement "each
test can be interpreted as a comparison of two models" on p. 137 is key in
not getting lost in the details. I would skip section 9-3-4 at first
reading.

Chapter 7 (Multiple Regression - II) of the Neter text is good on this
topic too. It includes what -- to me at least -- is a very clear diagram
for explaining what it calls the "Extra Sums of Squares". I have included
this diagram elsewhere on the web page.

9-1 Preview
===========

Of the three types of tests, it is my experience that "2. Test for
addition of a single variable" is the most commonly used and the most
important. It is also the only one of the three that can be performed
using the familiar t-test -- the others require the F-test.

Note that when KKMN write about a "single" VARIABLE, they are really
referring to a "single" TERM in the equation. If a categorical variable
with 3 or more levels, such as blood-type or socio-economic status, is
represented by several indicator terms ("dummy variables"), then
situation 3 ("test for addition of a group of variables") applies. Thus,
one cannot get by just by learning how to deal with situation 2!

The unifying feature is the "larger model vs. smaller model" or "full
model vs. reduced model" idea introduced at the top of page 137. Indeed,
all three tests can be put in this common framework:

1. (B1 = B2 = B3 = ... = Bk = 0)
   ie. B0 + B1.X1 + B2.X2 + B3.X3   vs.   B0

2. (B2 = 0)
   ie. B0 + B1.X1 + B2.X2 + B3.X3   vs.   B0 + B1.X1 + B3.X3

3. (B1 = B3 = 0)
   ie. B0 + B1.X1 + B2.X2 + B3.X3   vs.   B0 + B2.X2

In 1, all beta's (B's) are being constrained; in 2, only one is
constrained; in 3, two parameters are constrained. (A small computational
sketch of these full vs. reduced comparisons appears at the end of this
section.)

One important point to note here is that the "smaller" model must be a
special case (subset) of the larger one. Thus, for example, the chapter
does not deal with a test of the two models

   B0 + B1.X1 + B3.X3   vs.   B0 + B2.X2 + B4.X4

Note also that the beta's are always context dependent, so that the B1 in
the model B0 + B1.X1 + B2.X2 does not have the same meaning as the B1 in
the model B0 + B1.X1.
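To make the "full vs. reduced" idea concrete, here is a minimal sketch --
not part of KKMN or of these notes' SAS-based examples -- written in
Python's statsmodels with simulated data; the variable names (y, x1, x2,
x3) are purely illustrative. Each of the three tests is simply a
comparison of a reduced model against the same full model.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # simulated, purely illustrative data (x3 contributes nothing to y)
    rng = np.random.default_rng(1)
    n = 100
    df = pd.DataFrame({"x1": rng.normal(size=n),
                       "x2": rng.normal(size=n),
                       "x3": rng.normal(size=n)})
    df["y"] = 2 + 1.5 * df["x1"] + 0.5 * df["x2"] + rng.normal(size=n)

    full = smf.ols("y ~ x1 + x2 + x3", data=df).fit()

    # 1. overall test (all betas = 0): reduced model has the intercept only
    print(anova_lm(smf.ols("y ~ 1", data=df).fit(), full))

    # 2. one term (B2 = 0): reduced model drops x2 only
    print(anova_lm(smf.ols("y ~ x1 + x3", data=df).fit(), full))

    # 3. a group of terms (B1 = B3 = 0): reduced model keeps x2 only
    print(anova_lm(smf.ols("y ~ x2", data=df).fit(), full))

In each case the program reports the difference in residual SS between
the two models, the degrees of freedom involved, and the resulting F and
p-value -- the same pieces KKMN assemble by hand.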
9-2 Test for Overall Regression
===============================

I consider this the least useful of the 3: we are seldom in the situation
where we have no predictors; the "global" nature of the alternative
hypothesis, namely that AT LEAST one of the beta's is nonzero, is not that
helpful; and indeed, if one's focus is on the contribution of one
particular variable, doing a global test is a poor way to assess it.
Indeed, I would venture to say that most of the interesting tests have to
do with a "clean" comparison involving just 1 variable.

Before going to the mechanical part, one word about semantics. KKMN state
the null hypothesis as "All k X's considered together do not explain a
SIGNIFICANT amount of the variation in Y". This use of the word
"significant" mixes metaphors, so to speak. We teach in 607 that
statistical hypotheses concern PARAMETER values. Thus, it is correct to
postulate that "all k betas equal 0" or "at least one beta is nonzero".
Such statements have nothing to do with data! But when we introduce the
word "significant", we switch over to the world of statistics (ie. numbers
calculated from empirical data, or samples). And STATISTICAL
"significance" simply refers to a statistic exceeding some threshold.

Thus, if one wants to be technically correct, the authors' "general"
statement of H0 is not accurate. Better that they stay with "all beta's
equal to zero", or

   sigma-squared (Y | X1, ..., Xk) = sigma-squared (Y),

or "X1, ..., Xk do not explain ANY of the variation in Y" (they may appear
to in a particular finite sample, but with n = infinity, the R-squared is
zero!).

Equation 9.1:  F = MS(Regression) / MS(Residual)
================================================

This F value is the very first tabulated item in most regression
printouts. And, as I've said, it is often the least interesting! Note that
the accompanying p-value refers to an "omni-directional" alternative
hypothesis. [In the "chi-square" statistic, LARGE values are evidence
against the null; large negative and positive differences square to give a
large value in the upper tail, i.e. a 2-SIDED hypothesis is judged using
just the 1 (upper) TAIL of the reference distribution.] Here we have the
same thing, but now with k betas at once: any departure from zero, on the
negative or positive side, of any one beta, will create a LARGER expected
F value.

Equation 9-2:  F in terms of R-squared
======================================

   F = [ R-squared / k ]  /  [ (1 - R-squared) / (n - k - 1) ]

I don't see why one would bother with this representation, unless one
likes to think of the numerator as the "R-squared per variable" and the
denominator as the "unexplained variance per remaining degrees of
freedom".

"In interpreting the results of this test ..."
==============================================

Note the use of the word "significantly" here again. Just like in 607,
where you learned to distinguish "statistical significance" from
real-world significance, you should get in the habit of making the same
distinctions here. It is easy to slip into a "shorthand" among ourselves,
but this wording may give the wrong impression to the lay public. The word
"significant", or "statistically significant", could be profitably
replaced by "non-zero", i.e. we conclude that X1, X2 and X3 are "better
than nothing" in predicting Y. This wording prompts the natural question
"HOW MUCH better?".

9-3 Partial F test
==================

Given that the computer can be used to get us exactly what we need for the
test, and that we don't have to go and reconstruct from elsewhere the
pieces we need, this presentation is a bit long-winded. The procedure is
neatly summarized in Equation 9.4:

                                SS (X* | X1, ..., Xp)
   F (X* | X1, ..., Xp)  =  --------------------------------
                             MS Residual (X1, ..., Xp, X*)

The numerator is the "extra sum of squares" due to the addition of X*.
(We will see later that it is actually a mean square as well: it has a
hidden divisor of "1".) The denominator is the mean square residual
(i.e. the "average" squared residual) in the larger model.

One shouldn't have to obtain the MS residual "indirectly" (line 9 of
p. 141). One can fit the larger model, with the variable of interest
entered last. That way, one gets the mean square residual directly, and
the extra sum of squares will be the SS in the last line above it [in both
the "type I SS" and "type III SS" versions of the partitioned ANOVA
table]. In fact, one won't have to do any calculations: the F ratio, and
the associated p-value, will be shown on the output.
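As a concrete (and again purely illustrative, simulated) sketch in the
same statsmodels setup -- not from KKMN, where the worked examples are in
SAS -- here is the partial F test for a single added variable X*, and its
connection to the t-test of section 9-3-3 below: the squared t statistic
for the added variable equals the partial F.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # simulated, illustrative data; x_star plays the role of X*
    rng = np.random.default_rng(2)
    n = 80
    df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
    df["x_star"] = 0.6 * df["x1"] + rng.normal(size=n)
    df["y"] = (1 + df["x1"] - 0.5 * df["x2"]
               + 0.8 * df["x_star"] + rng.normal(size=n))

    reduced = smf.ols("y ~ x1 + x2", data=df).fit()
    full = smf.ols("y ~ x1 + x2 + x_star", data=df).fit()

    # F(X* | X1, X2) = extra SS (with its hidden divisor of 1)
    #                  over the MS residual of the larger model
    extra_ss = reduced.ssr - full.ssr
    print((extra_ss / 1) / full.mse_resid)

    # the program gives the same F directly ...
    print(anova_lm(reduced, full))

    # ... and the t statistic for x_star in the larger model, squared,
    # is that same F
    print(full.tvalues["x_star"] ** 2)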
Type I and Type III SS
======================

The text doesn't go into them in any detail until section 9-5-1 (p. 146),
but it is important at this point to distinguish between the "type I" and
"type III" SS.

Type I SS is the "cleaner" one, in that it is an "orthogonal" partition of
the regression SS, i.e. the components add up exactly to the regression
SS. The order corresponds EXACTLY to the order in which YOU type or
"click" the variables into the regression model. So (usually) the type I
SS partition will be different for the two models below. The corresponding
partitions of the SS regression are

   Weight = Height Age        Weight = Age Height
   -------------------        -------------------
   SS (Height)                SS (Age)
   SS (Age | Height)          SS (Height | Age)

The diagram from Neter explains this well.

In contrast, the type III SS are not orthogonal partitions: i.e. they do
not usually add up to the regression SS (a useful way to check which type
you have when the partition is not clearly labelled); they may add up to
more or to less than the regression SS. The type III SS are independent of
the order in which you type or click the variables into the computer
program. The reason is that they consider a what-if scenario -- namely,
what if the variable in question were entered LAST in the list? So we will
get the same SS numbers whether we ask for the model Weight = Height Age
or Weight = Age Height. For the type III SS we will get

   Weight = Height Age        Weight = Age Height
   -------------------        -------------------
   SS (Height | Age)          SS (Age | Height)
   SS (Age | Height)          SS (Height | Age)

9-3-3 The t-test alternative to the partial F test
==================================================

This is in fact what most people use. The advantage is that the test is in
the "beta_hat divided by its standard error" scale, and that one sees
whether the sign of the beta_hat is negative or positive. The analogy with
607 is in using an F-test (with 1 and n1 + n2 - 2 df) rather than a t-test
to compare two sample means ybar_1 and ybar_0. Both give the same p-value
if you pool the squared (within-group) residuals. Note that the t-tests on
the individual beta_hats are equivalent to F-tests based on type III
(i.e. variable-added-last) SS.

9-3-4
=====

Leave until later.

9-4 Multiple Partial F test
===========================

This is one situation where you have to plan the setup carefully. Remember
that you may need to do this test if you represent a single categorical
variable by 2 or more indicator terms. Some programs, such as SAS PROC
REG, do not create indicator variables (one has to create them first);
others do. For example, in PROC GLM one can specify that a variable is a
"CLASS", i.e. categorical, variable. In INSIGHT, one can specify that a
variable is nominal (rather than interval).

Again, one doesn't need to get caught up in all the equivalent ways of
arriving at the numerator and denominator of the F test. The denominator
is, again, the Mean Square Residual from the larger model. The numerator
is again an extra regression SS, now due to the addition of the extra
variables, but DIVIDED by the extra degrees of freedom used to fit the
larger model. So the numerator is also an "average", or MEAN, square,
i.e.

                     MS (extra)
   F  =  -----------------------------------
          MS (residual from larger model)

as is implied by Equation 9.6. Don't bother with the other numerical
equivalents. And in fact, try, whenever possible, to have the computer do
all (or as many as possible) of the calculations for you. One way is to
set up the equation so that you type or click in the extra variables LAST,
and ask for the type I SS (that way, you don't have to fit the larger and
reduced models separately).

Say you want to test whether age and gender "add" anything to height in
predicting weight. Get the type I SS

   SS (Height)
   SS (Age | Height)
   SS (Gender | Height, Age)

The SUM of the second and third components is the "extra SS" you are
looking for. Divide it by 2 to get the Mean Square (extra) for the
numerator of the F test. The denominator is the MS (residual) from the
model with height, age and gender. (A small computational sketch of this
example follows.)
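Here is that calculation as a minimal sketch -- simulated, illustrative
data once more, in Python's statsmodels rather than the SAS procedures
named above. The two "extra" terms are age and gender, so the extra SS is
divided by 2.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.anova import anova_lm

    # simulated, illustrative data
    rng = np.random.default_rng(3)
    n = 120
    df = pd.DataFrame({"height": rng.normal(170, 10, size=n),
                       "age": rng.integers(20, 70, size=n),
                       "gender": rng.choice(["F", "M"], size=n)})
    df["weight"] = (-40 + 0.6 * df["height"] + 0.1 * df["age"]
                    + 5.0 * (df["gender"] == "M")
                    + rng.normal(0, 6, size=n))

    reduced = smf.ols("weight ~ height", data=df).fit()
    # the formula interface turns the character variable gender into an
    # indicator term automatically (much like CLASS in PROC GLM)
    full = smf.ols("weight ~ height + age + gender", data=df).fit()

    # MS(extra) over MS(residual from the larger model); 2 extra df here
    ms_extra = (reduced.ssr - full.ssr) / 2
    print(ms_extra / full.mse_resid)

    # the program does the same comparison directly ...
    print(anova_lm(reduced, full))

    # ... and the type I (sequential) table supplies the pieces
    # SS(height), SS(age | height), SS(gender | height, age)
    print(anova_lm(full, typ=1))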
9-5 Strategies
==============

I hate to see these called "strategies", as though statistics were
"playing dice with the devil". The choice of type I (variables added in
order) or type III (variables added last) SS should depend on the purpose
and logic of the analysis. For example, is it that X1 and X2 are easily
obtained pieces of information, and X3 is an expensive new data item? Or
is one simply trying to get a simple predictive model -- regardless of any
"order" in the X's?

Centering (footnote 3, page 145)
================================

By considering product terms even at this early stage, the authors have
created some numerical problems for themselves, and are using centering as
a way to minimize them. The idea of centering the X variables is always a
good one -- even in simpler situations. See my earlier notes on hurricanes
this century, and the inaccuracies of "projecting ahead" from a far-away
intercept. (A small sketch of centering before forming a product term is
given at the very end of these notes.)

The important message from this section is the 2 non-equivalent ways of
displaying the regression SS: one (type I) where they are all properly
accounted for, and one (type III) where there is a certain amount of
"double-counting" or "under-counting".

9-6 Tests involving the intercept
=================================

Models without an intercept are tricky. The concept of R-squared does not
carry over easily. And there is usually little lost (except 1 df !!) if
one fits an intercept when one didn't need to. So my advice -- for now --
is to ignore "no intercept" models. We might come back to them for certain
analyses involving conditional logistic regression, Poisson regression,
and Cox regression for survival data.
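Finally, the centering sketch promised under footnote 3 above -- again a
purely illustrative, simulated example in the same statsmodels setup. It
shows that centering the X's before forming the product term moves the
intercept and "main effect" coefficients to a region where the data
actually live, while leaving the test for the product term itself
unchanged.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # simulated, illustrative data with X's far from zero
    rng = np.random.default_rng(4)
    n = 100
    df = pd.DataFrame({"x1": rng.normal(50, 5, size=n),
                       "x2": rng.normal(100, 10, size=n)})
    df["y"] = (3 + 0.2 * df["x1"] + 0.1 * df["x2"]
               + 0.01 * df["x1"] * df["x2"] + rng.normal(size=n))

    # uncentered: the intercept is a "projection" to x1 = x2 = 0, far from
    # the data, and x1, x2 and x1*x2 are highly correlated
    raw = smf.ols("y ~ x1 * x2", data=df).fit()

    # centered: same fitted values, but the intercept and "main effects"
    # now refer to the means of x1 and x2
    df["x1c"] = df["x1"] - df["x1"].mean()
    df["x2c"] = df["x2"] - df["x2"].mean()
    centered = smf.ols("y ~ x1c * x2c", data=df).fit()

    # the test for the product term is unaffected by the centering
    print(raw.pvalues["x1:x2"], centered.pvalues["x1c:x2c"])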