Notes on KKMN Chapter 8

See:
- Moore and McCabe (IPS), Chapter 9.2 (2nd Ed) or Chapter 11 (3rd Ed)
- my notes from 607 (on the web page)
- the book Statistical Methods for Comparative Studies by Anderson S, Auquier S, et al. It gives a very good overview (non-mathematical) of how important multiple regression is in adjusting for confounding.
- the article Appropriate Uses of Multivariate Analysis (Hanley, Annual Review of Public Health, 1983) in www/epi.mcgill.ca/~web2/courses/c697/datasets.html ...

It is important, before getting into the technicalities in Chapters 8 onwards, to understand WHY we use multiple regression.

8-1 PREVIEW
===========

1. The "best" model: that might not be the ultimate aim.

2. Visualization. It becomes less of a problem if one focuses on 2-D plots and on the residuals.

3. Meaning/Interpretation. Even in lower-dimensional models, one needs a balance between pure expediency, mathematical fits, approximations, and biological reality. I think it was the statistician George Box who said something like "All models are wrong, but some are more helpful than others".

4. Computation. In fact, if one had a lot of patience, one could fit a multiple regression as a series of simple regressions. (See the Weight[age, height] example in my notes on multiple regression in 607, M & M Chapter 2/9.) From it, can you see how this could be extended to 3 or more "X" variables? A short SAS sketch of this idea appears after the digitized-image example in Section 8-3 below.

8-2 MULTIPLE REGRESSION MODELS
==============================

As the authors make clear, ultimately the "size" of the model is the number of terms, and thus coefficients, in the equation, not the number of "basic" variables. Miettinen, in the chapter on regression analysis in his textbook Theoretical Epidemiology, makes a distinction between subject characteristics (determinants) and their translation into STATISTICAL VARIATES, the actual terms in the regression model. A single determinant, smoking, might lead to 4 or 5 terms in the regression model. Obviously, one cannot use a very high order polynomial in a single basic variable to "make a silk purse from a sow's ear".

Example 8.1
===========

It worries me a bit that the authors start out with models with up to 5 terms (the last equation at the top of p. 113, involving X1, X2, their squares and their product) for a little dataset of n = 12 observations. As we will discuss in later chapters, there is a big danger of "overfitting" in such situations, and what is "the best predictive model" for these 12 observations may be very poor in other individuals of the same type as studied here.

8-3 GRAPHICAL LOOK AT THE PROBLEM
=================================

It is going a bit outside of regression models, but another way to think of the case of k = 2 is to think of a digitized image, such as you might obtain if you scanned a photograph into a computer file, using say 256 levels of gray. Then each pixel has a colour (Y) between 0 (say white) and 1 (black), or between 00000000 (0) and 11111111 (255) if using the 8-bit gray scale. The "address" of a particular pixel is in terms of its distance from, say, the bottom left corner (H pixels horizontally, V pixels vertically). If the pattern of colours was sufficiently smooth, one might be able to "fit" (predict) the colour Y of each pixel using a function of H and V. Admittedly one might need a very large number of terms, involving H and V, in the model. In practice, regression functions of H and V are not the most efficient way to "capture" or approximate the signal in an image, but the idea is still the same -- the prediction of Y (here colour) from H and V, with as little "error" as possible.
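Here is the sketch promised in point 4 of the Preview: fitting a multiple regression as a series of SIMPLE regressions. Regress weight on age and keep the residuals; regress height on age and keep the residuals; then regress the first set of residuals on the second. I am assuming the Table 8-1 data have been read into a SAS dataset called KIDS with variables WGT, HGT and AGE -- the dataset and variable names are my own, so adjust them to whatever you use.

   proc reg data=kids noprint;            /* step 1: weight, with age "removed"   */
      model wgt = age;
      output out=res_w r=rw;              /* rw = residual of WGT on AGE          */
   run;

   proc reg data=kids noprint;            /* step 2: height, with age "removed"   */
      model hgt = age;
      output out=res_h r=rh;              /* rh = residual of HGT on AGE          */
   run;

   data both;                             /* put the two sets of residuals        */
      merge res_w(keep=rw) res_h(keep=rh);/* side by side (same observation order)*/
   run;

   proc reg data=both;                    /* step 3: a SIMPLE regression of       */
      model rw = rh;                      /* residual on residual                 */
   run;

The slope from step 3 equals the coefficient of HGT in the two-variable model WGT = HGT AGE (the standard error is slightly off, because this route does not keep track of the degrees of freedom already used up in steps 1 and 2).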
Another way to think of Figure 8-2, one that will help when it comes to the "stability" of a fitted model, is to think of the surface as a trampoline (or flat hammock!) that is being held in place by a series of ropes, with each rope secured to a data point. Imagine the effect of removing one data point (i.e. cutting one rope)! If the data points ("support" for the surface) have co-ordinates along the (X1, X2) diagonal, i.e. mostly along a "South-west to North-east" corridor, there will be considerably more instability. We will use this idea again when we come to "collinearity" in Chapter 12, section 5.

How much can we learn from X1 and X2 separately? (p. 114)
=========================================================

It is IMPORTANT at this stage to UNDERSTAND what is fundamentally different between the following examples:

(i) the 1986 Canadian birthweight data discussed in the notes I had made for Section 9.2 (multiple regression) in 607 [weights as a function of X1 = gestational age and X2 = gender]

(iia) the weights of Alberta children aged 11-16 as a function of X1 = age and X2 = height [in the same notes]

(iib) example 8.1 in the KKMN text, again with weight as a function of age and height.

In the birthweight data, one can estimate the gender difference pretty well just by subtracting the average female weight from the average male weight. One doesn't need to do so "age-specifically" (i.e. taking the difference within each gestational age separately, then averaging, in some way, the age-specific differences). Likewise one can look at the effect of gestational age using the two genders combined and get an answer that is very similar to those which one would get if one looked at the age effects gender-specifically (i.e. within each gender).

However, the same indifference of the answers to how one estimates beta_1 is not true for the "weight as a function of age and height" data. The slope of weight on height is 3.363 lbs/inch if one uses ALL of the data, WITHOUT taking age into account. But the slope is only 2.725 lbs/inch if one takes age into account. You see the same pattern in the textbook example: 1.07 lbs/inch if one ignores age (Model 1, page 121) but only 0.72 lbs/inch if one takes age into account (Model 4, page 122).

The reason that the slope estimate changes in examples (iia) and (iib), but does not in (i), has to do with the profile of the other X variable among those involved in the contrasts implied by the "slope". Effectively -- or conceptually -- the slope of 3.36 lbs/inch was obtained by contrasting the weights of those of a certain height with the weights of those 1 inch taller. But those who are 1 inch taller also tend to be older, so it is not a "pure" contrast, whereas the 2.725 lbs/inch is (if our model is correct!). The same applies in the textbook example: taller children tend to be older, so again the crude slope of weight on height is not "pure". And the same applies if we turn Example 8.1 round the other way, and look at the slope associated with age when we do not, or do, take height into account: 3.64 lbs/year if we ignore height and 2.05 lbs/year if we take it into account.

But we see no such patterns with the birthweight data ... we can estimate the "pure" gender difference even when ignoring gestational age, and the "pure" age effect when ignoring gender, i.e. pooling all the data together. The reason this works here is that when we compare the weights of ALL of the infants of a certain gestational age with the weights of ALL of the infants who were born 1 week later, the male:female mix is about the same each week. Likewise, if we compare the weights of ALL boys with the weights of ALL girls (regardless of gestational age), the boys have a gestational-age profile very like that of the girls -- i.e. it is not that boys are born later!
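If you want to see this "crude versus adjusted" contrast for yourself, two MODEL statements in one PROC REG will do it. Again I am assuming a dataset KIDS with variables WGT, HGT and AGE (my names, not SAS's); with the KKMN data the height coefficient should drop from about 1.07 to about 0.72 when AGE is added.

   proc reg data=kids;
      model wgt = hgt;        /* Model 1: the "crude" slope of weight on height      */
      model wgt = hgt age;    /* Model 4: the height slope after age is "taken       */
   run;                       /*   into account" -- contrast the two HGT coefficients */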
8-4 ASSUMPTIONS OF MULTIPLE REGRESSION
======================================

These are essentially the same as for simple linear regression in Chapter 5, so I won't belabour the points I made there.

As for 2 (Independence), I cannot remember if I gave my usual example of how one can read a paper, from a laboratory which you know has only 7 persons, in which there are a total of 20 observations, and in which the methods state that the "subjects" were the laboratory personnel who volunteered for the study!

3 (Linearity). As the authors partly explain, the "linearity" is at two levels.

First, STRICTLY SPEAKING, linearity refers to the beta's, not to the X's: the mean is a linear combination of the beta's. (See the footnote; or see Miettinen p. 217, or my notes on simple linear regression for Moore and McCabe's Chapter 9.) Linearity in the beta's allows us to fit them by least squares in one pass through the data, and to get standard errors easily, etc.

Second, most users think of linearity in terms of whether -- even if there is just one X -- the conditional mean mu(Y|X) is a straight-line function of X. Even with all the other X's in the model, one can still think of "freezing" them at certain values, concentrating only on the Y variation with respect to the X that one is allowing to vary, and asking whether, in this restricted situation, mu(Y | X in question, all other X's held constant) is a linear function of the X in question.

The authors do a good job of explaining that the polynomial regression model

   mu(Y|X) = beta_0 + beta_1.X + beta_2.X^2

describes a CURVE in X, but is technically a multiple LINEAR regression model, just as if the model had been

   beta_0 + beta_1.Z1 + beta_2.Z2

Indeed it might have been good to have used different letters for the "basic" variables than for the variables actually used in the equation. You will find that in SAS, or any other statistical package, if you want to use higher order terms you will have to "create" them and give them new names -- the regression package will not know how you created them! (A short SAS sketch of this appears just after Section 8-4-2 below.)

4 (Homoscedasticity)
====================

I wish the authors had not used equation 8.3, but rather the "equivalent" version at the top of the next page. Even though 8.3 is technically correct, most students remember the "Y" part of "Y|X1, X2, ..., Xk" but forget the critical "|X1, X2, ..., Xk" part. IF YOU ARE GOING TO FUSS ABOUT THE CONSTANCY (AND GAUSSIAN-NESS) OF THE VARIATION, THINK OF THE VARIATION OF THE E'S, NOT THE Y'S.

5 (Normality)
=============

Again, ask yourself if the Central Limit Theorem is likely to "kick in" when using the t-tests, CI's, etc. concerning the betas. And ask yourself if you want to get the mu's approximately correct before worrying about the sigma's.

8-4-2 (Summary and Comments)
============================

The assumption of a Gaussian distribution is "needed" here to the same extent that it is needed for the 2-sample t-test --- AND HOW BADLY IS IT "NEEDED" THERE?
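As promised above, here is what "creating the terms yourself" looks like in SAS. The dataset and variable names (KIDS, WGT, HGT, AGE) are again my own assumptions; the point is only that AGESQ and HGT_AGE are, as far as PROC REG is concerned, just two more variables.

   data kids2;
      set kids;
      agesq   = age*age;      /* the AGE^2 term of Example 8-2                     */
      hgt_age = hgt*age;      /* a product term, for effect modification later on  */
   run;

   proc reg data=kids2;
      model wgt = hgt age agesq;   /* a curve in AGE, but still LINEAR in the betas */
   run;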
8-5-3
=====

1. To make this concrete, suppose one had a dataset with 4 observations:

   i   X1 (Gender)   X2 (Gest-Age)   Y (birthweight)
   -   -----------   -------------   ---------------
   1        0            36 (0)        2950 grams
   2        0            36 (0)        3050 grams
   3        0            40 (4)        3500 grams
   4        1            40 (4)        3600 grams
   --------------------------------------------------
   (X1 is an indicator variable for males, and X2 is in weeks)

The fitted equation is

   mu(Y|X1, X2) = -1500 + 100.X1 + 125.X2

If instead we "start" X2 at 36 weeks, so we have X2* = X2 - 36, then

   mu(Y|X1, X2*) = 3000 + 100.X1 + 125.X2*

In the latter formulation, we get

   beta_0_hat =  1/2 . Y1 + 1/2 . Y2 + 0 . Y3 + 0 . Y4 = 3000
   beta_1_hat =    0 . Y1 +   0 . Y2 - 1 . Y3 + 1 . Y4 = 100
   beta_2_hat = -1/8 . Y1 - 1/8 . Y2 + 1/4 . Y3 + 0 . Y4 = 125

You can "see" the logic behind the 3 combinations: beta_0_hat is the average of the two females at 36 weeks; beta_1_hat is the difference between the male and the female at 40 weeks; beta_2_hat is the difference between the female at 40 weeks and the average of the 2 females at week 36 -- i.e. Y3 - (1/2)(Y1 + Y2), all divided by 4 because of the 4-week difference in gestational age.

8-5-3
=====

2. Multiple Correlation Coefficient
===================================

This would have been another criterion by which to fit the regression equation: choose the coefficients that maximize the correlation of the predicted and observed values! The multiple correlation coefficient is a useful scalar (1-dimensional) value that gives the big picture without getting into the specifics of how much each X variable contributes to the Y-hat.

The comment about the mean of the predicted Y values equalling the mean of the observed Y values is reminiscent of the fact that in simple linear regression the line goes through (Xbar, Ybar), and that AVE{Yhat} = AVE{Y}. The 2 estimating equations for the simple linear regression case are

   Sum{ (Y - Yhat) } = 0   and   Sum{ X(Y - Yhat) } = 0

In other words, the residuals must add to zero, and the X-weighted sum of the residuals must also be zero. For multiple regression, with variables X1, X2, ..., Xk, one simply extends these to k + 1 estimating equations:

   Sum{    (Y - Yhat) } = 0
   Sum{ X1.(Y - Yhat) } = 0
   Sum{ X2.(Y - Yhat) } = 0
     .
     .
   Sum{ Xk.(Y - Yhat) } = 0

When one substitutes b0, b1, ..., bk and the X's into Yhat, one gets what are called the "Normal Equations" (I don't remember why they are called "Normal"):

   Sum{ Y    } = n.b0       + b1.Sum{X1}    + ... + bk.Sum{Xk}
   Sum{ X1.Y } = b0.Sum{X1} + b1.Sum{X1^2}  + ... + bk.Sum{Xk.X1}
   Sum{ X2.Y } = b0.Sum{X2} + b1.Sum{X1.X2} + ... + bk.Sum{Xk.X2}
     .
     .
   Sum{ Xk.Y } = b0.Sum{Xk} + b1.Sum{X1.Xk} + ... + bk.Sum{Xk^2}

8-5-3 Multivariate Normal Distribution
======================================

I wouldn't worry about this. In any case, it is difficult to imagine k "X" variables and 1 "Y" variable that are jointly multivariate Gaussian.
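You can check the arithmetic above -- the fitted coefficients, the estimating equations and the normal equations -- with a few lines of PROC IML, using the 4-observation dataset from item 1. This is only a sketch; SOLVE is simply doing what the normal equations describe.

   proc iml;
      /* columns of X: intercept, X1 (male indicator), X2* = gestational age - 36 */
      X = { 1 0 0,
            1 0 0,
            1 0 4,
            1 1 4 };
      Y = { 2950, 3050, 3500, 3600 };
      b = solve(X`*X, X`*Y);      /* the normal equations: (X'X)b = X'Y           */
      print b;                    /* should print 3000, 100, 125                  */
      resid = Y - X*b;
      check = X`*resid;           /* each estimating equation sums to zero        */
      print check;                /*   (up to rounding)                           */
   quit;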
Example 8-2 (Fitted equation with Height, Age, and Age^2)
=========================================================

I'm a big proponent of, whenever possible, drawing the fitted equation through the data points. (Statisticians joke about how engineers tend to plot the line first and put the data points in afterwards!) Here, with 1 Y and 3 X's (2 of the X's derived from Age), it is not that easy. But one can choose some specific heights, and graph the fitted average weight vs. Age relation for these selected heights. For example, take heights of 45, 50, 55, and 60 inches. Then the fitted equations are

   3.438 + 0.724(45) + 2.777Age - 0.042Age^2   i.e.   36.0 + 2.777Age - 0.042Age^2
   3.438 + 0.724(50) + 2.777Age - 0.042Age^2   i.e.   39.6 + 2.777Age - 0.042Age^2
   3.438 + 0.724(55) + 2.777Age - 0.042Age^2   i.e.   43.3 + 2.777Age - 0.042Age^2
   3.438 + 0.724(60) + 2.777Age - 0.042Age^2   i.e.   46.9 + 2.777Age - 0.042Age^2

Now at ages 6, 9 and 12, the 2.777Age - 0.042Age^2 values are 15.2, 21.6 and 27.3, so the fitted values for weight (in lbs) are

   Height 60"    62.1     68.5     74.2
   Height 55"    58.5     64.9     70.6
   Height 50"    54.8     61.2     66.9
   Height 45"    51.2     57.6     63.3
                 ======================
                 Age 6    Age 9    Age 12

Mind you, SAS INSIGHT (and SAS GLM and REG, if you ask) produce these fitted values directly. But you will need to be able to do this by hand when we come to equations with product terms, to accommodate effect modification. For a depiction of the fitted equation, see the www page. See if you can distinguish it from the simpler one for Model 4.

8-6 ANOVA for MULTIPLE REGRESSION
=================================

In one sense, the extension to k (> 1) variables is trivial. The partitioning of the total variation into the "regression" and "residual" sums of squares is identical to the "1 X" situation. The degrees of freedom partition in the same way. The only apparent difference (and it isn't really, if one keeps track of how many degrees of freedom are taken up in fitting the model) comes with the F test for testing whether the fitted equation is better than the "null" model

   mu(Y|X1, X2) = B0 + 0.X1 + 0.X2 = B0.

In another sense, though, the fact that k > 1 means that there isn't just one question one can ask. Is a model with X1 and X2 better than one with X1? Is a model with X1 and X2 better than one with X2? Is a model with X1, X2, X3 better than one with X3? etc. These questions will be dealt with in the next chapter.

Re. Equation 8.7 (R-Squared)
============================

Again, I would try to put the sum of squares due to the regression in positive terms, e.g. SUM{ (Yhat - Ybar)^2 }, the sum of the squared deviations of the fitted values from the overall mean (the "amplitudes" of the fitted points).

8-7 Numerical Examples
======================

The authors apologize that they ONLY fitted 6 models, and that many more are possible. But I find that some of the ones they fit, especially going to Models 3 and 5 before, say, models involving powers of height, are not that inspired, and are simply ringing the changes on all the combinations.

From first principles, and just thinking about it, I would argue that weight is primarily a function of height, but that it can't just be linear in height itself. If anything, if to a first approximation we think of the person as a "cylinder", wouldn't weight be proportional to volume, which in turn is related to height (and "diameter")? The fact that height alone does a reasonable job is probably for the same reason that, for example, the perimeter (P) of a rectangle is a reasonably good predictor of the AREA of the rectangle -- within a certain range of rectangle sizes.

Take the following example (modelled after one on p. 199 of the excellent introductory text "Statistics" by Freedman, Pisani et al., 2nd edition, Norton, New York, 1991). Here are the dimensions of several rectangles:

   rectangle    L=length   W=width   P=perimeter   A=area
    number        (in)       (in)       (in)       (sq in)
   --------------------------------------------------------
       1           11        8.5         39          93.5
       2            3        5.0         16          15.0
       3            4        4.0         16          16.0
       4            8        8.0         32          64.0
       5           11       14.0         50         154.0
       6            7        5.0         24          35.0
       7            9        5.0         28          45.0

Fitting the area as a linear function of length and width gives

   A-hat = -51.4 + 4.7L + 10.8W     R-Square = 0.9923 ; Adj R-Sq = 0.9884

In these examples, the simple-minded (but incorrect) model is a good approximation to the true model mu[log Area] = 1.log[L] + 1.log[W].
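The rectangle example is small enough to try directly. The DATA step below types in the 7 rectangles from the table; the first MODEL statement checks how well the perimeter alone predicts the area, the second should reproduce the fitted equation and R-square quoted above, and the third fits the "true" multiplicative model on the log scale (both slopes come out at 1, and the fit is exact, since Area = L x W exactly).

   data rect;
      input L W A;                 /* length, width, area, from the table above */
      P    = 2*(L + W);            /* perimeter                                 */
      logA = log(A);  logL = log(L);  logW = log(W);
      datalines;
   11 8.5  93.5
    3 5    15
    4 4    16
    8 8    64
   11 14  154
    7 5    35
    9 5    45
   ;
   run;

   proc reg data=rect;
      model A = P;                 /* perimeter as a stand-in for area           */
      model A = L W;               /* should give A-hat = -51.4 + 4.7L + 10.8W   */
      model logA = logL logW;      /* the "true" model: log A = log L + log W    */
   run;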
The "residuals" in the fitted model are not "errors" of nature, but rather are induced by our imperfect model. Remember Box's "all models are wrong, but ..."

I don't know if the 12 observations in Table 8-1 are "real" or "made-up", but if they are real, I would go with a model such as

   Weight/(Height^3) = B0,   or   = B0 + B1.Age,  etc.

or one might use the square, rather than the cube, of height. (A short SAS sketch of this appears at the end of these notes.) Note that even if we are strictly mathematical and non-biological about it, the models with more terms in them don't do appreciably better than the one with just height and age. And don't be impressed by the increasing R-squares. The R-square MUST increase as terms are added -- but, as we will see, the model may do less well at predicting new observations. That is the point of the "adjusted" R-square (Adj R-Sq).

And finally, speaking of models, recall Galton's calculation of the relationship between Y = child's (adult) height in inches, X1 = average of the heights of the child's parents, and X2 = the gender of the child (0 if female, 1 if male). Galton wanted to come up with ONE correlation, rather than 2 gender-specific correlations. If we were doing this analysis today, using what WE now call multiple REGRESSION (we took the word from Galton, but changed its meaning!), we might get an equation like

   mu(Y|X1, X2) = 64 + 0.6(X1 - 66) + 4.X2

Or, if we "partialled out" the influence of gender on height, we might be left with a single partial correlation of 0.65 between mid-parent and child's heights.

But notice what Galton did! If he had ignored gender, and just computed the correlation in all 928 observations pooled together, he would undoubtedly have gotten a lower correlation than the 0.67 he observed. The reason for this is shown in my "correlations obscured" diagram in the 607 notes, with the mid-parent height on the vertical axis and the child height on the horizontal: the left cloud of data is for females, and the right cloud is for males, so that the merged cloud shows less correlation than the gender-specific correlations. Instead, "all female heights were multiplied by 1.08 before tabulation". In effect, this is the mathematical equivalent of "turning females into males". In fact, it is a more elegant way than what we do by fitting

   64 + 0.6(ParentHeight - 66) + 4 if male.

On the basis of this model, we mathematically turn females into males by adding 4 inches to their heights. But Galton's "proportional scaling" is more elegant, and more biologically appealing, than ours:

   female height   "male equivalent"              "male equivalent"
                   Galton's model (x 1.08)        our model (+ 4")
   -----------------------------------------------------------------
        55"               59.4"                         59"
        60"               64.8"                         64"
        65"               70.2"                         69"
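Finally, the SAS sketch promised above for the "cylinder" idea. The dataset and variable names are again my assumptions (KIDS, with WGT in lbs, HGT in inches, AGE in years); the point is just to create the scaled response and see how much work is left for AGE to do.

   data kids3;
      set kids;
      w_h3 = wgt / (hgt**3);    /* Weight/(Height^3): roughly constant if the child  */
      w_h2 = wgt / (hgt**2);    /*   were a cylinder of fixed diameter-to-height     */
   run;                         /*   ratio; the squared-height alternative as well   */

   proc reg data=kids3;
      model w_h3 = age;         /* Weight/(Height^3) = B0 + B1.Age                   */
      model w_h2 = age;         /* Weight/(Height^2) = B0 + B1.Age                   */
   run;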