NOTES ON KKMN CHAPTER 5                                        jh 1999.05.29

5-1 Preview
===========
Note that in this chapter, and indeed in all the chapters up to 21, it is
implicit that Y is measured on a continuous (or effectively continuous) scale.

5-2
===
"Finding the curve that best fits the data" was a purely MATHEMATICAL problem
long before it became a STATISTICAL problem. The use of LEAST SQUARES in
section 5-5-1 as the fitting criterion does not involve statistical
assumptions or models, but it does involve a PARTICULAR definition of how we
rank the fits of different lines/curves.

If we were to treat the problem as simply finding a line/curve that is somehow
"close" to the data points, then presumably it wouldn't matter if we looked at
the data with the X values on the vertical axis and the Y values on the
horizontal axis. But, in regression, it DOES matter which variable is plotted
on the vertical axis. The authors could have made this clearer if they had
used a non-symmetric phrasing: they speak of approximating the "true
relationship BETWEEN X and Y." It might have been better to speak of how, at a
particular X value, the location and possibly the spread of all of the
possible Y values is related to, "predictable from", or "driven by" the X
value.

Note also that the object of interest is not the relationship in the OBSERVED
Y's, but rather in the UNOBSERVED Y's. That's why the authors speak of the
TRUE relationship; this idea that we are trying to learn about the
mathematical relationship BEHIND the observed data, and that statistical
inference is about the data we did NOT observe, is a key one to keep in mind
throughout the course. If we weren't interested in the "behind the observed
data" situation, and only in the empirical values, then there would be nothing
more to say after one has plotted the data. The only justification for
pursuing a regression model would be if the data were so voluminous that the
line or curve was viewed simply as a data-summarization technique -- a bit
like what is done with data-compression techniques that involve some
(negligible) "loss" when the compressed data are "unpacked".

5-2-1
=====
The authors give the impression that, for any one unit, the "X" and "Y" values
are always observed "simultaneously", as happens in what epidemiologists call
a "cross-sectional" study. Whereas this may often be the case, it is better to
think of the units as having first been SELECTED on the basis of their X
values, and then MEASURED (observed) with respect to their Y values. This
viewpoint serves two useful purposes: (1) it emphasizes that the X values are
not "random" in the same way as the Y's, and (2) it reminds us that, if one
has a choice, one can be efficient about which X's to study.

Imagine that an investigator was interested in the relationship between height
and weight (or, more correctly, the influence of X = height ON Y = weight).
Suppose (s)he could obtain a list of persons' heights from, say, a file of
drivers' licenses. Then it makes more sense to deliberately study, say, 5
persons at each level of height, rather than taking a blind sample that gives
the naturalistic distribution of heights in the source. The heights of the
randomly chosen persons in the study are determined by (a) nature, (b) the
stratified sampling scheme, if used, and (c) the random selection mechanism.
Nevertheless, in the regression analysis, this randomness in the heights is
not used. However, stratified selection is much more efficient (makes for less
variable estimates of the "slope" of weight on height) in this situation than
the use of a blind (unstratified) selection -- see the small sketch below. It
is somewhat ironic that, from a biologic viewpoint, X = height is a variable
that the "owner" has little control over, whereas Y = weight is more
"elective" and somewhat more under the "owner's" control.
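A minimal sketch, in Python, of why the stratified selection is more
efficient. The numbers (50 subjects, heights from 60" to 78", a within-height
SD of 10 lbs) are invented for illustration; the point rests only on the fact
that the standard error of a least-squares slope is sigma / sqrt(sum of
(x - xbar)^2), so the design that spreads the X's out more wins.

    import numpy as np

    rng = np.random.default_rng(1)
    n, sigma = 50, 10.0                  # 50 subjects; within-height SD of weight (lbs)

    # (a) stratified: 5 persons at each of 10 height levels, 60" to 78"
    x_strat = np.repeat(np.linspace(60, 78, 10), 5)
    # (b) "blind" sample: heights as they come in the source population
    x_blind = rng.normal(67, 3, size=n)

    def se_slope(x, sigma):
        # theoretical SD of the least-squares slope, for a given set of X's
        return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

    print("SE of slope, stratified selection:", round(se_slope(x_strat, sigma), 3))
    print("SE of slope, blind selection     :", round(se_slope(x_blind, sigma), 3))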
One last point on the fact that we treat the X's in a regression as "knowns":
in actual fact, the heights in the drivers' license database are self-reported
and subject to both random errors of measurement and, even if measured well,
non-random (!!) errors of reporting. Unfortunately, the effects of such
"errors in X" are typically ignored in reports of regression analyses. In the
example of the regression of Y = blood pressure on X = age, there could also
be errors in the reporting and/or recording and/or computerization of age. For
persons who are the same age, the observed variation in BP across these
persons is a composite of several sources: (i) true inter-person variation,
(ii) true intra-person variation (again biological), and (iii) measurement and
recording errors. In many applications, it is not possible, without extra work
or outside information, to separate these three components.

Figure 5-1
==========
Note the // and \\ marks to indicate that the Y and X axes do not display
zero. This is good practice. I doubt if any of the plotting facilities in the
commonly used software packages allow for such marks.

5-2-2 Basic Questions
=====================
Note that at this stage, the authors do not really give the purpose of the
"model" (line/curve/ ...). Is it to try to "get close" to the Y's? Is the line
or curve supposed to be an estimate of the "centre" of the Y data at each X
value, and in what sense do we mean "centre"? Is the object to fit these
particular data well, or to estimate a model for all of the data not shown in
the figure? Why are we fixated only on the "centres" (however defined) and not
on describing how (vertically) VARIABLE the data are about these "centres"?

5-2-3
=====
No comment!! (But lots later!)

5-3
===
It is interesting that a few hundred years ago, scientists would use A and B
where we now use X and Y, and X and Y where we use A and B -- i.e., they used
X and Y for the coefficients and A and B for the variables. In high school, I
learned the equation of a line as y = mx + b, i.e. with m for slope and b for
intercept. Other commonly used letters are a for "intercept" and b for
"slope", i.e., y = a + bx.

Good examples of mathematical straight lines are the relationships between
temperature in Fahrenheit and Celsius, i.e.,

   F = 32 + (9/5)C   or   C = (5/9)(F - 32) = -160/9 + (5/9)F

---------
Q: At what temperature, sometimes found in Canadian Prairie winters, is the
value the same in F and C?
----------

Note that this perfect, purely mathematical situation is the only time that
the slope of the regression of F on C, 9/5, is the EXACT inverse of the slope
of the regression of C on F, 5/9. Whenever, because of imperfect measurements,
or biological variation, or whatever other reasons, the data points do not lie
exactly on the line, the slopes are not the exact reciprocals of each other --
the small check below illustrates this.
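A small check, in Python, that the two least-squares slopes -- F on C and C on
F -- are exact reciprocals only when the points lie exactly on the line; with
any scatter, their product is r-squared, which is less than 1. The "sloppy
thermometer" noise (an SD of 3 degrees F) is invented for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    C = np.arange(-40, 41, 5.0)                      # Celsius values
    F_exact = 32 + (9/5) * C                         # exact conversion
    F_noisy = F_exact + rng.normal(0, 3, C.size)     # imperfectly "measured" F

    def slope(y, x):
        # least-squares slope of y on x
        return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)

    for F, label in [(F_exact, "exact"), (F_noisy, "noisy")]:
        b_FonC, b_ConF = slope(F, C), slope(C, F)
        print(label, ": F on C =", round(b_FonC, 3), "; C on F =", round(b_ConF, 3),
              "; product =", round(b_FonC * b_ConF, 3))   # product = r^2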
Note that it is not always helpful to write the equation in terms of the
intercept at X = 0; often, as in the example of C as a function of F, it is
better to start F at 32, and then show how much C moves up from there, rather
than from F = 0. This is not just because C = 0 at F = 32. One might want to
start "in the middle", with say the usual ambient Canadian summer (or Addis
Ababa all year) temperature, say F = 68, and write the relation as

   C = 20 + (5/9)(F - 68)

Likewise, if our concern was with translating body temperatures, we might
start at F = 98.6, giving

   C = 37 + (5/9)(F - 98.6)

The point is that one can start anywhere, so why not start at some relevant
value, like the effective BEGINNING or MIDDLE of the X values. It doesn't have
to be the middle exactly ... any convenient location is fine. If you read the
article by Mosteller (on the web page), note how he writes of the "intercept
at xbar".

Good examples of this are some equations in the Montreal Gazette in 1995 (see
article under Chapter 11 below) giving "IDEAL" body weight as a function of
height:

   IDEAL WEIGHT = 100 lbs + 5 lbs for every 1" over 5 feet, if female
                = 106 lbs + 6 lbs for every 1" over 5 feet, if male
   i.e.
   IW(lbs) = 100 lbs + 5(H" - 60"), if female
   IW(lbs) = 106 lbs + 6(H" - 60"), if male

We could equally have written these as

   IW = -200 lbs + 5(Height in inches), if female
   IW = -254 lbs + 6(Height in inches), if male

but they wouldn't be as useful in this form!

------------
Exercise: Convert these equations to kilograms and centimetres.
------------

The exercise will probably show you that the equations above are not
technically accurate, since the units do not match all the way across. Keeping
track of the correct units makes it clear what the units of the slope are. The
left hand side is in lbs. The "intercept" of 100 on the right must also be in
lbs. The height is in inches; to make sure that the product of the 5 and the
(H" - 60") is also in lbs, we need to say that the slope is not the UNITLESS
5, but rather 5 lbs per 1". Then the product of the lbs/inch and the inches
yields lbs, matching the intercept and the left hand side of the equation.

Incidentally, the data in Figure 5-1 remind me of the "rule of thumb"

   Blood Pressure | my age = 100 + my age

Technically speaking, this should be

   BP (mm) | my age in years = 100 mm + (1 mm/yr) x age in years

Note that mathematically this is the same as 125 mm + (1 mm/yr)(age - 25).
A tiny sketch of this "start anywhere" re-writing follows.
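A tiny sketch, in Python, of the re-writing: the same line b0 + b1*X can be
re-expressed, without changing it in any way, as (value at X0) + b1*(X - X0)
for ANY convenient reference value X0. The two examples re-use the equations
above.

    def value_at(b0, b1, x0):
        # the height of the line b0 + b1*X at X = x0,
        # i.e. the "intercept" when the line is written relative to X = x0
        return b0 + b1 * x0

    # "ideal weight" (female): IW = -200 + 5*H is the same line as 100 + 5*(H - 60)
    print(value_at(-200, 5, 60))    # 100
    # blood-pressure rule:     BP = 100 + 1*age is the same as  125 + 1*(age - 25)
    print(value_at(100, 1, 25))     # 125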
Going back to the use of the Greek letters beta_0 and beta_1: if one were just
speaking mathematically about equations of straight lines, one would not use
Greek letters for the slope and intercept. The reason the Greek letters are
used here is to denote statistical PARAMETER values, just like we use the
Greek letters mu, pi and sigma (think of beta's as differences in mu's,
divided by differences in X's!!). In real life, we will never be able to
observe these parameter values. They are technically "UNKNOWABLE". Instead,
since all of our datasets are FINITE, we will only be able to derive ESTIMATES
of the parameters, using the STATISTICS calculated from the observed data.
Moore and McCabe remind us to associate "Parameter" with "Population" (or
Universe, or IN THE ABSTRACT) and "Statistic" with "Sample". In the KKMN text,
a parameter estimate is denoted by the symbol for the parameter (beta_1 for
example) with a hat ("chapeau") over it.

Some epidemiologists find statisticians' use of Greek letters (and hats)
pretentious, and use instead capital (upper case) letters for parameters and
the corresponding lower case letters for the statistics (estimates of the
parameters). Thus, for example, where a statistician might use the Greek
letter pi for the theoretical proportion and pi_hat for an estimate of it,
these authors use P and p. Likewise, they use "OR" for the theoretical odds
ratio and lower case "or" for its empirical value. (Theoretical statisticians
denote the theoretical odds ratio by the Greek symbol psi.) This same, quite
appealing, scheme of using upper case for the theoretical and lower case for
the empirical (sample) version carries over nicely for regression coefficients
-- with B0, B1, B2, ... for the theoretical (unknowable) coefficients and b0,
b1, b2, ... for their empirical counterparts. [Some statistical texts
compromise, using Greek betas for parameters, and lower case b's for estimates
of them!!]

Quite apart from any pretentiousness, and the math-anxiety engendered by fancy
Greek symbols and hats, there is another practical reason for the simpler
"B0, B1, B2" / "b0, b1, b2" usage -- they can be written in plain ASCII
text!! Moreover, the estimates in the output from computer packages (i.e. the
estimates of the parameters) never come with hats on them. The only exception
is on page 50, where the authors have typed beta_0_hat and beta_1_hat (in
Greek and with hats) onto the output produced by SAS. Note there that in the
regular output each estimate is called just that -- "parameter estimate".

5-4 Assumptions for Straight-Line "Model"
=========================================
Para 2: The authors cannot shake their obsession with trying to predict
individual Y's. If we measured Y and X for every individual in a universe, we
could get perfectly precise estimates of the mean Y for persons with each
value of X. I wouldn't call these "approximations"; rather, I would say we
estimated the different (X-specific) MEAN responses very well. But the fact
that there are a large number of individuals at a given value of X doesn't
make them any less (or more!) variable as to their Y values -- i.e. they
remain INDIVIDUALS. We are a long way from being able to predict (or explain)
why -- even if they had the same values on the 10 most important predictors --
babies differ from one another with respect to birth weight.

Para 3: The authors focus on the parameters of the STRAIGHT LINE, i.e. of the
X-specific CENTRES of the Y distributions, and ignore for now what else one
needs in order to "predict" where individuals will be around these centres --
some X-specific measure of VARIATION about the X-specific mean Y.

The 5 "assumptions" or conditions "needed" to make inferences concerning the
"true" or "theoretical" line are overly stringent. If one truly is concerned
with where the line is, one can in certain situations make valid inferences
with somewhat less demanding assumptions. This is especially the case with
respect to "normality" (alias "Gaussian-ness"). If the "independence" and
"homoscedasticity" assumptions are not fulfilled, the main casualty is that
confidence intervals for the parameters of the true (theoretical) line may be
somewhat inaccurate. (Homoscedasticity would be important if one were
constructing growth curves, where the variation in height at the younger end
of the age scale is less than at the older end.) Inaccuracies in standard
errors and confidence intervals for the parameters of the line (and thus for
the MEAN Y at a given X value) can generally be "fixed" without having to
throw out what may be a perfectly good straight line assumption just to try to
satisfy the other requirements.

It is worth examining the logic behind steps 1-6 on page 41.
I would not fuss about "normality" at step 3, especially if in steps 4/5 I
might decide that a straight line was inadequate and I was going to try a more
complex model. I would leave "normality" and "homoscedasticity" to the end,
and even then I would put them subservient to the fit of the line or curve.
(Incidentally, it is not clear how, in step 5, one can "repeat step 3", i.e.
fit a straight line.) The steps are more accurately described in Figure 5-2
than in the text above it.

5-4-1 (assumptions)
===================

1 - "Existence"
===============
I am not sure I really understand why this condition wouldn't always be
satisfied. To me, this "assumption" is really a DESCRIPTION of the regression
situation itself (X-specific Y means, which we hope to find a pattern for).
Incidentally, one shouldn't insist that such Y|X distributions have to exist
for ALL possible X's in the range of X. For example, if Y was birthweight, and
X was birth order, one wouldn't insist on investigating the mean and standard
deviation of Y when X = 1.5 or X = 2.3.

If each mean(Y|X) is a "dot", and even if each mean is based on a very large
number of observations, the idea of a regression is that we do NOT "join the
dots" (as was done in Figure 5-4), but rather that we find a smooth line or
curve (a function of X) that is a parsimonious approximation to the sequence
of MEANS. I mentioned in an earlier chapter that the "dots" could have been
some other measure of the "centres" of the X-specific Y distributions. If it
weren't for the intractability of working with them, medians would have been a
useful alternative. Indeed, the published data on Canadian
gestational-age-specific birth weights used medians rather than means -- and
10th and 90th percentiles rather than standard deviations. In that situation,
the data were so extensive, and the pattern of centres so smooth a function of
age, that there was no need for any further "smoothing" by regression. And in
any case, the focus is always gestational-age-specific!

2 - Independence
================
The invalid statistical conclusions are in respect to interval estimates
(confidence intervals) rather than to point estimates.

3 - Linearity
=============
In one sense, of all 5, this is surely the most important. If one cannot well
approximate (estimate) the "centres", then what hope does one have of going
further and describing the range of variation of the individual Y's? On the
other hand, we don't want the best to be the enemy of the good [have I got
this the right way round?]. There is a great danger of overcomplicating the
fitted models, and even of being led astray by "chasing" every single twist
and turn. The complex functions fitted to average alcohol consumption versus
age [ref] in a population, and to WBC as a function of time in an individual,
are two examples of "computers over reason". Likewise, if the data are
extensive, the pattern of means may be an "adequately accurate" straight line
function of X, but a formal statistical test may indicate that a straight line
"does not fit." The departures from linearity may be extreme in p-value terms
because of the large n on which the test was based, but of no practical
importance in the big scheme of things.

In addition to equations 5.2 and 5.3, it might be good to make the regression
even more explicit:

   Y|X      = mu(Y|X) + individual variation
   mu(Y|X)  = B0 + B1.X        [I am using "." for multiplication here]

Think of 5.2 as the "systematic" part, onto which one adds "individuality".
The small simulation below makes this two-part structure concrete.
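A minimal simulation, in Python. The parameter values 98.7, 0.97 and 17.3 are
borrowed from the SBP-on-age example that comes later in these notes; the ages
themselves are invented. Each Y is built as the systematic part mu(Y|X) =
B0 + B1.X plus that individual's own "individuality" E; the least-squares fit
recovers the line of means, but individuals still scatter about it.

    import numpy as np

    rng = np.random.default_rng(3)
    B0, B1, sigma = 98.7, 0.97, 17.3           # "true" intercept, slope, and SD of E
    age = rng.uniform(20, 70, size=200)        # 200 hypothetical subjects
    mu  = B0 + B1 * age                        # systematic part: the line of means
    E   = rng.normal(0, sigma, size=age.size)  # individual variation about the line
    sbp = mu + E                               # what we actually observe

    b1, b0 = np.polyfit(age, sbp, 1)           # the fit recovers the LINE well ...
    print("fitted line:", round(b0, 1), "+", round(b1, 2), "x age")
    resid_sd = np.std(sbp - (b0 + b1 * age), ddof=2)
    print("... but individuals still scatter; residual SD =", round(resid_sd, 1))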
As discussed earlier, the authors are being a bit unrealistic in giving the
impression (2nd para, p 45) that the large variation in Y = birth weight
across infants born at, say, X = 37 weeks represents "errors" from the average
birthweight of 37-week-olds. "Deviations" is a somewhat less "loaded" term.
Even then, people could be forgiven for imagining that statisticians, with
their preoccupation with errors and deviants, deserve the accolade of "dull"
or morbid scientists more than economists do.

4 - Homoscedasticity [equal variances]
======================================
Statisticians have a habit of thinking and writing about variation in the
squared scale. They use the word "variance" in a technical sense, for the
square of the standard deviation. There are good theoretical reasons for
working this way, the most important being that the variation -- whether in
the natural scale or in the square -- of an aggregate statistic usually
involves the sums of the variances associated with the individual components.
One cannot combine standard deviations directly; one must combine their
squares. Then, to get back to the natural scale, one must take the square
root. For the purposes here, the homoscedasticity assumption can equally be
written

   sigma(Y|X) = sigma, for all X.

Homoscedasticity, or rather the far more common heteroscedasticity, is not a
first-order worry, UNLESS the concern is with establishing X-specific
percentiles for individual values of Y (as in growth charts). Mercifully, this
textbook is a lot less fussed about heteroscedasticity than others (e.g. Neter
et al). Moreover, one of the "fixes" (transformations -- see p. 252) can ruin
an otherwise perfectly reasonable linear relationship. Instead, in such
instances one can preserve the linearity and give lesser weights to
observations that are more variable (p. 250).

5 - Normality (Gaussian-ness)
=============================
The authors' plea to give considerable leeway before switching from an
otherwise reasonably-fitting model should be heeded. The one situation where
the assumption is CRITICAL is when the fitting goes beyond the usual focus on
CENTRES to the estimation of PERCENTILES of the distribution of INDIVIDUAL Y
values. Again, a good example is the construction of growth charts, where not
only are the individual height variations wider at the older than the younger
end, but the variations may not even be symmetric -- let alone Gaussian!

In situations where the focus IS on the CENTRES (i.e. on the line or curve),
and the sample sizes are reasonably large, the Gaussian-ness or
non-Gaussian-ness of what the authors call the E's becomes a non-issue. This
is for the same reasons that the Z- or t-distribution is a reasonable
reference distribution for STATISTICS like a mean, or a difference of means.
The Central Limit Theorem, APPLIED TO THE E's, ensures that even if the E's
are not Gaussian, aggregate statistics calculated from them have
closer-to-Gaussian variation. I emphasized in an earlier chapter that
estimated slopes (regression coefficients) are linear combinations of Y's;
each Y in turn is the sum of a constant (but unknowable) mu(Y|X) and an E. So
the random component of the slope is a linear combination of E's.

5-4-2 Summary & Comments
========================
The first paragraph makes an important point, namely that the Gaussian-ness
(and homoscedasticity) are in terms of the E's, not the Y's. All too often,
students (and others who should know better) test or inspect for "normality"
in the OVERALL dataset (i.e. collapsed over levels of X) rather than
X-SPECIFICALLY. Recall my comments about the near Gaussian-ness of
GENDER-SPECIFIC adult heights, but the clear non-Gaussian-ness of the heights
of undifferentiated adults. The small simulation below makes this point with
heights.
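A small simulation, in Python; the height figures (163 cm and 177 cm, SD 6 cm)
are ballpark values chosen only for illustration. Within each gender the
heights are (made to be) Gaussian, so their excess kurtosis is near 0; the
pooled, undifferentiated heights are a mixture, and are not Gaussian.

    import numpy as np

    rng = np.random.default_rng(4)
    women  = rng.normal(163, 6, 1000)        # gender-specific heights: Gaussian
    men    = rng.normal(177, 6, 1000)
    pooled = np.concatenate([women, men])    # "undifferentiated adults": a mixture

    def excess_kurtosis(x):
        # roughly 0 for a Gaussian sample; clearly negative for this flat-topped mixture
        z = (x - x.mean()) / x.std()
        return np.mean(z**4) - 3

    for label, x in [("women ", women), ("men   ", men), ("pooled", pooled)]:
        print(label, round(excess_kurtosis(x), 2))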
The second paragraph makes some VERY important distinctions. Please try not to
mix them up. It's too bad that the authors, even though they acknowledge the
confusion the term "normal" can cause, do not themselves try to avoid the
term. I'm not sure where the term "normal" distribution originated, but my
guess is that it goes back to Quetelet and Galton and the idea of "l'homme
moyen". Lastly, on another historical note, some would argue with giving all
the credit for the equation of the "Bell Curve" to Carl Gauss. For portraits
of some of those who "discovered" the equation, see
www-groups.dcs.st-and.ac.uk/~history/PictDisplay/Gauss.html
Stigler's book gives some of the history of the equation and of the large
subsequent role of Quetelet and Galton.

5-5 Determining the best-fitting Line
=====================================
Eye-fits tend to overestimate the slope; see the Mosteller article on the web
page (I will also try to demonstrate this). The reason has to do mainly with
the criterion our eye (brain) uses. Whereas the least squares method minimizes
the average squared VERTICAL deviation of the Y's from the line, our eye uses
instead the PERPENDICULAR distances of the points from the line. Mosteller
refers to this as the major axis, or principal component. Thus, our eye would
tend to give the same line whether we asked for a fit of Y on X, or of X on Y.
The least squares line of Y on X does not have a slope which is the reciprocal
of the slope of the least squares line of X on Y. The eye-fit, using the
perpendicular deviations, is usually in between the two least squares lines.

5-5-1 Least Squares Estimator
=============================
Note that we are minimizing the sum of the squared deviations of Y from the
line. Note also that this is a purely mathematical criterion, leading to a
purely mathematical solution.

5-5-2 Minimum Variance Estimator
================================
Note that here the focus is on getting good estimates of beta_0 and beta_1 per
se, rather than on getting a line that is close to the data.

5-5-3 Least Squares Solution
============================
The method, and the proof, date back to just after the French Revolution. In
one of the most famous applications, those charged with deciding how large the
circumference of the earth was (upon which the length of the metre was based)
had to reconcile the fact that 21 observations, involving 3 unknowns, didn't
"add up", and so must have contained errors. Rather than solving the equations
3 at a time, and averaging the 7 sets of answers, Legendre arrived at the
elegant and less arbitrary Methode des moindres quarres.
www-groups.dcs.st-and.ac.uk/~history/PictDisplay/Legendre.html

A correction regarding computer packages (p. 49): SAS, SPSS, SYSTAT, MINITAB
and GLIM are available for Mac computers.

Once one has calculated the slope beta_1_hat, or b1, via equation 5.4, it is
easy to see how one obtains the fitted intercept via 5.5. One uses the fact
that the regression line passes through the point (Xbar, Ybar). Then, to find
the intercept, one simply "follows the line" until one reaches X = 0. If Xbar
is positive, one travels to the left, by a horizontal distance of Xbar. Since
the slope ("rise"/"run") is b1, the vertical drop (or rise) from Ybar is b1
times Xbar, leading to equation 5.5. A numerical sketch follows.
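A sketch, in Python, of equations 5.4 and 5.5 on a small set of invented
(age, SBP) pairs,

   b1 = sum (x - xbar)(y - ybar) / sum (x - xbar)^2      [eq. 5.4]
   b0 = ybar - b1 * xbar                                 [eq. 5.5]

together with a check that the fitted line does pass through (Xbar, Ybar).

    import numpy as np

    x = np.array([25., 35., 45., 55., 65.])          # hypothetical ages
    y = np.array([118., 130., 152., 148., 165.])     # hypothetical SBP's

    xbar, ybar = x.mean(), y.mean()
    b1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar)**2)   # eq. 5.4
    b0 = ybar - b1 * xbar                                          # eq. 5.5
    print("b1 =", round(b1, 3), "  b0 =", round(b0, 2))
    print("fitted value at xbar:", round(b0 + b1 * xbar, 2), " = ybar =", ybar)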
Equation 5.7 "CENTERED" VERSION
===============================
This re-expression is VERY IMPORTANT. I have alluded to it earlier when
discussing Fahrenheit as a function of Celsius (and vice-versa). It is
particularly important if in fact the data are far from X = 0, i.e. if they
are, say, yearly Y's for the years X = 1970 to X = 1999, or Y = number of
hurricanes to strike the U.S. each decade (X) since the year 1900 AD. [data,
SAS program and documentation on the www page for course 626]

If we put in the data as

      X                  Y
    190  (1900-1909)     6
    191  (1910-1919)     8
    192  (1920-1929)     5
    193  (1930-1939)     8
    194  (1940-1949)     8
    195  (1950-1959)     9
    196  (1960-1969)     6
    197  (1970-1979)     4
    198  (1980-1989)     6
    199  (1990-1999)     2

then the fitted equation is (using "Ave" for "average")

   Ave Y {No. hurricanes/decade} = 76.9 - 0.3636 X

This is not very helpful, since it first requires one to substitute values for
1900 (X = 190 decades since 0 AD) and 1990 (X = 199 decades) in order to know
roughly what numbers per decade we are talking about. The intercept (estimate
76.9 of Y for the decade starting at 0 AD) is not of any interest. Moreover,
it is of more than doubtful precision, given the large extrapolation error in
projecting back that far from a (relatively) short series.

Indeed, quite apart from the statistical dangers of back-projection, this
example is interesting for its illustration of the numerical errors caused by
rounding. The fitted equation is 76.9 - 0.3636 X. If we substitute X = 190,
and use 4 decimal places in the slope, i.e. -0.3636, we get a fitted Y of 7.8
for X = 190 and 4.5 for X = 199. If we use just two decimal places for the
slope, i.e. b = -0.36, we get 8.5 for X = 190 and 5.3 for X = 199. Small
errors in the slope cause big differences when it is used to "project" the
line forward from the 1st decade of the first millennium.

How about STARTING at the year 1900 (decade 190)? If we put in the data as

      X                  Y
      0  (1900-1909)     6
      1  (1910-1919)     8
      2  (1920-1929)     5
      3  (1930-1939)     8
      4  (1940-1949)     8
      5  (1950-1959)     9
      6  (1960-1969)     6
      7  (1970-1979)     4
      8  (1980-1989)     6
      9  (1990-1999)     2

then

   Ave Y {No. hurricanes/decade} = 7.8 - 0.3636 x (decades since 1900)

You see that it makes more sense to set our "origin" at 1900. And, even if you
carry fewer decimals, say

   Ave Y {No. hurricanes/decade} = 7.8 - 0.4 x (decades since 1900)

you will not create big errors: e.g. 7.8 - 0.4 x 9 gives 4.2 for the last
decade, vs. 4.5 if you carry out the calculation with b = -0.3636. Remember
this example for when we come to discuss the structure of the formula (pp.
53-54) for the precision of the estimated intercept! This example reminds us
that the "origin" is arbitrary and that -- contrary to the impression given by
the text -- ANY sensible starting point, NOT JUST THE MEAN, works.

On a data quality issue: some of you may have already objected that the Y for
the last decade in the series may not be correct, since the decade isn't quite
over. In fact, the data only go up to 1995 (source: USA TODAY, August 1995).
For the text of the article on hurricanes, see
http://www.epi.mcgill.ca/hanley/c626/

There is one technical statistical reason to "center" the data around X = Xbar
rather than, say, X = Xmin. This is covered later, on pages 245-248 of the
text. A short sketch of the two fits, and of the rounding issue, follows.
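A sketch, in Python, of the hurricane example: the same 10 counts fitted with
the origin at 0 AD and with the origin at 1900. The slope is identical; only
the intercept (and its usefulness) changes, and rounding the slope hurts far
more when the fitted value has to be "projected" 190-odd decades away from the
origin.

    import numpy as np

    counts = np.array([6, 8, 5, 8, 8, 9, 6, 4, 6, 2], dtype=float)
    x_ad   = np.arange(190, 200, dtype=float)    # decades since 0 AD
    x_1900 = np.arange(0, 10, dtype=float)       # decades since 1900

    for x, label in [(x_ad, "origin 0 AD "), (x_1900, "origin 1900 ")]:
        b1, b0 = np.polyfit(x, counts, 1)
        print(label, ": slope =", round(b1, 4), " intercept =", round(b0, 1))

    # the rounding issue: project from the 0 AD intercept to the 1990s (X = 199)
    b0_ad = np.polyfit(x_ad, counts, 1)[1]
    for b1_rounded in (-0.3636, -0.36):
        print("slope carried as", b1_rounded, "-> fitted Y for the 1990s:",
              round(b0_ad + b1_rounded * 199, 1))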
Output 5-1 (data in Table 5-1)
==============================
The SAS commands to produce all of the output shown (as well as a lot more not
shown!!) are:

   PROC REG ALL;  MODEL SBP = AGE;

Most times, you can omit the "ALL" -- and also save some trees! If you wish,
you can use selected options. You can reach the interactive INSIGHT facility
in SAS via the Globals menu:

   Globals -> Analyze -> Interactive data analysis

See the INSIGHT Primer (in Acrobat Reader .pdf format). You can turn parts of
the output on or off. Even when one doesn't ask for extra items, most
printouts have more detail than the user requires. KKMN annotate the important
ones here. However, in the interest of completeness, I will go through them
all for this first example.

The INTERCEP shown in the first row of the output isn't really a "variable" in
the usual sense of that word. The program didn't actually follow formulae 5.4
and 5.5 to get the slope of 0.97 ... and intercept of 98.71 ... shown in the
last two rows of the table. PROC REG can handle multiple (k) X's
simultaneously, but in such cases the formulae for the various beta_hat's
cannot be written out explicitly in closed form. Instead the software uses
matrix methods, with the matrices in question having as many columns as there
are coefficients (k + 1). The first column is set to X0 = 1 for every
observation; then the regression equation can be written as

   E(Y|X1, X2, ..., Xk) = B0.X0 + B1.X1 + ... + Bk.Xk

In our example, then, we have Y = BP, X0 = 1, and X1 = Age. SAS labels the
"X0" as "INTERCEP" when giving descriptive statistics at the beginning, and it
labels "b0" as "INTERCEP" when giving the estimated coefficients.

Note that the fitted regression goes through the point X = AGE = 45.13 (XBAR),
Y = SBP = 142.53 (YBAR). To me, the data "start" in the MIDDLE of Fig 5-1.

From the Analysis of Variance table, concentrate first on the corrected total
sum of squares of 14787. This is nothing more than the sum of the squared
deviations of each Y from Ybar = 142 (I'm truncating some of the extra
decimals shown in the printout). The sum of the 30 such squared deviations
FROM (note) YBAR is 14787. Divide this by the usual 29 degrees of freedom
(only 29 of the 30 deviations are "independent") and you get 14787/29 = 509.9,
the S-squared(Y) in the descriptive statistics. We might prefer to think of
the 30 BP's as having an SD equal to the square root of 509.9, or 22.58.

At this stage, the only other statistic to note is the Mean Square for Error
of 299.7, and its square root (Root Mean Square Error, abbreviated to Root
MSE) of 17.3. This says that whereas the "global" variation in SBP in Fig 5-1
can be measured by an SD of 22.58, the "age-specific" variation in SBP is
17.3, i.e. about 23% less than the "non-age-specific" variation. Put another
way, this says that a little less than 77% of the gross SD remains
"unexplained". For technical statistical reasons, reductions in VARIANCE, and
the percent of VARIANCE that "remains", are more commonly used. So here, it
would be more usual to report that 76.5% x 76.5%, i.e. roughly 57%, of the
variance remains (the exact figure involves a small degrees-of-freedom
adjustment) and that 100% - 57% = 43% is "explained". Needless to say, the
reduction in variance looks bigger than the reduction in standard deviation.

But bear in mind how you are going to say this in the context of, say, income
or SES "explaining" a certain percentage of the variance in fertility: i.e.
the overall variance in fertility is maybe 0.67 square children per square
woman, and only 0.5 if we consider within-SES-group variation in fertility.
Some 25% of the variance is explained, but the reduction in the standard
deviation is only 1 - sqrt[0.75] = 13%. I did these calculations assuming half
the women were in one SES category, with the fractions having 0, 1 and 2
children being 1/4, 1/2 and 1/4, whereas in the other half of the women the
fractions having 1, 2 and 3 children were 1/4, 1/2 and 1/4. You might want to
check my arithmetic!!
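A short check, in Python, using the numbers quoted above from Output 5-1
(total SS = 14787 on 29 df, MSE = 299.77, n = 30), of how a modest-looking
reduction in SD corresponds to a bigger-looking reduction in variance. The
last line uses SSE = 28 x MSE, which reproduces the printout's R-squared; the
small difference from simply squaring the SD ratio is the degrees-of-freedom
adjustment (29 vs 28) mentioned above.

    import numpy as np

    total_SS, n, MSE = 14787.0, 30, 299.77
    var_overall = total_SS / (n - 1)              # 509.9; its square root is 22.58
    sd_overall, rmse = np.sqrt(var_overall), np.sqrt(MSE)

    print("overall SD =", round(sd_overall, 2), "; Root MSE =", round(rmse, 2))
    print("SD reduced by      ", round(100 * (1 - rmse / sd_overall), 1), "%")
    print("variance reduced by", round(100 * (1 - MSE / var_overall), 1), "%")
    print("R-squared = 1 - SSE/total SS =",
          round(100 * (1 - (n - 2) * MSE / total_SS), 1), "%")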
The estimates of the "intercept" and slope are 98.7 and 0.97. It goes without
saying (but I'll say it anyway!!) that it is not safe to call the 98.7 our
best estimate of the SBP of newborns. Nor, for that matter, should one say
from these data that "SBP increases as age increases". Mind you, the equation
fits well with the "100 plus your age" rule I heard once. When you add in the
Standard Errors (or, if you like, the Statistical Uncertainty) of the
estimates, the "100 plus your age" is quite a good round approximation to the
"(98.7 +/- 20) plus (0.97 +/- 0.42) times your age" one might report from the
regression analysis, using say +/- 2 standard errors for each coefficient. [In
fact, the slope and intercept estimates are not independent of each other: if
one is an overestimate, there is a greater than 50% chance that the other is
an underestimate -- but more on that later!]

You can think of the 98.7 as an estimated mean SBP for persons aged 0! Better
still, rewrite the equation as

   Ave(SBP | age) = 142.5 + 0.97 x (age - 45)

Clearly there is no point in testing the 142.5 against 0 (the null hypothesis
is false, as long as the subjects are alive!). Think of the 0.97 (or, better
still, 9.7) as the estimated difference in the average SBP of two populations
1 (10) year(s) apart in age. You can see this even better if in Fig 5-8 you
take ages 20 years apart, since there is a vertical distance of 20 mm (1 tick
mark) for this horizontal difference of 2 tick marks.

5-6 SSE & the estimator of the (common) X-specific variation of E
=================================================================
Note that the Greek sigma squared refers to the X-specific variance of the E's
(and NOT to the overall variance of the Y's). It makes sense, then, that this
sigma-squared is estimated using the deviations of each Y from its estimated
X-specific mean. It is the same logic as when, in course 607, we estimate a
"regular" variance using squared deviations from a single mean, or when, in
connection with a t-test on the means of two groups, we estimate a common
variance by pooling the within-group deviations. The difference here is that
EACH deviation is from a different mean, given by the fitted line. Since we
assume E has the same "amplitude" no matter what the X, we "collect" or "pool"
the deviations (residuals) from the line.

The SSE in the example -- about 8393, the sum of the 30 squared residuals (NOT
the 14787, which is the sum of squared deviations from YBAR) -- has only 28
independent components, so dividing this SSE (sum of squared errors) by 28
gives an "average" squared deviation (error) of 299.76. That is why the column
label is "Mean Square". Putting this label together with the row label
(Error), we get Mean Square Error (MSE), or average squared error. The
printout doesn't explicitly label the 299.76 as the MSE, but it does label the
square root of 299.76, namely 17.3, as the "Root MSE". Using the label MSE
would save some steps later: S-squared[Y|X] = MSE, and its square root
S[Y|X] = RMSE, are used extensively in the formulas for inferences concerning
the slope and the regression line (next four sections). A small worked sketch
of the bookkeeping follows.
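A sketch, in Python, re-using the five invented (age, SBP) pairs from the
earlier sketch: pool the squared residuals from the fitted line to get the
SSE, divide by n - 2 to get the MSE, and take its square root to get the Root
MSE (the estimate of sigma).

    import numpy as np

    x = np.array([25., 35., 45., 55., 65.])
    y = np.array([118., 130., 152., 148., 165.])
    n = x.size

    b1, b0 = np.polyfit(x, y, 1)
    residuals = y - (b0 + b1 * x)          # each Y minus ITS OWN fitted mean
    SSE = np.sum(residuals**2)
    MSE = SSE / (n - 2)                    # only n - 2 of the residuals are independent
    print("SSE =", round(SSE, 1), " MSE =", round(MSE, 1),
          " Root MSE =", round(np.sqrt(MSE), 2))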
Why, in this example, do we divide the SSE by n - 2 to get the (average or)
mean squared error, MSE? In 607, you learned to divide the sum of squared
deviations by n - 1 to get an unbiased estimate of sigma squared. With n = 1
observation, there is no opportunity to assess variance; with n = 2, you have
2 deviations, D1 = Y1 - (Y1 + Y2)/2 and D2 = Y2 - (Y1 + Y2)/2, BUT since D1
and D2 are mirror images of each other and add to zero, there is really only 1
INDEPENDENT assessment of variation. With n = 3, you have 2 independent
deviations, etc. But in those examples, each Y was an estimate of the SAME
mean mu, and sigma-squared was the average squared deviation about mu. Ybar is
our best estimate of this single mu, and Yi - YBAR provides an estimate of
sigma.

But now, with linear regression, each Y varies about a DIFFERENT mu. The
fitted line is our best estimate of the different mu's, and Yi - LINE provides
an estimate of sigma. When there is but one mu, it takes only one linear
combination of the Y's, i.e. (1/n)Y1 + (1/n)Y2 + ..., to estimate it. The
remaining n - 1 combinations of the Y's can be used to estimate sigma. When
there is a "line of mu's", it takes two combinations of the Y's to estimate
the line of mu's: one combination goes to estimating the slope, the other the
intercept. That leaves n - 2 independent pieces of information that can be
used to estimate the (common) sigma.

To make these ideas concrete, fill in the missing values in the following 6
situations:

situation 1
***********
    i    Y    mu_hat   E_hat   E_hat squared
   --   --   ------   ------   -------------
    1    3      5        -2           4
    2    ?      5         ?           ?
   # of INDEPENDENT E_hat's: 1      ESTIMATE of sigma-squared: ??

situation 2
***********
    i    Y    mu_hat   E_hat   E_hat squared
   --   --   ------   ------   -------------
    1    9      7         2           4
    2    4      7        -3           9
    3    ?      7         ?           ?
   # of INDEPENDENT E_hat's: 2      ESTIMATE of sigma-squared: ??

situation 3
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    3    5      5              0
    2    7   13     13              0
   NUMBER of INDEPENDENT E_hat's: 0
   FITTED LINE: mu_hat = -1 + 2X = 9 + 2(X - 5)

situation 4
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    2    9      7              2
    2    3    ?     10              ?
    3    5    ?     16              ?
   NUMBER of INDEPENDENT E_hat's: 1
   FITTED LINE: mu_hat = 1 + 3X

situation 5
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    1    3      2             +1
    2    2    4      6             -2
    3    3    ?     10              ?
    4    6    ?     22              ?
   NUMBER of INDEPENDENT E_hat's: 2
   FITTED LINE: mu_hat = -2 + 4X = 10 + 4(X - 3)

situation 6
***********
    i    X    Y    mu_hat   E_hat (Y - mu_hat)
   --   --   --   ------   ------------------
    1    0    1      2             -1
    2    2    ?      8              3
    3    4    ?     14              ?
    4    6    ?     20              ?
    5    7    ?     23              ?
   FITTED LINE: mu_hat = 2 + 3X
   (Don't spend TOO LONG on this one!)

5-7 Inferences re slope & intercept
===================================
I have already referred to the fact that the slope and intercept estimates are
linear combinations of the Y's. Thus, if the Y's have Gaussian variation, so
then will the parameter estimates. But even if the Y's are not Gaussian, the
parameter estimates, being linear combinations, will have closer-to-Gaussian
distributions, and for all practical purposes Gaussian distributions when n is
large (usually 30 or more; 50, or even 100 or more, if the distribution of the
E's is VERY highly skewed).

The denominators of equations 5.9 and 5.10 are the standard errors of the
slope and intercept estimates. In medical publications, in computer printouts,
and in some modern texts, they are referred to directly as SE's. The "S"
notation, e.g. S_subscript_beta1_hat, is very cumbersome.
Instead, one can write SE(slope estimate), etc. The n - 2 degrees of freedom
come from the fact that the MSE is calculated using n - 2 "independent"
residuals, and the square root of this (the RMSE) is substituted for sigma in
the formula for the standard deviation of the estimator. KKMN give the
(theoretical) standard deviation of the slope estimator as

   sigma / (S_X times the square root of n - 1)

Don't make a big deal of the n - 1 here! It is always better to think of
standard errors of statistics as having the square root of the sample size in
their denominators. See my 607 notes on the correlation / regression chapter
of Moore and McCabe (M&M Chapter 9) for a heuristic approach to understanding
the structure of the standard error of the slope [I write about factors that
affect the "reliability" of the slope].

5-8-2 The intercept
===================
I "second" the "In any case, the intercept (zero or not) is rarely of
interest". Given that, I urge you, whenever possible, to rewrite the
regression in the "centered" form

   mu_hat(Y|X) = Ybar + beta1_hat (X - Xbar)

5-9 Inferences concerning the line
==================================
I like the way the authors write the equation for the theoretical (but
unobservable) line,

   mu(Y|X) = beta0 + beta1 X

It is a pity that they don't continue in this vein and write equation 5-13 as

   mu_hat(Y|X) +/- "t times its SE"   (i.e. +/- t.SE)

5-10 A new value of Y at X0
===========================
The limits for this new Y (for an INDIVIDUAL) are often confused with the
confidence limits for mu(Y|X0). The book by Neter has good exercises which
help distinguish the two concepts [my best example is what to say to the judge
regarding the alcohol and eye movement data, or what to tell parents, on the
basis of some predictive model, as to when their infant might first start
sleeping through the night]. See the exercises on Chapter 5. The sketch at the
end of these notes contrasts the two intervals numerically.

In the problems (starting at p. 60), the examples and/or wording are not
always that compelling as to whether prediction for the individual, or
estimation of the mean of all individuals at that X value, is the more
appropriate task. For example, in problem I(f), p. 62, the object is the MEAN
response, so the wording would be better if it referred to the mean response
for 8-day-old chicks (it doesn't make much sense to ask about the MEAN
response for ONE chick!). Problem 6 (pp. 69-70) is a good example of how the
focus might well be on the individual, but the question posed is about the
MEAN duration of sleep in children of a certain age; after all, what good are
confidence limits on the mean when parents are trying to tell their child that
(s)he is "out of line"? There is the same tendency in the medical literature
to present confidence intervals for the mean Y at a given X when the focus is
on individual patients. Can you find some examples? (Hint: look at the
presentations of data on method-substitution studies, e.g. pulse oximetry
versus blood levels, or bilirubin by noninvasive versus invasive methods.)
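A final sketch, in Python, re-using the same five invented (age, SBP) pairs,
of the distinction in section 5-10: the confidence interval for the MEAN of Y
at X0 = 50 versus the (much wider) prediction interval for a single NEW
individual's Y at that same X0.

    import numpy as np
    from scipy.stats import t

    x = np.array([25., 35., 45., 55., 65.])
    y = np.array([118., 130., 152., 148., 165.])
    n, x0 = x.size, 50.0

    b1, b0 = np.polyfit(x, y, 1)
    rmse = np.sqrt(np.sum((y - (b0 + b1 * x))**2) / (n - 2))
    Sxx  = np.sum((x - x.mean())**2)
    tcrit = t.ppf(0.975, n - 2)                 # t multiplier with n - 2 df

    fit = b0 + b1 * x0
    se_mean = rmse * np.sqrt(1/n + (x0 - x.mean())**2 / Sxx)      # for mu(Y|X0)
    se_new  = rmse * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / Sxx)  # for a new individual
    print("95% CI for the mean at X0:",
          round(fit - tcrit*se_mean, 1), "to", round(fit + tcrit*se_mean, 1))
    print("95% prediction interval  :",
          round(fit - tcrit*se_new, 1), "to", round(fit + tcrit*se_new, 1))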