Notes on KKMN Chapter 10                                      1999.05.26

Preamble
========

This chapter is not essential. If you prefer to work with slopes rather
than correlation coefficients, it doesn't add much new. But what
correlations allow us to do -- and slopes do not -- is to use
dimensionless measures of association -- particularly important if one
has no familiarity with, or interest in, the scales on which Y and the
X's are measured. The idea of "partialling out" or removing variation
associated with other variables is the same whether one works in the
slope scale or the correlation scale.

In the biomedical literature, one doesn't see partial correlations
reported very often. The impression is that they are more common in the
psychology and social sciences literature.

10-1
====

Nothing new here. But don't bother too much with features 4 and 5.
Instead, think of the X's as our choice, not Nature's.

10-2
====

You can get this table in SAS by running PROC CORR, i.e.

  PROC CORR; VAR Y X1 ... Xk;

In INSIGHT, click the Y and X's into "Multivariate Y". In their
example, you would have to create X3 = X2_squared first.

Their choice of X3, as the square of X2, makes for some trouble, since
we would not usually think of having Age_squared without having Age. We
will discuss in a later chapter the special issues of powers and
products of existing variables, and how to stay out of trouble when
using them.

10-3 Multiple Correlation Coefficient R
=======================================

I prefer the first of the two versions of equation 10.1 (p. 163), where
R(Y | X1, ... Xk) is expressed as a straight correlation between the
observed and predicted (fitted) Y values. I am less comfortable with
percentages or fractions of variance.

I don't know why the authors do not refer to the "SSY - SSE" in the
second version as "SS Regression". Moreover, they could have -- just
like in simple linear regression -- shown R-squared as the ratio of two
average squared amplitudes, i.e. the average squared deviation of Yhat
from Ybar and the average squared deviation of Y from Ybar:

                 average {(Yhat - Ybar)^2}
  R-square  =    -------------------------  .
                 average {(Y - Ybar)^2}

In INSIGHT it is easy to check numerically that version 1 holds -- just
run a correlation between the Y's and the Yhat's. I suspect that they
use the second version to actually calculate it.
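One can do the same check in SAS itself. Here is a minimal sketch; the
data set name PATIENTS and the regressors X1 and X2 are placeholders of
my own, not from the text: fit the regression, save the fitted values,
then correlate them with Y.

  * Sketch: check version 1 of equation 10.1
    (hypothetical data set PATIENTS, variables Y X1 X2) ;
  PROC REG DATA=patients;
    MODEL Y = X1 X2;
    OUTPUT OUT=fitted P=Yhat;   * Yhat = predicted (fitted) Y ;
  RUN;

  PROC CORR DATA=fitted;
    VAR Y Yhat;                 * r(Y, Yhat) should equal R(Y | X1, X2) ;
  RUN;

The r(Y, Yhat) printed by PROC CORR should match the square root of the
R-square that PROC REG reports for the same model.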
10-4 Relationship to Multivariate Gaussian Distribution
=======================================================

I consider this version to be too restrictive. However, the equation
linking mu(Y | X1,X2) to mu(Y) and the other 2 terms also holds -- with
mu's replaced by sample averages -- in less restrictive situations.

Indeed, we can turn it around so as to get an equation for the
difference in average response (Y) in two groups (with the groups
defined by, say, X1 = 0 and X1 = 1), with or without adjustment for the
fact that these two groups differ with respect to a confounder X2. This
is using multiple regression to adjust for confounding. We will return
to this in Chapters 11 and 15.

10-5 Partial Correlation Coefficient
====================================

Whereas the MULTIPLE correlation is between Y and ALL of the X's in
question, the PARTIAL correlation is between Y and a single X, with one
or more other X's "removed", "controlled for", or "partialled out".
Think of the "zero order" correlations as Y with X, not controlling for
any other X's.

SAS output for Partial Correlations
===================================

Note that the two types of squared partial correlations (pp. 166,
bottom) are the analogs of the "variables added in order" and
"variables added last" sums of squares in the previous chapter. Note
the imaginative and highly self-explanatory labels "Type I" and "Type
III", just like "Type I SS" and "Type III SS", that statisticians are
good at creating (just like type I error and type II error!)

Again, I would have preferred it if SAS had been a bit friendlier and
output the partial R's rather than their squares. It seems somewhat
circular to have to go through the slope estimates in the regression
just to get the signs of the partial correlations.

Another way to get at the partial correlations directly is via PROC
CORR in SAS, viz.

  PROC CORR;
    VAR Y;
    WITH X1 X2 X3;
    PARTIAL X4;

would get you the partial correlations of Y with each of X1, X2 and X3,
in each case "controlling for" X4.

10-5-1 Tests for Partial Correlations
=====================================

No surprise that the tests are the same as for the corresponding
beta's! Or, if you prefer, PROC CORR above gives p-values directly.

10-5-2 Partial Correlation & Partial F test
===========================================

One shouldn't have any problems with the formula for the square of
rho(Y, X1 | X2) if one is comfortable with the "one X" case. If we only
had X1, then the formula for rho(Y, X1) is

                      sigma_sq(Y) - sigma_sq(Y | X1)
  rho_sq(Y, X1)  =    ------------------------------ .
                               sigma_sq(Y)

If now we "condition on" X2, which is the equivalent of looking at
rho_sq(Y, X1) within narrow "slices" of X2, then the formula goes
through directly. So, all one needs to do is condition each term above
on X2 as well:

                          sigma_sq(Y | X2) - sigma_sq(Y | X1, X2)
  rho_sq(Y, X1 | X2)  =   --------------------------------------- .
                                     sigma_sq(Y | X2)

Formula 10.2 is obtained by substituting sample statistics for the
population parameters. Note that the parameter version assumes a
"trivariate Gaussian" distribution of Y, X1 and X2.

10-5-3 Another formula for r(Y, X | Z)
======================================

Formula 10.3 is a useful version, because it can be computed directly
from the simple pairwise correlations between Y, X and Z. Moreover, as
the authors explain in part at the top of p. 169, one can tell what
happens if X is positively, negatively (or not at all) correlated with
Z.

I'm delighted to see the authors use a different letter Z for the
"control" or "confounding" variable. In epidemiology, we call r(Y, X)
the "crude" correlation between Y and X ("outcome" & "exposure");
r(Y, X | Z) is the "adjusted" correlation.

10-5-4 Partial Correlation as a correlation of two sets of residuals
====================================================================

This formulation is the same as the "multiple regression as a series of
simple regressions" concept, which I mentioned in my 607 notes. By the
way, I first learned about this in the excellent regression text by
Draper and Smith. (A small SAS sketch at the end of these notes shows
this correlation-of-residuals calculation.)

Table 10-3 (p 171)
==================

We use these diagrams a lot in teaching confounding in epidemiology.
Recall the analysis Galton did on the correlation between X1 = parents'
height and Y = the heights of their offspring. There was another
variable, Z = gender of the offspring. Galton handled it in an elegant
way. Where does it fit among the 4 cases in Table 10-3?

10-6 and 10-7
=============

Follow if you wish. One doesn't usually see these reported in the
biomedical literature.
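SAS sketch for 10-5-4
=====================

For those who want to verify the residuals formulation of 10-5-4
numerically, here is a minimal sketch, in the same spirit as the
PROC REG / PROC CORR check earlier. The data set name PATIENTS and the
variable names Y, X and Z are again my own placeholders: regress Y on Z
and X on Z, keep the two sets of residuals, and correlate them.

  * Sketch for 10-5-4 (hypothetical data set PATIENTS, variables Y X Z);
  PROC REG DATA=patients;
    MODEL Y = Z;
    OUTPUT OUT=resy R=e_y;      * e_y = residual of Y after removing Z ;
  RUN;

  PROC REG DATA=patients;
    MODEL X = Z;
    OUTPUT OUT=resx R=e_x;      * e_x = residual of X after removing Z ;
  RUN;

  DATA both;
    MERGE resy resx;            * one-to-one merge: same observations, same order ;
  RUN;

  PROC CORR DATA=both;
    VAR e_y e_x;                * r(e_y, e_x) = r(Y, X | Z) ;
  RUN;

The correlation printed here should agree with the r(Y, X | Z) obtained
from the PARTIAL statement in the PROC CORR run shown in section 10-5.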