Notes on KKMN Chapter 6                                      jh 1999.05.19

See also the article "Thirteen ways to look at the correlation
coefficient".

Preamble
========

You will notice that in my 607 notes on correlation and regression, I
put correlation first (as do M & M in their 3rd edition), since it is the
more "neutral" of the two. The correlation of X1 and X2 (or Y1 and Y2) is
the same as the correlation of X2 and X1 (or Y2 and Y1). Not so with
regression! It matters which is regressed on which. The book would have
done better to define r in terms of pairs of X's (or pairs of Y's) rather
than X - Y pairs.

Obviously, if the authors define the slope b (or beta_1) of Y on X in
chapter 5, then they have to express the equivalent formula for r in
terms of b, rather than what to me is the more natural "b in terms of r",
namely

   b = {S(Y) / S(X)} r

One way to keep this formula straight is to think of the units
(dimensions) each component is expressed in. The b (slope) on the
left-hand side is in terms of delta mu(Y) / delta X, or simply
Y units / X units. The units for S(Y) / S(X) are again Y units / X units,
since the SDs of Y and X are in Y units and X units respectively. r
itself has no units (it is a positive or negative fraction, between -1
and +1). Thus the units on the left side of the equation agree with those
on the right.

The fact that r is dimensionless means that the correlation between say
the daily temperature in Vancouver (V) and Montreal (M) is the same
whether one city's temperatures are measured in Fahrenheit and the
other's in Celsius, both are in Celsius, or both are in Fahrenheit. I
could have chosen correlations of heights (in cm or inches) and weights
(in lbs or kilograms) to make the same point, but I didn't want to use
two variables where one of them (weight) is more likely to be thought of
as a "Y" variable and the other (height) a more natural "X" variable.

If you have to choose between knowing just r-squared or r, choose r!
Otherwise, the direction of the correlation is lost in the square (just
like a z statistic for 2 proportions has more information than its
square, the chi-square statistic!).

6-2
===

It is more helpful to say "an individual with an ABOVE AVERAGE value on
one of the two variables is likely to be ABOVE AVERAGE on the other."
This way of thinking about it will help you correctly determine the
(approximate!) correlation in the following pairs of variables.

Technically speaking, independence of two random variables is a stronger
property than a lack of correlation: one can concoct examples where the
correlation is zero, yet there is a strong relation. Fig. 6-1(d) is a
good example. That's why it is a good idea to speak of a LINEAR relation
or association (or the absence thereof).

The Greek letter (that looks like the letter p in italics, sans serif)
for the parameter (in the population) is pronounced "rho".

2nd complete paragraph, p. 90
=============================

Psychologists have found that statistics students will give different
eyeball r's for the same data, depending on how the graph is set up (data
crowded into the middle, with lots of white space; data all the way to
the limits of the axes; frames around none, 2, or all 4 sides; whether
the frame is "landscape" or "portrait"; etc.). Sometimes, data pairs are
presented in time-series form -- e.g. the x-axis might be calendar time,
and the two data items for each month might be the price of a barrel of
oil at the well and the price of a litre of gasoline at the gas station.
In this display, it is even more difficult to judge the correlation. See
the helpful article (on the class web page) by Chatillon on a way to
estimate r by eye more objectively and fairly reliably.
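As a numerical footnote to the Preamble: here is a minimal sketch (in
Python with numpy; the temperature figures are invented purely for
illustration) of the claim that r is dimensionless, while the slope b
carries units and rescales exactly as b = {S(Y) / S(X)} r says it must.

   import numpy as np

   rng = np.random.default_rng(0)
   v = rng.normal(15, 8, size=200)              # Vancouver, in Celsius
   m_c = 0.8 * v + rng.normal(0, 4, size=200)   # Montreal, in Celsius
   m_f = m_c * 9 / 5 + 32                       # same data, in Fahrenheit

   # r is dimensionless: identical whichever units Montreal is recorded in
   print(np.corrcoef(v, m_c)[0, 1], np.corrcoef(v, m_f)[0, 1])

   # the slope of M on V changes by the factor 9/5 with the unit change;
   # in each case it equals r * S(Y)/S(X)
   for y in (m_c, m_f):
       b = np.cov(v, y)[0, 1] / np.var(v, ddof=1)
       r = np.corrcoef(v, y)[0, 1]
       print(b, r * y.std(ddof=1) / v.std(ddof=1))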
3rd and 4th paragraphs
======================

The diagram in my 607 notes explains this with positive products in the
++ and -- quadrants, and negative products in the +- and -+ quadrants.
[The dividers for the quadrants may have drifted in the translation from
a Mac Word to a MS Word 6 document!]

The correlation discussed in this chapter is Pearson's (product-moment)
correlation. There is also, in non-parametric statistics, Spearman's rank
correlation, which is obtained by calculating the Pearson correlation on
the pairs of ranks rather than the pairs of raw data. It is invariant to
monotone transforms -- for example, the Spearman correlation for the data
in Figure a on page 61 is 1, whereas the Pearson correlation is
sqrt[0.7442] = 0.86. In Figure b, the Spearman and Pearson correlations
are 1 and sqrt[0.9983] = 0.99 respectively.

6-3
===

Figure 6-3 looks like the roof of Montreal's Olympic Stadium. It might be
better to think of it concretely in the context of say the numbers (or
relative numbers) of persons with a certain value of X = cholesterol and
Y = blood pressure. Think of the 2-D histogram as like high-rise towers
(representing the frequencies) sitting on the different "blocks", where
the "north-south" address is a cholesterol category, and the "east-west"
address is a blood pressure category. The "tallest" blocks (the most
populated cholesterol-bp categories) would be in the "downtown", with the
shorter high-rise buildings (with fewer persons in these categories) in
the "suburbs".

As the authors say, this coverage of the bivariate normal distribution is
not central to regression, but is simply another justification (if we
needed one) for the least squares estimator of the regression line.

Equation 6.3 is the theoretical line, in Greek. The equation two down
from it [with hats, and with Ybar and Xbar instead of mu(Y) and mu(X)] is
the estimator. Equation 6-4, and its sample or empirical counterpart (the
first equation on p. 93), are usually carried over to regression where
the X's do not have a natural Gaussian distribution.

It is instructive to rewrite equation 6.3 in terms of its implication for
INDIVIDUAL (X,Y) pairs, rather than the MEANS. For a given X, consider
the deviation of Y|X from mu(Y), i.e. Y|X - mu(Y).

   mu(Y|X) - mu(Y) = rho [SD(Y)/SD(X)] [X - mu(X)]

i.e.

   mu(Y|X) - mu(Y)          X - mu(X)
   ---------------  =  rho  ---------
        SD(Y)                 SD(X)

Think of the numerator on the left-hand side as how far the X-specific
mean mu(Y|X) is above the general or overall mean mu(Y). This is the
average distance that the Y values of persons with that specific X value
will be above the general or overall Y mean. Think of the denominator on
the left-hand side as a scaling factor, turning the average deviation
into an average Z score for these values. The [X - mu(X)] / SD(X) on the
right-hand side is the corresponding Z score for that particular value of
X. So the equation can be written as

   average (Z score for Y|X) = rho . (Z score for X)

Note how succinctly Galton paraphrased this equation when, in reference
to fig 2 on the web page, he described the rho of 0.67 or so as "The
Deviates of the Children are to those of their Mid-Parents as 2 to 3".

Note that this relationship ONLY HOLDS if the (X, Y) data have the
Gaussian distribution shown in Fig. 6.3.
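A quick simulation sketch of this "Z score" version of equation 6.3 (in
Python with numpy; Galton's rho of roughly 2/3 is borrowed for the
illustration): among persons at a given Z score for X, the average Z
score for Y should be rho times that value.

   import numpy as np

   rng = np.random.default_rng(0)
   rho, n = 2 / 3, 200_000
   x = rng.normal(size=n)
   # standard bivariate normal with correlation rho
   y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

   # average Z score for Y, among those near a given Z score for X
   for x0 in (-2.0, -1.0, 0.0, 1.0, 2.0):
       near = np.abs(x - x0) < 0.05
       print(x0, round(y[near].mean(), 2), round(rho * x0, 2))

The "Deviates of the Children" line is exactly this: their average Z
score is about two-thirds of their Mid-Parents' Z score.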
This Gaussian situation doesn't apply very often: first, even if we take
a "naturalistic" (cross-sectional) sample, so that the underlying
distribution of the (X, Y) data is not distorted, this distribution may
not be bivariate Gaussian. Second, EVEN IF the underlying distribution IS
bivariate Gaussian, we may have sampled on the X's in such a way as to
over- or under-represent certain X values, so that the (X,Y) distribution
in the sample may look quite different from its "parent".

A good example of this might be the Framingham study, with Y = blood
pressure and X = cholesterol. It might be that, within a narrow age
range, the (X,Y) values are bivariate Gaussian [or reasonably close to
that], in both the source population and in the random sample selected.
BUT IF the authors had been interested in just this (X, Y) relationship
[they weren't!], they could have been more efficient and taken EQUAL-size
samples from each X = cholesterol category, so as to have a statistically
less noisy estimate of the slope of Y on X. Now, the distribution of
(X,Y) data in the sample is ARTIFICIAL ("man-made"), and so equations 6.4
and 6.6 would no longer match up. Nor would population equation 6.3 match
the sample equation two below it -- since the sampling "distorts" S(X)
and -- consequently -- S(Y).

6-4
===

Equation 6.6 is the sample (empirical) analog of equation 6.3
(population, or parameter).

6-5 and Figure 6-5
==================

Misconception number 1 is indeed, in my experience too, quite common.
Imagine another extreme situation, where say Y = annual salary, which
increased by say 1% per year for the years X = 1990 to 1999, i.e.

   X   1990   1991   1992    .....   1999
   Y   100    101    102.01  .....   109.4

The "best" (least squares) straight-line fit to these data is the line

   Yhat = 99.94 + 1.04 (years since 1990)

with an r-squared of greater than 0.99 [r-squared isn't 1.00 because the
Y's follow a slightly curvilinear upward pattern]. However, most people
would NOT consider a slope of 1% ("compound" increase) or 1.04% ("simple"
increase) a LARGE increase! What makes the r-squared so large here is the
very tight (and effectively linear) pattern over the 9 years, i.e. the
residuals from the fitted line are very small, relative to the 9.4 point
increase in salary over the 9 years.

Note also that ONE CAN MAKE THE SLOPE LOOK BIG just by changing the
scale. For example, changing the X scale to 19.90 to 19.99 would change
the slope from 1.04 to 104!! And changing the Y axis from a base of 100
to say a base of $50,000, but leaving X = 1990 to 1999, would change the
slope to 520. But remember that slopes have DIMENSIONS. You can think of
this as simply horizontally or vertically stretching the rectangle
containing the graph -- it won't change the correlation, but it will
change the physical slope.

(This next point may equally belong back in section 6-4.)

r is range-dependent!!
======================

To appreciate this, consider the correlation of weight and height in say
just 4-year-olds, or in say 3-7 year-olds combined. Is the correlation
higher in the 4-year-olds alone, or in the 3-7 year-olds combined, or the
same in both? See the graph on the www page for the answer, and the
simulation sketch below. See another example under "Correlations -
obscured and artefactual" in my Notes on M&M Chapters 2 and 9.
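Here is a minimal simulation sketch of the range-dependence point (in
Python with numpy; the height and weight numbers are invented, with
weight tracking height by the same relation at every age): restricting
attention to one age narrows the height range, and r shrinks with it.

   import numpy as np

   rng = np.random.default_rng(0)
   n = 50_000
   age = rng.uniform(3, 8, size=n)                    # ages 3 to 7-ish
   height = 85 + 7 * age + rng.normal(0, 4, size=n)   # cm, grows with age
   weight = -20 + 0.3 * height + rng.normal(0, 2, size=n)  # kg

   r_all = np.corrcoef(height, weight)[0, 1]          # ages 3-7 combined
   four = np.abs(age - 4.5) < 0.5                     # roughly the 4-year-olds
   r_four = np.corrcoef(height[four], weight[four])[0, 1]
   print(r_all, r_four)   # same underlying relation, but combined r is larger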
Rose, a British epidemiologist who was eminent in cardiovascular disease
research, used to show a graph of the relationship between Y =
cardiovascular mortality (measured at a community level) and X = the
hardness of the community drinking water, in a large number of towns in
the U.K. If he restricted the analysis to English towns, where the range
of water hardness was limited, the X - Y correlation was slight; but if
he included all of the towns in both Scotland and England [with a now
much bigger range in water quality .... as Scotch drinkers know!], the
correlation was much increased. See his graph under chapter 6 on the web
site. The message is that a "signal" (difference in Y's) cannot be seen
over a limited X range.

A word of caution: although this example nicely makes the point that a
prerequisite for the study of an X - Y relationship is a decent amount of
X variation, one should be careful. The gradient in mortality may have
much more to do with other "intakes", such as the scotch (or the ale), or
dietary fat! There is a strong gradient of CHD mortality from North to
South in Europe. France is a paradoxical exception (outlier).

6-5 point 2 (Fig. 6-6)
======================

Again, this is a KEY point. It also emphasizes that it is very dangerous
to judge fitted correlations or slopes strictly on numerical results. ONE
MUST actually LOOK AT THE DATA --- and with graphs so easy to make
nowadays, there is no excuse for not doing so.

6-6
===

Since the authors wrote the 1st edition of this text, the preoccupation
with statistical tests has given way, in part, to a focus on confidence
intervals. Moreover, of the many silly statistical tests carried out, the
one in 6-6-1 (testing that the underlying correlation is zero) is
probably one of the sillier ones. Often, that the correlation is nonzero
is not in doubt; rather, the issue is quantifying the magnitude of the
underlying correlation.

It is interesting to watch investigators as they scan pages of printouts
giving correlations for all pairs of variables in their study. They
frequently become overjoyed at the large correlations, only to realize
that they have misread the printout -- the correlations are usually shown
in one row, and the associated p-values in the row underneath! If the
sample size is small, and the correlations not that high, the p-values
may well be larger than the correlations themselves -- to the point that
the "hoping for significance" investigator mistakes the high p-value
(e.g. 0.65) for the correlation. And if sample sizes are very large,
p-values will be extreme (very small) even if the correlations are
modest. In these situations, I have many times seen the investigator
become saddened at the sight of so many low "correlations" --- when what
(s)he was in fact looking at was the row of p-values!

6-6-1 vs 6-6-2
==============

Notice the different forms of the test statistic in the null and non-null
situations. The transformation in the non-null case is needed for the
same reasons we sometimes transform proportions or use exact methods ---
the range of r is restricted (-1 to +1), so if rho is high (say 0.85) and
n fairly small (say 15 or 20), the distribution of all possible values of
r is bounded above by 1, whereas one will see a lot of r-values far below
0.70 -- i.e. the distribution is skewed. See the nomogram, showing rho on
the vertical axis and r on the horizontal: my notes on
correlation/regression from 607 describe its use. The Z transformation is
a way of working on a scale where the possible "transformed r" values
have a closer-to-Gaussian distribution, with a variance (SD) that does
not depend on where along the rho axis one is.
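A simulation sketch of both points (in Python with numpy; the rho = 0.85,
n = 15 scenario is the one quoted above, and the CI recipe at the end is
the standard Fisher-z one, of the kind discussed in 6-6-3 below):

   import numpy as np

   rng = np.random.default_rng(0)
   rho, n, reps = 0.85, 15, 20_000
   cov = [[1.0, rho], [rho, 1.0]]

   r = np.empty(reps)
   for i in range(reps):
       x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
       r[i] = np.corrcoef(x, y)[0, 1]

   # skewed: a long left tail, but squeezed against the bound at +1
   print(np.percentile(r, [2.5, 50, 97.5]))

   # Fisher's z = atanh(r) is close to Gaussian, with SD ~ 1/sqrt(n - 3)
   z = np.arctanh(r)
   print(z.std(ddof=1), 1 / np.sqrt(n - 3))

   # 95% CI for rho from a single observed r: build it on the z scale,
   # then back-transform with tanh
   r_obs = 0.85
   lo, hi = np.arctanh(r_obs) + np.array([-1.96, 1.96]) / np.sqrt(n - 3)
   print(np.tanh(lo), np.tanh(hi))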
6-6-3 CI for rho
================

Notice the (CORRECT!) wording: CI for a PARAMETER! You might want to look
at the nomogram and my 607 notes. These are also helpful if one wants to
know, if one calculates a correlation from n pairs, how precise the
estimate of (i.e. how narrow the CI for) rho will be. I have not figured
out how to use SAS to calculate the confidence intervals of section 6-6;
the sketch above shows one way to compute them directly.

Problems: Q 14 -- IMPORTANT note re "Method Comparisons"
========================================================

This example is used to practice the material in section 6-7-1. However,
it should not be taken as an endorsement of correlation as a way to
quantify the performance of a proposed replacement ("easier") medical
test for a more complex, more painful, or more costly "gold standard"
reference test. The landmark article by Altman and Bland (Lancet, 1986)
explains what is incorrect about using the correlation coefficient in
this circumstance, and offers a much more informative way to present the
data graphically, and to calculate a simple numerical summary measuring
the "accuracy" of the new method relative to the reference standard, or
of one (imperfect) measurement instrument with another.

------------------------------------------------------
example of what NOT to do... (letter to Am J Epidemiol)
------------------------------------------------------

RE: "EVALUATION OF TWO FOOD FREQUENCY METHODS OF MEASURING DIETARY
CALCIUM INTAKE"

Cummings et al. (1) have recently compared the values from two food
frequency methods of estimating dietary calcium intake with the values
derived from seven-day food records. They based most of their inferences
on correlation coefficients (r), the highest of which was 0.76. Although
the authors were somewhat guarded in their conclusions, they nevertheless
suggested that the food frequency instrument could be clinically useful.

As recently pointed out by Duffy (2), the use of correlation for
comparing methods of measurement is based on the misconception that the
correlation coefficient is a measure of agreement. It is in fact only a
measure of linear association and gives no direct information about
agreement (3). The simplest way of assessing agreement is by considering
the mean and standard deviation of within-subject differences between the
two methods, combined with simple graphic display (2, 3).

The (mis)use of correlation for comparing methods of measurement is rife
in the medical literature. Duffy (2) observed that the inappropriateness
of correlation for comparison of methods "should be borne in mind by
authors, editors, and referees in the future," an exhortation that must
be reiterated.

1. Cummings SR, Block G, McHenry K, et al. Evaluation of two food
   frequency methods of measuring dietary calcium intake. Am J Epidemiol
   1987;126:796-802.
2. Duffy SW. Re: "Seven-day activity and self-report compared to direct
   measure of physical activity" (Letter). Am J Epidemiol 1986;123:557.
3. Bland JM, Altman DG. Statistical methods for measuring agreement
   between two methods of clinical measurement. Lancet 1986;1:307-10.

Douglas G. Altman
Medical Statistics Laboratory, Imperial Cancer Research Fund
PO Box 123, Lincoln's Inn Fields, London WC2A 3PX, England.

-------------------------------------------------------------

A scanned copy of this paper is available elsewhere on this webpage,
under the title "Bland & Altman".
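For concreteness, here is a minimal sketch of the summary the letter
recommends (in Python with numpy; the paired calcium-intake data are
invented): the mean and SD of the within-subject differences, and the
"limits of agreement" built from them.

   import numpy as np

   rng = np.random.default_rng(0)
   record = rng.normal(800, 200, size=50)        # seven-day record, mg/day
   freq = record + rng.normal(30, 90, size=50)   # food-frequency estimate

   diff = freq - record                          # within-subject differences
   mean_d, sd_d = diff.mean(), diff.std(ddof=1)
   print(mean_d, sd_d)                           # bias, and its spread
   print(mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d)  # ~95% limits of agreement

   # the graphic display: plot diff against the pair average,
   # (freq + record) / 2, with horizontal lines at the limits

Note that r for such data can be sizeable even when the limits of
agreement are far too wide for the two methods to be used
interchangeably.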
Q 14: Creating higher correlation by enlarging the range of one or both
variables
========================================================================

This is a good example of how one can "enhance" a correlation by
amalgamating the data from 2 subgroups. The (finger, venipuncture)
hemoglobin values for men might form an "ellipse" centered on (?,?),
while those for women might center on a different (?,?). The values for
both genders combined would then form a more elongated ellipse, and thus
yield a higher correlation coefficient. You would get the same phenomenon
using the (height, weight) values for men and women separately, and for
the combined genders, as the sketch below shows.
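A minimal simulation sketch of this amalgamation effect (in Python with
numpy; the height and weight distributions for the two genders are
invented): within each gender r is modest; combining the two elongates
the ellipse and raises r.

   import numpy as np

   rng = np.random.default_rng(0)

   def group(mean_h, mean_w, n=1000):
       # invented (height in cm, weight in kg) pairs for one gender
       h = rng.normal(mean_h, 6, size=n)
       w = 0.5 * (h - mean_h) + rng.normal(mean_w, 5, size=n)
       return h, w

   h_f, w_f = group(163, 60)    # women
   h_m, w_m = group(177, 78)    # men

   print(np.corrcoef(h_f, w_f)[0, 1])   # within women: ~0.5
   print(np.corrcoef(h_m, w_m)[0, 1])   # within men:   ~0.5
   print(np.corrcoef(np.r_[h_f, h_m], np.r_[w_f, w_m])[0, 1])  # combined: ~0.8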