Notes on KKMN Chapter 6                                      jh 1999.05.19

See also the article "Thirteen ways to look at the correlation
coefficient".

Preamble
========

You will notice that in my 607 notes on correlation and regression, I
put correlation first (as do M & M in their 3rd edition), since it is the
more "neutral" of the two. The correlation of X1 and X2 (or Y1 and Y2) is
the same as the correlation of X2 and X1 (or Y2 and Y1). Not so with
regression! It matters which is regressed on which. The book would have
done better to define r in terms of pairs of X's (or pairs of Y's) rather
than X - Y pairs.

Obviously, if the authors define the slope b (or beta_1) of Y on X in
chapter 5, then they have to express the equivalent formula for r in
terms of b, rather than what to me is the more natural "b in terms of r",
namely

   b = {S(Y) / S(X)} r

One way to keep this formula straight is to think of the units
(dimensions) each component is expressed in. The b (slope) on the
left-hand side is in terms of delta mu(Y) / delta X, or simply
Y units / X units. The units for S(Y) / S(X) are again Y units / X units,
since the SDs of Y and X are in Y units and X units respectively. r
itself has no units (it is a positive or negative fraction, between -1
and +1). Thus the units on the left side of the equation agree with those
on the right.

The fact that r is dimensionless means that the correlation between say
the daily temperature in Vancouver (V) and Montreal (M) is the same
whether one city's temperatures are measured in Fahrenheit and the
other's in Celsius, both are in Celsius, or both are in Fahrenheit. I
could have chosen correlations of heights (in cm or inches) and weights
(in lbs or kilograms) to make the same point, but I didn't want to use
two variables where one of them (weight) is more likely to be thought of
as a "Y" variable and the other (height) a more natural "X" variable.

If you have to choose between knowing just r-squared or r, choose r!
Otherwise, the direction of the correlation is lost in the square (just
like a z statistic for 2 proportions has more information than its
square, the chi-square statistic!).

6-2
===

It is more helpful to say "an individual with an ABOVE AVERAGE value on
one of the two variables is likely to be ABOVE AVERAGE on the other."
This way of thinking about it will help you correctly determine the
(approximate!) correlation in the following pairs of variables.

Technically speaking, independence of two random variables is a stronger
property than a lack of correlation: one can concoct examples where the
correlation is zero, yet there is a strong relation. Fig. 6-1(d) is a
good example. That's why it is a good idea to speak of a LINEAR relation
or association (or the absence thereof).

The Greek letter (that looks like the letter p in italics, sans serif)
for the parameter (in the population) is pronounced "rho".

2nd complete paragraph, p. 90
=============================

Psychologists have found that statistics students will give different
eyeball r's for the same data, depending on how the graph is set up (data
crowded into the middle, with lots of white space; data all the way to
the limits of the axes; frames around none, 2, or all 4 sides; whether
the frame is "landscape" or "portrait"; etc.). Sometimes, data pairs are
presented in time-series form -- e.g. the x-axis might be calendar time,
and the two data items for each month might be the price of a barrel of
oil at the well and the price of a litre of gasoline at the gas station.
In this display, it is even more difficult to judge the correlation. See
the helpful article (on the class web page) by Chatillon on a way to
estimate r by eye more objectively and fairly reliably.
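As a numerical footnote to the Preamble: here is a minimal sketch (in
Python with numpy; the temperature figures are invented purely for
illustration) of the claim that r is dimensionless, while the slope b
carries units and rescales exactly as b = {S(Y) / S(X)} r says it must.

   import numpy as np

   rng = np.random.default_rng(0)
   v = rng.normal(15, 8, size=200)              # Vancouver, in Celsius
   m_c = 0.8 * v + rng.normal(0, 4, size=200)   # Montreal, in Celsius
   m_f = m_c * 9 / 5 + 32                       # same data, in Fahrenheit

   # r is dimensionless: identical whichever units Montreal is recorded in
   print(np.corrcoef(v, m_c)[0, 1], np.corrcoef(v, m_f)[0, 1])

   # the slope of M on V changes by the factor 9/5 with the unit change;
   # in each case it equals r * S(Y)/S(X)
   for y in (m_c, m_f):
       b = np.cov(v, y)[0, 1] / np.var(v, ddof=1)
       r = np.corrcoef(v, y)[0, 1]
       print(b, r * y.std(ddof=1) / v.std(ddof=1))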
3rd and 4th paragraphs
======================

The diagram in my 607 notes explains this with positive products in the
++ and -- quadrants, and negative products in the +- and -+ quadrants.
[The dividers for the quadrants may have drifted in the translation from
a Mac Word to a MS Word 6 document!]

The correlation discussed in this chapter is Pearson's (product-moment)
correlation. There is also, in non-parametric statistics, Spearman's rank
correlation, which is obtained by calculating the Pearson correlation on
the pairs of ranks rather than the pairs of raw data. It is invariant to
monotone transforms -- for example, the Spearman correlation for the data
in Figure a on page 61 is 1, whereas the Pearson correlation is
sqrt[0.7442] = 0.86. In Figure b, the Spearman and Pearson correlations
are 1 and sqrt[0.9983] = 0.99 respectively.

6-3
===

Figure 6-3 looks like the roof of Montreal's Olympic Stadium. It might be
better to think of it concretely in the context of say the numbers (or
relative numbers) of persons with a certain value of X = cholesterol and
Y = blood pressure. Think of the 2-D histogram as like high-rise towers
(representing the frequencies) sitting on the different "blocks", where
the "north-south" address is a cholesterol category, and the "east-west"
address is a blood pressure category. The "tallest" blocks (the most
populated cholesterol-bp categories) would be in the "downtown", with the
shorter high-rise buildings (with fewer persons in these categories) in
the "suburbs".

As the authors say, this coverage of the bivariate normal distribution is
not central to regression, but is simply another justification (if we
needed one) for the least squares estimator of the regression line.

Equation 6.3 is the theoretical line, in Greek. The equation two down
from it [with hats, and with Ybar and Xbar instead of mu(Y) and mu(X)] is
the estimator. Equation 6-4, and its sample or empirical counterpart (the
first equation on p. 93), are usually carried over to regression where
the X's do not have a natural Gaussian distribution.

It is instructive to rewrite equation 6.3 in terms of its implication for
INDIVIDUAL (X,Y) pairs, rather than the MEANS. For a given X, consider
the deviation of Y|X from mu(Y), i.e. Y|X - mu(Y).

   mu(Y|X) - mu(Y) = rho [SD(Y)/SD(X)] [X - mu(X)]

i.e.

   mu(Y|X) - mu(Y)          X - mu(X)
   ---------------  =  rho  ---------
        SD(Y)                 SD(X)

Think of the numerator on the left-hand side as how far the X-specific
mean mu(Y|X) is above the general or overall mean mu(Y). This is the
average distance that the Y values of persons with that specific X value
will be above the general or overall Y mean. Think of the denominator on
the left-hand side as a scaling factor, turning the average deviation
into an average Z score for these values. The [X - mu(X)] / SD(X) on the
right-hand side is the corresponding Z score for that particular value of
X. So the equation can be written as

   average (Z score for Y|X) = rho . (Z score for X)

Note how succinctly Galton paraphrased this equation when, in reference
to fig 2 on the web page, he described the rho of 0.67 or so as "The
Deviates of the Children are to those of their Mid-Parents as 2 to 3".

Note that this relationship ONLY HOLDS if the (X, Y) data have the
Gaussian distribution shown in Fig. 6.3.
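A quick simulation sketch of this "Z score" version of equation 6.3 (in
Python with numpy; Galton's rho of roughly 2/3 is borrowed for the
illustration): among persons at a given Z score for X, the average Z
score for Y should be rho times that value.

   import numpy as np

   rng = np.random.default_rng(0)
   rho, n = 2 / 3, 200_000
   x = rng.normal(size=n)
   # standard bivariate normal with correlation rho
   y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)

   # average Z score for Y, among those near a given Z score for X
   for x0 in (-2.0, -1.0, 0.0, 1.0, 2.0):
       near = np.abs(x - x0) < 0.05
       print(x0, round(y[near].mean(), 2), round(rho * x0, 2))

The "Deviates of the Children" line is exactly this: their average Z
score is about two-thirds of their Mid-Parents' Z score.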
This Gaussian situation doesn't apply very often: first, even if we take
a "naturalistic" (cross-sectional) sample, so that the underlying
distribution of the (X, Y) data is not distorted, this distribution may
not be bivariate Gaussian. Second, EVEN IF the underlying distribution IS
bivariate Gaussian, we may have sampled on the X's in such a way as to
over- or under-represent certain X values, so that the (X,Y) distribution
in the sample may look quite different from its "parent".

A good example of this might be the Framingham study, with Y = blood
pressure and X = cholesterol. It might be that, within a narrow age
range, the (X,Y) values are bivariate Gaussian [or reasonably close to
that], in both the source population and in the random sample selected.
BUT IF the authors had been interested in just this (X, Y) relationship
[they weren't!], they could have been more efficient and taken EQUAL-size
samples from each X = cholesterol category, so as to have a statistically
less noisy estimate of the slope of Y on X. Now, the distribution of
(X,Y) data in the sample is ARTIFICIAL ("man-made"), and so equations 6.4
and 6.6 would no longer match up. Nor would population equation 6.3 match
the sample equation two below it -- since the sampling "distorts" S(X)
and -- consequently -- S(Y).

6-4
===

Equation 6.6 is the sample (empirical) analog of equation 6.3
(population, or parameter).

6-5 and Figure 6-5
==================

Misconception number 1 is indeed, in my experience too, quite common.
Imagine another extreme situation, where say Y = annual salary, which
increased by say 1% per year for the years X = 1990 to 1999, i.e.

   X   1990   1991   1992    .....   1999
   Y   100    101    102.01  .....   109.4

The "best" (least squares) straight-line fit to these data is the line

   Yhat = 99.94 + 1.04 (years since 1990)

with an r-squared of greater than 0.99 [r-squared isn't 1.00 because the
Y's follow a slightly curvilinear upward pattern]. However, most people
would NOT consider a slope of 1% ("compound" increase) or 1.04% ("simple"
increase) a LARGE increase! What makes the r-squared so large here is the
very tight (and effectively linear) pattern over the 9 years, i.e. the
residuals from the fitted line are very small, relative to the 9.4 point
increase in salary over the 9 years.

Note also that ONE CAN MAKE THE SLOPE LOOK BIG just by changing the
scale. For example, changing the X scale to 19.90 to 19.99 would change
the slope from 1.04 to 104!! And changing the Y axis from a base of 100
to say a base of $50,000, but leaving X = 1990 to 1999, would change the
slope to 520. But remember that slopes have DIMENSIONS. You can think of
this as simply horizontally or vertically stretching the rectangle
containing the graph -- it won't change the correlation, but it will
change the physical slope.

(This next point may equally belong back in section 6-4.)

r is range-dependent!!
======================

To appreciate this, consider the correlation of weight and height in say
just 4-year-olds, or in say 3-7 year-olds combined. Is the correlation
higher in the 4-year-olds alone, or in the 3-7 year-olds combined, or the
same in both? See the graph on the www page for the answer, and the
simulation sketch below. See another example under "Correlations -
obscured and artefactual" in my Notes on M&M Chapters 2 and 9.
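Here is a minimal simulation sketch of the range-dependence point (in
Python with numpy; the height and weight numbers are invented, with
weight tracking height by the same relation at every age): restricting
attention to one age narrows the height range, and r shrinks with it.

   import numpy as np

   rng = np.random.default_rng(0)
   n = 50_000
   age = rng.uniform(3, 8, size=n)                    # ages 3 to 7-ish
   height = 85 + 7 * age + rng.normal(0, 4, size=n)   # cm, grows with age
   weight = -20 + 0.3 * height + rng.normal(0, 2, size=n)  # kg

   r_all = np.corrcoef(height, weight)[0, 1]          # ages 3-7 combined
   four = np.abs(age - 4.5) < 0.5                     # roughly the 4-year-olds
   r_four = np.corrcoef(height[four], weight[four])[0, 1]
   print(r_all, r_four)   # same underlying relation, but combined r is larger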
Rose, a British epidemiologist who was eminent in cardiovascular disease
research, used to show a graph of the relationship between Y =
cardiovascular mortality (measured at a community level) and X = the
hardness of the community drinking water, in a large number of towns in
the U.K. If he restricted the analysis to English towns, where the range
of water hardness was limited, the X - Y correlation was slight; but if
he included all of the towns in both Scotland and England [with a now
much bigger range in water quality .... as Scotch drinkers know!], the
correlation was much increased. See his graph under chapter 6 on the web
site. The message is that a "signal" (difference in Y's) cannot be seen
over a limited X range.

A word of caution: although this example nicely makes the point that a
prerequisite for the study of an X - Y relationship is a decent amount of
X variation, one should be careful. The gradient in mortality may have
much more to do with other "intakes", such as the scotch (or the ale), or
dietary fat! There is a strong gradient of CHD mortality from North to
South in Europe. France is a paradoxical exception (outlier).

6-5 point 2 (Fig. 6-6)
======================

Again, this is a KEY point. It also emphasizes that it is very dangerous
to judge fitted correlations or slopes strictly on numerical results. ONE
MUST actually LOOK AT THE DATA --- and with graphs so easy to make
nowadays, there is no excuse for not doing so.

6-6
===

Since the authors wrote the 1st edition of this text, the preoccupation
with statistical tests has given way, in part, to a focus on confidence
intervals. Moreover, of the many silly statistical tests carried out, the
one in 6-6-1 (testing that the underlying correlation is zero) is
probably one of the sillier ones. Often, that the correlation is nonzero
is not in doubt; rather, the issue is quantifying the magnitude of the
underlying correlation.

It is interesting to watch investigators as they scan pages of printouts
giving correlations for all pairs of variables in their study. They
frequently become overjoyed at the large correlations, only to realize
that they have misread the printout -- the correlations are usually shown
in one row, and the associated p-values in the row underneath! If the
sample size is small, and the correlations not that high, the p-values
may well be larger than the correlations themselves -- to the point that
the "hoping for significance" investigator mistakes the high p-value
(e.g. 0.65) for the correlation. And if sample sizes are very large,
p-values will be extreme (very small) even if the correlations are
modest. In these situations, I have many times seen the investigator
become saddened at the sight of so many low "correlations" --- when what
(s)he was in fact looking at was the row of p-values!

6-6-1 vs 6-6-2
==============

Notice the different forms of the test statistic in the null and non-null
situations. The transformation in the non-null case is needed for the
same reasons we sometimes transform proportions or use exact methods ---
the range of r is restricted (-1 to +1), so if rho is high (say 0.85) and
n fairly small (say 15 or 20), the distribution of all possible values of
r is bounded above by 1, whereas one will see a lot of r-values far below
0.70 -- i.e. the distribution is skewed. See the nomogram, showing rho on
the vertical axis and r on the horizontal: my notes on
correlation/regression from 607 describe its use. The Z transformation is
a way of working on a scale where the possible "transformed r" values
have a closer-to-Gaussian distribution, with a variance (SD) that does
not depend on where along the rho axis one is.
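A simulation sketch of both points (in Python with numpy; the rho = 0.85,
n = 15 scenario is the one quoted above, and the CI recipe at the end is
the standard Fisher-z one, of the kind discussed in 6-6-3 below):

   import numpy as np

   rng = np.random.default_rng(0)
   rho, n, reps = 0.85, 15, 20_000
   cov = [[1.0, rho], [rho, 1.0]]

   r = np.empty(reps)
   for i in range(reps):
       x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
       r[i] = np.corrcoef(x, y)[0, 1]

   # skewed: a long left tail, but squeezed against the bound at +1
   print(np.percentile(r, [2.5, 50, 97.5]))

   # Fisher's z = atanh(r) is close to Gaussian, with SD ~ 1/sqrt(n - 3)
   z = np.arctanh(r)
   print(z.std(ddof=1), 1 / np.sqrt(n - 3))

   # 95% CI for rho from a single observed r: build it on the z scale,
   # then back-transform with tanh
   r_obs = 0.85
   lo, hi = np.arctanh(r_obs) + np.array([-1.96, 1.96]) / np.sqrt(n - 3)
   print(np.tanh(lo), np.tanh(hi))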
6-6-3 CI for rho
================

Notice the (CORRECT!) wording: CI for a PARAMETER! You might want to look
at the nomogram and my 607 notes. These are also helpful if one wants to
know, if one calculates a correlation from n pairs, how precise the
estimate of (i.e. how narrow the CI for) rho will be. I have not figured
out how to use SAS to calculate the confidence intervals of section 6-6;
the sketch above shows one way to compute them directly.

Problems: Q 14 -- IMPORTANT note re "Method Comparisons"
========================================================

This example is used to practice the material in section 6-7-1. However,
it should not be taken as an endorsement of correlation as a way to
quantify the performance of a proposed replacement ("easier") medical
test for a more complex, more painful, or more costly "gold standard"
reference test. The landmark article by Altman and Bland (Lancet, 1986)
explains what is incorrect about using the correlation coefficient in
this circumstance, and offers a much more informative way to present the
data graphically, and to calculate a simple numerical summary measuring
the "accuracy" of the new method relative to the reference standard, or
of one (imperfect) measurement instrument with another.

------------------------------------------------------
example of what NOT to do... (letter to Am J Epidemiol)
------------------------------------------------------

RE: "EVALUATION OF TWO FOOD FREQUENCY METHODS OF MEASURING DIETARY
CALCIUM INTAKE"

Cummings et al. (1) have recently compared the values from two food
frequency methods of estimating dietary calcium intake with the values
derived from seven-day food records. They based most of their inferences
on correlation coefficients (r), the highest of which was 0.76. Although
the authors were somewhat guarded in their conclusions, they nevertheless
suggested that the food frequency instrument could be clinically useful.

As recently pointed out by Duffy (2), the use of correlation for
comparing methods of measurement is based on the misconception that the
correlation coefficient is a measure of agreement. It is in fact only a
measure of linear association and gives no direct information about
agreement (3). The simplest way of assessing agreement is by considering
the mean and standard deviation of within-subject differences between the
two methods, combined with simple graphic display (2, 3).

The (mis)use of correlation for comparing methods of measurement is rife
in the medical literature. Duffy (2) observed that the inappropriateness
of correlation for comparison of methods "should be borne in mind by
authors, editors, and referees in the future," an exhortation that must
be reiterated.

1. Cummings SR, Block G, McHenry K, et al. Evaluation of two food
   frequency methods of measuring dietary calcium intake. Am J Epidemiol
   1987;126:796-802.
2. Duffy SW. Re: "Seven-day activity and self-report compared to direct
   measure of physical activity" (Letter). Am J Epidemiol 1986;123:557.
3. Bland JM, Altman DG. Statistical methods for measuring agreement
   between two methods of clinical measurement. Lancet 1986;1:307-10.

Douglas G. Altman
Medical Statistics Laboratory, Imperial Cancer Research Fund
PO Box 123, Lincoln's Inn Fields, London WC2A 3PX, England.

-------------------------------------------------------------

A scanned copy of this paper is available elsewhere on this webpage,
under the title "Bland & Altman".
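For concreteness, here is a minimal sketch of the summary the letter
recommends (in Python with numpy; the paired calcium-intake data are
invented): the mean and SD of the within-subject differences, and the
"limits of agreement" built from them.

   import numpy as np

   rng = np.random.default_rng(0)
   record = rng.normal(800, 200, size=50)        # seven-day record, mg/day
   freq = record + rng.normal(30, 90, size=50)   # food-frequency estimate

   diff = freq - record                          # within-subject differences
   mean_d, sd_d = diff.mean(), diff.std(ddof=1)
   print(mean_d, sd_d)                           # bias, and its spread
   print(mean_d - 1.96 * sd_d, mean_d + 1.96 * sd_d)  # ~95% limits of agreement

   # the graphic display: plot diff against the pair average,
   # (freq + record) / 2, with horizontal lines at the limits

Note that r for such data can be sizeable even when the limits of
agreement are far too wide for the two methods to be used
interchangeably.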
Q 14: Creating higher correlation by enlarging the range of one or both
variables
========================================================================

This is a good example of how one can "enhance" a correlation by
amalgamating the data from 2 subgroups. The (finger, venipuncture)
hemoglobin values for men might form an "ellipse" centered on (?,?),
while those for women might center on a different (?,?). The values for
both genders combined would then form a more elongated ellipse, and thus
yield a higher correlation coefficient. You would get the same phenomenon
using the (height, weight) values for men and women separately, and for
the combined genders, as the sketch below shows.
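A minimal simulation sketch of this amalgamation effect (in Python with
numpy; the height and weight distributions for the two genders are
invented): within each gender r is modest; combining the two elongates
the ellipse and raises r.

   import numpy as np

   rng = np.random.default_rng(0)

   def group(mean_h, mean_w, n=1000):
       # invented (height in cm, weight in kg) pairs for one gender
       h = rng.normal(mean_h, 6, size=n)
       w = 0.5 * (h - mean_h) + rng.normal(mean_w, 5, size=n)
       return h, w

   h_f, w_f = group(163, 60)    # women
   h_m, w_m = group(177, 78)    # men

   print(np.corrcoef(h_f, w_f)[0, 1])   # within women: ~0.5
   print(np.corrcoef(h_m, w_m)[0, 1])   # within men:   ~0.5
   print(np.corrcoef(np.r_[h_f, h_m], np.r_[w_f, w_m])[0, 1])  # combined: ~0.8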