Notes on KKMN Chapter 10                                      1999.05.26

Preamble
========

This chapter is not essential. If you prefer to work with slopes rather
than correlation coefficients, it doesn't add much new. But what
correlations allow us to do -- and slopes do not -- is to use
dimensionless measures of association -- particularly important if one
has no familiarity with, or interest in, the scales on which Y and the
X's are measured. The idea of "partialling out" or removing variation
associated with other variables is the same whether one works in the
slope scale or the correlation scale.

In the biomedical literature, one doesn't see partial correlations
reported very often. The impression is that they are more common in the
psychology and social sciences literature.

10-1
====

Nothing new here. But don't bother too much with features 4 and 5.
Instead, think of the X's as our choice, not Nature's.

10-2
====

You can get this table in SAS by running PROC CORR, i.e.

  PROC CORR; VAR Y X1 ... Xk;

In INSIGHT, click the Y and X's into "Multivariate Y". In their
example, you would have to create X3 = X2_squared first.

Their choice of X3, as the square of X2, makes for some trouble, since
we would not usually think of having Age_squared without having Age. We
will discuss in a later chapter the special issues of powers and
products of existing variables, and how to stay out of trouble when
using them.

10-3 Multiple Correlation Coefficient R
=======================================

I prefer the first of the two versions of equation 10.1 (p. 163), where
R(Y | X1, ... Xk) is expressed as a straight correlation between the
observed and predicted (fitted) Y values. I am less comfortable with
percentages or fractions of variance.

I don't know why the authors do not refer to the "SSY - SSE" in the
second version as "SS Regression". Moreover, they could have -- just
like in simple linear regression -- shown R-squared as the ratio of two
average squared amplitudes, i.e. the average squared deviation of Yhat
from Ybar and the average squared deviation of Y from Ybar:

                 average {(Yhat - Ybar)^2}
  R-square  =    -------------------------  .
                 average {(Y - Ybar)^2}

In INSIGHT it is easy to check numerically that version 1 holds -- just
run a correlation between the Y's and the Yhat's. I suspect that they
use the second version to actually calculate it.
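One can do the same check in SAS itself. Here is a minimal sketch; the
data set name PATIENTS and the regressors X1 and X2 are placeholders of
my own, not from the text: fit the regression, save the fitted values,
then correlate them with Y.

  * Sketch: check version 1 of equation 10.1
    (hypothetical data set PATIENTS, variables Y X1 X2) ;
  PROC REG DATA=patients;
    MODEL Y = X1 X2;
    OUTPUT OUT=fitted P=Yhat;   * Yhat = predicted (fitted) Y ;
  RUN;

  PROC CORR DATA=fitted;
    VAR Y Yhat;                 * r(Y, Yhat) should equal R(Y | X1, X2) ;
  RUN;

The r(Y, Yhat) printed by PROC CORR should match the square root of the
R-square that PROC REG reports for the same model.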
10-4 Relationship to Multivariate Gaussian Distribution
=======================================================

I consider this version to be too restrictive. However, the equation
linking mu(Y | X1,X2) to mu(Y) and the other 2 terms also holds -- with
mu's replaced by sample averages -- in less restrictive situations.

Indeed, we can turn it around so as to get an equation for the
difference in average response (Y) in two groups (with the groups
defined by, say, X1 = 0 and X1 = 1), with or without adjustment for the
fact that these two groups differ with respect to a confounder X2. This
is using multiple regression to adjust for confounding. We will return
to this in Chapters 11 and 15.

10-5 Partial Correlation Coefficient
====================================

Whereas the MULTIPLE correlation is between Y and ALL of the X's in
question, the PARTIAL correlation is between Y and a single X, with one
or more other X's "removed", "controlled for", or "partialled out".
Think of the "zero order" correlations as Y with X, not controlling for
any other X's.

SAS output for Partial Correlations
===================================

Note that the two types of squared partial correlations (pp. 166,
bottom) are the analogs of the "variables added in order" and
"variables added last" sums of squares in the previous chapter. Note
the imaginative and highly self-explanatory labels "Type I" and "Type
III", just like "Type I SS" and "Type III SS", that statisticians are
good at creating (just like type I error and type II error!)

Again, I would have preferred it if SAS had been a bit friendlier and
output the partial R's rather than their squares. It seems somewhat
circular to have to go through the slope estimates in the regression
just to get the signs of the partial correlations.

Another way to get at the partial correlations directly is via PROC
CORR in SAS, viz.

  PROC CORR;
    VAR Y;
    WITH X1 X2 X3;
    PARTIAL X4;

would get you the partial correlations of Y with each of X1, X2 and X3,
in each case "controlling for" X4.

10-5-1 Tests for Partial Correlations
=====================================

No surprise that the tests are the same as for the corresponding
beta's! Or, if you prefer, PROC CORR above gives p-values directly.

10-5-2 Partial Correlation & Partial F test
===========================================

One shouldn't have any problems with the formula for the square of
rho(Y, X1 | X2) if one is comfortable with the "one X" case. If we only
had X1, then the formula for rho(Y, X1) is

                      sigma_sq(Y) - sigma_sq(Y | X1)
  rho_sq(Y, X1)  =    ------------------------------ .
                               sigma_sq(Y)

If now we "condition on" X2, which is the equivalent of looking at
rho_sq(Y, X1) within narrow "slices" of X2, then the formula goes
through directly. So, all one needs to do is condition each term above
on X2 as well:

                          sigma_sq(Y | X2) - sigma_sq(Y | X1, X2)
  rho_sq(Y, X1 | X2)  =   --------------------------------------- .
                                     sigma_sq(Y | X2)

Formula 10.2 is obtained by substituting sample statistics for the
population parameters. Note that the parameter version assumes a
"trivariate Gaussian" distribution of Y, X1 and X2.

10-5-3 Another formula for r(Y, X | Z)
======================================

Formula 10.3 is a useful version, because it can be computed directly
from the simple pairwise correlations between Y, X and Z. Moreover, as
the authors explain in part at the top of p. 169, one can tell what
happens if X is positively, negatively (or not at all) correlated with
Z.

I'm delighted to see the authors use a different letter Z for the
"control" or "confounding" variable. In epidemiology, we call r(Y, X)
the "crude" correlation between Y and X ("outcome" & "exposure");
r(Y, X | Z) is the "adjusted" correlation.

10-5-4 Partial Correlation as a correlation of two sets of residuals
====================================================================

This formulation is the same as the "multiple regression as a series of
simple regressions" concept, which I mentioned in my 607 notes. By the
way, I first learned about this in the excellent regression text by
Draper and Smith. (A small SAS sketch at the end of these notes shows
this correlation-of-residuals calculation.)

Table 10-3 (p 171)
==================

We use these diagrams a lot in teaching confounding in epidemiology.
Recall the analysis Galton did on the correlation between X1 = parents'
height and Y = the heights of their offspring. There was another
variable, Z = gender of the offspring. Galton handled it in an elegant
way. Where does it fit among the 4 cases in Table 10-3?

10-6 and 10-7
=============

Follow if you wish. One doesn't usually see these reported in the
biomedical literature.
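SAS sketch for 10-5-4
=====================

For those who want to verify the residuals formulation of 10-5-4
numerically, here is a minimal sketch, in the same spirit as the
PROC REG / PROC CORR check earlier. The data set name PATIENTS and the
variable names Y, X and Z are again my own placeholders: regress Y on Z
and X on Z, keep the two sets of residuals, and correlate them.

  * Sketch for 10-5-4 (hypothetical data set PATIENTS, variables Y X Z);
  PROC REG DATA=patients;
    MODEL Y = Z;
    OUTPUT OUT=resy R=e_y;      * e_y = residual of Y after removing Z ;
  RUN;

  PROC REG DATA=patients;
    MODEL X = Z;
    OUTPUT OUT=resx R=e_x;      * e_x = residual of X after removing Z ;
  RUN;

  DATA both;
    MERGE resy resx;            * one-to-one merge: same observations, same order ;
  RUN;

  PROC CORR DATA=both;
    VAR e_y e_x;                * r(e_y, e_x) = r(Y, X | Z) ;
  RUN;

The correlation printed here should agree with the r(Y, X | Z) obtained
from the PARTIAL statement in the PROC CORR run shown in section 10-5.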