NOTES ON KKMN CHAPTER 4                                    jh 1999.05.29

4-1
===

I prefer to think of regression as a statistical tool for evaluating the relationship of one or more "Xs" to the X-SPECIFIC MEANS OF the corresponding Y variable.

Example of a variable C which might not have been considered in the design, i.e. in deciding on whom (in terms of Xs) one is to measure or observe Y: X2 = age, in a study where X1 = number of years worked in a noisy environment and Y = hearing loss. Clearly, if one had a choice at the subject selection stage (i.e. if one were given/knew the values of the two X-variables for each potential subject, and had some choice in which subjects were going to be selected for measurement of Y), one would not select them at random from the (X1,X2) distribution.

The reason why multiple regression is "equally applicable" in more controlled, or even experimental, situations [e.g. where persons might be randomly assigned to various levels of X1] has to do with the ability of X2 to reduce/explain/remove the variation in Y that is seen in subjects who were at the same level of X1.

Application 2:
==============

More often than not in medicine and epidemiology, we are not in the strictly "predictive" mode. And even when we are, we are really ESTIMATING what the MEAN level of Y is in persons with a certain "X" profile. Then, we use this same estimate of the mean for all persons with the same X profile.

The simplest example might be if Y = height of a randomly chosen adult and X = X1 = gender (k = 1 "predictor" variable). Then our best predictor of the height of a particular female would be the mean height of all females in the dataset.

A clearer example where the objective is prediction rather than description might be the estimation of fat thickness when trying to reach muscle with a certain length of injection needle. As X's one might use height and weight (or body mass index) and gender.
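The "X-specific mean as best predictor" idea can be sketched in a few lines of Python; the heights below are invented purely for illustration:

```python
# Sketch of "best predictor = X-specific mean", with a tiny made-up
# dataset (the heights below are invented for illustration only).
heights = {
    "F": [160.2, 165.0, 158.7, 162.4, 167.1],   # heights (cm) of females
    "M": [175.3, 180.1, 172.8, 178.5, 176.0],   # heights (cm) of males
}

def predict_height(gender):
    """Best predictor for a new person: the mean Y among those in the
    dataset who share the same X profile (here, the same gender)."""
    ys = heights[gender]
    return sum(ys) / len(ys)

print(round(predict_height("F"), 1))  # mean female height in the data
print(round(predict_height("M"), 1))  # mean male height in the data
```

The same prediction (the group mean) is used for every person with that X profile, which is exactly the point made above.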
Here, as in all prediction problems, there isn't much point in just knowing the AVERAGE distance to the muscle for persons of a certain measurable profile; one also needs to know how VARIABLE this distance is in persons with the same profile. Indeed, one should characterize regression equations or models as having not one but TWO distinct parts: (1) the equation that predicts the MEAN level for a given X profile, and (2) an estimate of the VARIATION around the mean. If the estimation of the mean itself involves considerable statistical uncertainty, then estimates of the range of variation in individual Y values have both a statistical uncertainty component and the natural component due to the fact that, even with the same X profile, "everyone is built slightly differently".

Application 3:
==============

I am pleased that the authors make this distinction between the roles of the X's and the C's. To a computer that fits a regression equation, the X's and the C's are all alike -- they are both on the right-hand side of the regression equation, as "independent variables". IT IS UP TO THE USER TO TREAT THEM AS X's AND C's!

It might be good to think of some NON-REGRESSION ways of "controlling for" the C's. For example, if one had enough data, one could "make all other factors (the C's) equal" by splitting the dataset into subsets where persons all have the same C profile, then estimate the effect of, say, X1 within each subset, and take an average of the estimates of the X1 effect. Indeed one can think of multiple regression with the C's "in the model" as a "synthetic" or "poor-person's" analogue of the restriction/subset method. As will be reiterated many times, there is "no free lunch" with regression models.

Application 4:
==============

We will also emphasize that the ranking of variables on their descriptive or predictive ability always has to be done IN A GIVEN CONTEXT, i.e. it has to be specified WHAT OTHER VARIABLES (X's) have already been used/taken into account.
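This context-dependence can be demonstrated in a small simulation. The data-generating model below is invented, purely for illustration: X2 is a genuinely useful predictor, yet its apparent contribution shrinks dramatically once a correlated X1 is already in the model.

```python
# A small simulation (the data-generating model is invented, purely
# for illustration) of the point that a variable's apparent predictive
# ability depends on WHAT OTHER X's are already in the model.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # X2 is correlated with X1
y = 1.0 + x1 + x2 + rng.normal(size=n)     # both truly matter for Y

def r2(xs, y):
    """R-squared from least squares of y on the given predictors
    (an intercept is always included)."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ beta).var() / y.var()

r2_x2_alone = r2([x2], y)                       # X2 in an "empty" context
r2_x2_after_x1 = r2([x1, x2], y) - r2([x1], y)  # X2's increment given X1

print(f"R2 for X2 alone:          {r2_x2_alone:.2f}")
print(f"R2 added by X2 after X1:  {r2_x2_after_x1:.2f}")
```

With these made-up numbers X2 looks strong on its own but adds little once X1 is in, simply because X1 and X2 are correlated; under a different correlation structure the ranking could change.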
AND, the decision as to what [C's] to "control" for before considering these different X's, and whether there is a natural or logical order to the X's, is one for the user to take.

Application 5:
==============

Note here the term MATHEMATICAL (as opposed to BIOLOGICALLY DRIVEN). Note that SEVERAL different combinations of the X's may do equally well. The emphasis here seems to be prediction rather than understanding of the components (imagine just naming the variables X1, X2, etc. and giving the task to someone who knows nothing about the Y or X variables).

Applications 6 and 7:
=====================

If we drop the C1 in Application 6, then the FOCUS in Applications 6 and 7 IS THE SAME! The difference is that it is explained in plain (non-statistical) language in 6, and that "gender" in Application 6 has the same role as smoking (= X2) in Application 7.

I will plead in this course that we "say what we mean", using as non-technical and expressive language as we can. The word "interaction" and the term "interactive effects" should be discouraged, along with terms like a "normal" distribution. As we will see, a better way to describe the different smoking (X1) - blood pressure relationships in males and females is to say the relationship is "modified" by gender. Or, if one wants to be folksy but expressive, why not speak of "different slopes for different folks"!

Another way to think of this situation of effect modification (or "interaction") is as one where one needs a separate "story" for each level of the modifier variable. In our example of what size needle is needed to reach muscle, and how this relates to say body mass index, do you think you would observe the same relationship in men and women? Note that statisticians tend to use the term "interaction" and epidemiologists the term "effect modification" for the same concept.

Application 8:
==============

I'm not sure! I see a difference between this and say Applications 2 and 3.
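The "different slopes for different folks" situation can be simulated in a few lines; all numbers below are invented, with the X1 - blood pressure slope deliberately made different in males and females:

```python
# "Different slopes for different folks": simulated data (all numbers
# invented) where the X1 - blood pressure slope differs by gender,
# i.e. the X1 effect is "modified" by gender.
import numpy as np

rng = np.random.default_rng(1)
n = 400
x1 = rng.uniform(0, 40, size=n)         # e.g. years of smoking
female = rng.integers(0, 2, size=n)     # 1 = female, 0 = male
# True slopes: 0.9 in males, 0.9 - 0.6 = 0.3 in females
y = 120 + 0.9 * x1 - 0.6 * female * x1 + rng.normal(0, 5, size=n)

def slope(x, y):
    """Least-squares slope of y on x."""
    return np.polyfit(x, y, 1)[0]

slope_m = slope(x1[female == 0], y[female == 0])
slope_f = slope(x1[female == 1], y[female == 1])
print(f"slope in males:   {slope_m:.2f}")   # should be close to 0.9
print(f"slope in females: {slope_f:.2f}")   # should be close to 0.3
```

Fitting the two genders separately tells the two "stories" directly; a single model with a product term X1*female tells the same two stories in one equation, its coefficient estimating the difference between the slopes.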
It doesn't actually make sense to speak of a VALID and precise estimate of, say, a regression coefficient for X1 UNLESS one specifies exactly which other X (and/or C) variables are included in the model. In other words, the B1 in the model

  E( Y | X1, X2, X3) = B0 + B1 * X1 + B2 * X2 + B3 * X3

is a legitimate object of enquiry. But so is the "B1" in the model

  E( Y | X1, X2) = B0 + B1 * X1 + B2 * X2.

BUT, the B1 and B2 in the second model DO NOT HAVE THE SAME MEANING AS the B1 and B2 in the first model. Indeed, that is why I don't like to write E(Y) = some function of the X's. Instead, I specify (by putting them after the vertical character |) what variables are being considered simultaneously. In some books, authors might use B1, B2 and B3 in model 1, and B_prime_1 and B_prime_2 in model 2, just to drive home the fact that ALL REGRESSION COEFFICIENTS ARE CONTEXT-DEPENDENT!

4-2 Association versus causality
================================

The authors missed a chance to mention a few of the many amusing nonsensical correlations that have been used by teachers over the years to drive home the point that association isn't causation. Moore has some good ones in his book Statistics: Concepts and Controversies, and there are several more in the video on correlation in the Against All Odds series.

The authors also missed a chance, after their "a statistically significant association does not establish a causal relationship" statement on p. 36, to make the converse statement: causal associations can go undetected (because of confounding, measurement errors, etc.).

4-3
===

One needs to be careful not to overstate what the coefficients of a regression model mean. It may be tempting to say that "CHANGES in X are related to CHANGES in Y."
For example, in the Busselton dataset, collected cross-sectionally (see under Chapter 5 on the web page), there is a clear negative regression coefficient (and, thus, correlation coefficient) when (in adults) Y = height is regressed on X = age. It could be that some of this is due to osteoporosis, but it is stronger than that, and in both genders. Some of it is surely a generational effect -- older subjects in the study grew up in the late 1800's and early 1900's. (Of course, it might also be that taller people don't live as long as shorter people!)

In this example, and indeed in general -- unless one has truly longitudinal data -- it is much more accurate to say "HEIGHTS ARE SHORTER IN OLDER PEOPLE" than to say "as we get older we get shorter" or "as we change in age, we change in height". I.e., if the investigator didn't actually "change" the independent variable, we shouldn't speak of changes in X in relation to "changes" in Y. Rather than introducing the appearance of a "dynamic" situation, emphasize the "static" (between different persons at the same time) nature of the relationship by saying that differences between persons with respect to X were associated with differences with respect to Y.

On the same topic, what do you think is the explanation for the relationship seen in the report "why do older men have big ears?" (also under Chapter 5 on the web page)?

Again, I take the opportunity to try to get away from statisticians' preoccupation with the word "error". While it is true that the main reason for the imperfect relationship one would observe between, say, barometric pressure and boiling point of water is errors in measurement, the reasons why persons of the same age do not all have the same blood pressure (or the same height) are not primarily due to "error" in the measurement of blood pressure or height or age. There simply is natural variation.
Of course, one might think of these unexplainable variations as "errors" on our part in not being able to understand why they occur. The authors acknowledge individual variability towards the end of the section.
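The distinction between "error" and natural variation can be made concrete in a toy simulation. In the sketch below (all numbers invented), blood pressures are generated -- and "measured" -- with NO measurement error at all, yet persons of the same age still differ, so the regression residuals do not vanish:

```python
# Simulated illustration (invented numbers) that residual variation
# need not be measurement "error": blood pressures are generated, and
# "measured", with NO measurement error, yet same-age persons differ.
import numpy as np

rng = np.random.default_rng(2)
n = 1000
age = rng.uniform(30, 70, size=n)
# Natural person-to-person variation (SD = 10) around the age-specific
# mean blood pressure; no measurement error is added anywhere.
bp = 100 + 0.5 * age + rng.normal(0, 10, size=n)

slope, intercept = np.polyfit(age, bp, 1)
residual_sd = np.std(bp - (intercept + slope * age))
print(f"residual SD: {residual_sd:.1f}")  # close to the natural SD of 10
```

The residual standard deviation recovers the natural between-person scatter, not any "error" of measurement -- which is why "natural variation" is the better name for it.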