NOTES ON KKMN CHAPTER 1    jh 1999.05.29

1.1 Concepts
============

Footnote: The word "multivariate" is already "taken". For example, one might search for the "structure" among the responses to the 100 items in a psychometric instrument: one might study how many separate dimensions they represent, which items to group together to form different subscales, etc. A second example would be the use of systolic blood pressure, diastolic blood pressure and pulse rate as a 3-dimensional response in a medical research study (analyzed by "multivariate" techniques such as Manova, Mancova, the multivariate (Hotelling's) T-test, etc.). In our case, the response will be 1-dimensional (scalar). See the article "Appropriate uses of multivariate analysis" in www.epi.mcgill.ca/hanley/c697/

We will try to stay away from the terms "dependent" and "independent" variables. The terms "outcome" or "response" variable are more expressive than "dependent" variable. Likewise, we will use "stimulus" variable, "explanatory" variable or "predictor" variable instead of "independent" variable. The term "independent" presumably has its origins in settings where an investigator has the freedom (independence) to turn knobs to set temperature, humidity, etc., and to observe the response in some other ("dependent") variable. The term "independent" is ill-suited to a variable like maternal smoking and its role in the etiology of low birth weight. The word "determinant" is more natural, and more expressive.

It often makes more sense to divide up the so-called independent variables in a multiple regression according to their ROLES. Often, the focus is on one particular determinant (such as maternal smoking here). The contrast(s) (comparison(s)) is (are) in terms of the different levels (at least 2) of this variable. In epidemiology we loosely and generically refer to this as the "exposure" variable. In a psychology experiment (or a clinical trial) it might be the variable whose levels are being deliberately (experimentally) manipulated in order to observe how much it influences responses. The other variables on the right side of the multiple regression might play the role of

(i) "confounding" variables, which need to be "adjusted" for to make the comparison "fairer";

(ii) determinants of the response which do not confound the comparison, and are of no direct interest, but are responsible for a lot of the variation in response; their inclusion removes this unwanted "noise";

(iii) modifiers of the stimulus-response relationship.

In other research situations, there might not be one determinant that is the primary focus of the analysis. Rather, all determinants might be considered of equal status, and the questions might be how many of them, or which combination of them, are minimally sufficient to "predict" the variation in outcome. Or it might be that one is interested in the incremental predictive value of certain ones. It should be obvious from the titles of the reports under I-2 below what the roles of the different variables are in each instance.
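To make these three roles concrete, here is a minimal sketch (mine, not the textbook's) in Python. The variable names and numbers are invented for illustration: "smoke" plays the exposure role, "age" the noise-reducing role, and the smoke:age product term lets maternal age act as a modifier of the smoking effect.

```python
# Illustrative only: simulated data, hypothetical variable names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(18, 40, n)             # role (ii): determinant of response
smoke = rng.binomial(1, 0.3, n)          # the "exposure" (0 = no, 1 = yes)
bwt = 3000 + 15 * age - 150 * smoke + rng.normal(0, 300, n)  # birth weight, g

df = pd.DataFrame({"bwt": bwt, "age": age, "smoke": smoke})

# "Adjusted" comparison: the smoking contrast, holding age constant.
adjusted = smf.ols("bwt ~ smoke + age", data=df).fit()

# Effect modification: allow the smoking effect to vary with age.
modified = smf.ols("bwt ~ smoke + age + smoke:age", data=df).fit()

print(adjusted.params["smoke"])       # close to the simulated -150
print(modified.params["smoke:age"])   # close to 0: no modification built in
```

(In this simulation age merely removes "noise"; to play role (i), confounding, it would also have to differ between smokers and non-smokers.)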
1-1. 3rd paragraph (experiment).
================================

The term "predictor" is clearly not the most appropriate one here. It would be more natural to speak of subjects or units being assigned to treatments or experimental conditions rather than to different levels of a predictor. One doesn't think of a particular treatment as a "predictor". Nor should one think of random allocation as the defining characteristic of an experiment. Rather, it is the deliberate manipulation on the part of the investigator, IN ORDER TO LEARN about its effect. And, if one wants to maximize the ability to isolate real effects, and to avoid artefacts or distortions produced by other influences, random allocation is not necessarily the best (or only) protection. With a small number of subjects, careful control of (or balancing of subjects on) these other influencing variables might have a much bigger pay-off [of course, one can -- and should -- use random allocation in addition!].

Key to the distinction between experimental and quasi-experimental studies is who is doing the allocation, and for what purpose. In the experiment, it is the investigator; in the other, it is usually administrative. For example, the assignment of 18-year-olds in the U.S. in 1970 to service in Vietnam was determined by lottery (see Moore and McCabe's textbook), but not by researchers, and not for research purposes. Note that the lottery per se doesn't make it an experiment.

Paragraph 5.
============

All studies are "observational" (how else can we get data?). It is much better to speak of "experimental" and "non-experimental" studies.

Paragraph 6. ("Error").
=======================

This is a very narrow view, and seems to lump "error" (and its negative connotation) together with natural variation. Certainly measurements have error, but they also display natural variation. If errors of measurement were eliminated, there would still be the interesting human variation. After all, we can measure birth weight to the nearest 10 grams (down to 1 gram if we want to spend more $$ to do so). In the big scheme of things, this focus on "measurement error" is entirely too simplistic. Even if weight were measured accurately to the gram, one might not even use all of this detail in the data analysis. More interesting is why infants weigh different amounts; and it is no more helpful to think of this variation as error than it is to say that my short height is an "error". In this course, I will try to replace the term "error" by the term "variation".

The irony of calling all variation "error", and of giving the impression that statistical analysis can do something about it [other than just quantify it], is that, when variables are indeed measured with error, regression analysts often completely ignore these errors. They pretend they don't exist. Thus, they are happy to call the variation in birth weight (Y) "error", yet they treat the average number of cigarettes a mother reports having smoked in the first trimester as an error-free measurement. The regression coefficients obtained from such data are biased, and their standard errors and confidence intervals do not reflect the possibly large "analytic errors" created by sweeping the issue of measurement error under the carpet. Unfortunately, the text ignores this issue. Some treatment of this problem is found in Chapter 4, section 5 (pp 164-166) of the text Applied Linear Statistical Models, 4th ed., by J Neter (and 3 other authors), published by Irwin, 1996. It is best read after we have dealt with Chapter 5 of KKMN. Good epidemiology texts also deal with the consequences of measurement errors in the X (and Y) variables.
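The bias just described is easy to see by simulation. The sketch below (again mine, with invented numbers) adds classical, independent measurement error to the X variable; the fitted slope then shrinks toward zero by roughly the factor Var(true X) / [Var(true X) + Var(error)], the "attenuation" or "regression dilution" factor.

```python
# Illustrative only: attenuation of a regression slope when X has error.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x_true = rng.uniform(0, 20, n)                   # true cigarettes/day
y = 3400 - 10 * x_true + rng.normal(0, 400, n)   # birth weight, g; true slope -10

x_reported = x_true + rng.normal(0, 5, n)        # reported with random error

slope_true = np.polyfit(x_true, y, 1)[0]
slope_reported = np.polyfit(x_reported, y, 1)[0]

print(f"slope using true X:     {slope_true:7.2f}")   # about -10
print(f"slope using reported X: {slope_reported:7.2f}")
# about -10 * (400/12) / (400/12 + 25) = -5.7: biased toward 0
```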
Paragraph 7: Organization of chapters.
======================================

Don't worry that we won't cover Chapters 17-20 explicitly; we will cover them implicitly by doing Chapters 5-15. Indeed, the Anova analyses in Chapters 17-20 are just special cases of the regression approach, where indicator (or "DUMMY") variables are used in a regression model to denote the levels of a categorical variable. Indeed, it is a pity that the authors wait until Chapter 14 to bring in indicator variables; we will work them in gradually (and without as much fuss) much earlier.

Likewise, there is no compelling reason (other than a historical one) why Chapter 15 (analysis of covariance) has to stand apart as a separate chapter. If we try to focus on the big (unified) picture of regression, then Chapter 15 is but a special case. Indeed, in epidemiology it is a very important special case -- it is how we use regression to deal with confounding. Unfortunately, the emphasis in the earlier chapters is on "building" models and on "prediction". In fact, the most important use of regression in (non-experimental) epidemiology is probably for adjustment [i.e. bias reduction, control of confounding] rather than for building "predictor" models.

Likewise, once one is comfortable with the regression models in Chapters 5-16, it isn't a huge jump to Chapters 23 and 24 (logistic and Poisson regression). The key is not to think of regression as "predicting" INDIVIDUAL responses, but rather as estimating the MEAN response for units having the same values of the "predictor" variables. Try to think of regression as being about two features of data: (1) the SYSTEMATIC patterns of (conditional) MEANS, and (2) the magnitude of the individual VARIATIONS about these means. In logistic and Poisson regression, we choose to model the systematic variation not in the MEANS (or proportions) per se, but in some function of these means or proportions. Nevertheless, most of the important regression ideas stay the same. (See the display at the end of these notes.)

I-2 Examples
============

Here are the titles and partial abstracts of some recent research reports in the biomedical literature. Identify the roles of the various variables mentioned in each one. Also browse through the descriptions of some of the datasets in:

www.epi.mcgill.ca/hanley/c678/
www.epi.mcgill.ca/hanley/c697/
www.epi.mcgill.ca/hanley/c622/
www.epi.mcgill.ca/hanley/c626/

and do the same exercise.

The word "multivariate" on MEDLINE
==================================

You might find it interesting to scan the abstracts of the articles that you can identify with the textword "multivariate", and to ask yourself what the purpose of the multivariate analysis was in each study.
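The display promised above: in my notation (not the textbook's), all three regression models share the same linear "systematic part"; they differ only in which function of the conditional mean that linear part describes.

```latex
\begin{align*}
\text{linear:}   \quad & E[Y \mid x_1,\dots,x_k] = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \\
\text{logistic:} \quad & \log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
  \quad \pi = E[Y \mid x_1,\dots,x_k],\ Y \in \{0,1\} \\
\text{Poisson:}  \quad & \log\mu = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,
  \quad \mu = E[Y \mid x_1,\dots,x_k]
\end{align*}
```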