from Annual Review of Public Health 1983. 4:155-180

APPROPRIATE USES OF MULTIVARIATE ANALYSIS

James A. Hanley
Department of Epidemiology and Health, McGill University, Montreal, Quebec, Canada H3A 2A4 (name and address have since changed)

INTRODUCTION

Comparison of the articles in today's biomedical literature with those of twenty years ago reveals many changes. In particular, there seem to have been large increases over time in three indices: the number of authors per article, the number of data-items considered, and the use of multivariate statistical methods. While cause and effect among these three indices is unclear, there is little doubt that the growth in a fourth factor, namely, computing power and resources, has made it much easier to assemble larger and larger amounts of data. Packaged collections of computer programs, driven by simple keywords and multiple options, allow investigators to manage, edit, transform, and summarize these data and fit them to a wide array of complicated multivariate statistical "models."

In addition to making it easy for the investigator to include a larger number of variables in otherwise traditional methods of statistical analysis, the increased speed and capacity of computers have also been partly responsible for the new methods being developed by contemporary statisticians. For example, some of the survival analysis techniques discussed below can involve several million computations.

How do these trends in the availability and use of multivariate statistical methods affect the health researcher who must decide what data to collect and how to analyze and present them? How does the reader of the research report get some feeling for what the writer is attempting to do when he uses some of these complex-sounding statistical techniques? Are these methods helping or are they possibly confusing the issue?

Unfortunately one cannot look to one central source for guidance about these newer methods. Descriptions of many of them are still largely scattered in the (often highly technical) statistical literature or else presented in monographs in which the connections to other related techniques may not be very evident. Moreover, the reader is often not interested in references to the technical intricacies of maximum likelihood equations, to the methods of solving them, or to the computer program or package used to perform the calculations; rather he is worried about what the technique is attempting to do, what the parameters mean, and whether the assumptions and conclusions are appropriate.

The plan of this chapter then is not so much to review all of the recent developments in statistical methodology, but rather to use examples from the literature (a) to give an overview of what multivariate analysis is all about, (b) to describe, in general terms, what it can and cannot be expected to do, and (c) to discuss in a little more detail some newer techniques, as well as some that were developed some time ago but are only now becoming popular, namely (i) logistic regression, (ii) log-linear models for multiway contingency tables, (iii) proportional hazards models for survival data, and (iv) discriminant analysis.

MULTIVARIATE ANALYSIS: AN OVERVIEW

Scope

The term multivariate analysis has come to describe a collection of statistical techniques for dealing with several data-items in a single analysis.
Although authors differ about where to draw exact boundaries, for example whether multiple regression is a univariate or multivariate technique, it is more a matter of semantics than it is of substance. I follow here the convention of others (10, 28, 33, 43) and define any analysis that involves three or more variables simultaneously as "multivariate." As such, the term multivariate analysis encompasses everything except confidence intervals, chi-square tests for two-way contingency tables, t-tests (unpaired), one-way analysis of variance, and simple correlation and regression. It includes a huge variety of techniques, since even with just three variables, there are a large number of possibilities (Table 1). The method of analysis depends heavily on whether one is interested in interrelationships or in comparisons, and on whether variables are qualitative or quantitative. The most I can do in this short space is to give a brief roadmap, along with pointers to helpful descriptions or examples. In many situations there will not be one single best method of analysis. As Bishop et al (10) point out, multivariate analysis should be thought of as a "codification of techniques of analysis, regarded as attractive paths rather than straightjackets, which offer the scientist valuable directions to try."

Table 1  A taxonomy of parametric statistical methods

                                       Response variable(s)
                   ---------------------------------------------------------------------
                          Univariate                          Multivariate
                   Discrete          Continuous         Discrete          Continuous
Stimulus              [1]               [2]                [3]               [4]
variable(s)
-----------------------------------------------------------------------------------------
Univariate
  Discrete         Contingency       t-test;            Multidimensional  Discriminant
                   table             1-way analysis     contingency       analysis;
                                     of variance        table             logistic
                                     (Anova)                              regression
  Continuous       Logistic          Correlation;                         Multivariate
                   regression;       simple                               regression
                   discriminant      regression
                   analysis
Multivariate
  Discrete         Multidimensional  Multi-way          Multidimensional  Multivariate
                   contingency       Anova              contingency       anova
                   table                                table             (Manova)
  Continuous       Logistic          Partial                              Multivariate
                   regression;       correlation;                         regression;
                   discriminant      multiple                             canonical
                   analysis          regression                           analysis
  Mixed            Logistic          Analysis of                          Multivariate
                   regression;       covariance                           regression;
                   discriminant      (Ancova)                             canonical
                   analysis                                               analysis

Types of analyses

Multivariate statistical techniques may be conveniently divided into those in which the variables involved (a) are all of "equal status" or (b) fall naturally (or with some gentle pushing) into two sets, those which are influenced (response variables) and those which influence (stimulus variables). In the first group of techniques, which includes Principal Components Analysis, Factor Analysis, and Cluster Analysis, the emphasis is on the internal structure of the data-items in a single sample.

Principal Components Analysis (PCA) asks whether a large number of quantitative data items on each subject can be combined and reduced to a single (or at most a few) new variables (principal components) without losing much of the original information. In other words, the aim is to describe the subjects in terms of their scores (weighted sums of the original variables) on a much smaller number of new variables. These new variables (components) are built to be uncorrelated with each other, so as to avoid any redundancy. Also, they are arranged in decreasing order of "information" so that subjects are furthest apart from each other on the first component, less far apart on the second, and so on.
If the total information in the original variables is "compressible," the subjects will not vary very much on the latter components, and these can be discarded as redundant. Theoretically, since there are as many principal components as there are original variables, retaining them all permits one to reproduce the original data. An example in which the first principal component captured 67% of phenotypic variance in a population and was then used as a (univariate) index of overall body size in all subsequent analyses can be found in (11).

Factor Analysis (FA) asks whether subjects' quantitative responses on a large number of items and the patterns or correlations among these responses are "explainable" by thinking of each item or variable as measuring or reflecting a different mix of a smaller number of underlying "factors" or "traits" or "dimensions." As originally conceived, it differs from PCA in a number of ways. Whereas PCA "constructs" new variables from already observed ones, FA goes in the other direction, "reconstructing" the observed variables from latent ones. This distinction may have been too subtle and has largely evaporated; moreover, most computer packages use principal components as one way of extracting factors. Second, FA usually assumes that although factors are translated into variables by a "mixing formula" that is common to all subjects, variables will also contain some variation that is unique to each subject. Third, whereas PCA is more a data-reduction technique, FA seeks actually to understand and label the various "factors." Fourth, unlike PCA, FA does not necessarily produce unique answers. Indeed, there are many methods of factor analysis.

FA techniques are used primarily to explore relationships and to reduce the dimensionality of a data set. They serve more for instrument building and index construction than as direct analytic tools. However, although they are closely associated in psychology with establishing construct validity, at least one author (40) considers them generally inappropriate for developing health indices. These techniques have been somewhat more useful when the context is of a physical nature, such as in studying air pollution patterns (35), but even then, there are difficulties (5). The few published examples of FA in epidemiology and public health have either concluded the obvious or concluded nothing at all. The same seems to hold true for their use in the medical literature (28).

By far the majority of the applications of multivariate statistical methods in the health sciences are of the second kind, where one or more variables serve as "outcomes" or "responses" or "target variables" (28), and others serve as "predictors" or "explanatory" or "carrier" (48) variables. These two sets of terms are gradually replacing the older and quite misleading terms, "dependent" and "independent" variables. Some authors subdivide the explanatory variables further into those of primary interest ("study variables") and those of a "disturbing" or "confounding" or "nuisance" nature; I return to this subdivision below.

The main types of techniques for dealing with stimulus-response studies are presented in Table 1, in the form of a multiway grid, according to whether the stimulus and response variable(s) (rows and columns, respectively) are one or many and according to whether they are all recorded on continuous measurement scales, or are all categorical (discrete), or a mixture of both.
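To make the data-reduction idea behind principal components concrete, here is a minimal sketch (in Python, with entirely hypothetical measurements and variable names; it is an illustration added for this discussion, not part of any of the cited analyses). The proportion of variance carried by the first component plays the same role as the 67% figure quoted above.

```python
import numpy as np

# Hypothetical morphometric data: 6 subjects x 4 quantitative variables
# (say height, weight, arm span, leg length); values are illustrative only.
X = np.array([
    [170.0, 65.0, 172.0, 95.0],
    [182.0, 80.0, 185.0, 103.0],
    [160.0, 55.0, 158.0, 88.0],
    [175.0, 72.0, 176.0, 98.0],
    [168.0, 60.0, 166.0, 93.0],
    [190.0, 92.0, 193.0, 108.0],
])

# Standardize, then extract the components from the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
R = np.cov(Z, rowvar=False)            # correlation matrix of the original variables
eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Proportion of the total variance captured by each component; if the first
# proportion is large (around two thirds, as in the body-size example), the
# first-component scores can serve as a single summary index.
prop_var = eigvals / eigvals.sum()
scores = Z @ eigvecs                   # subjects' scores on the new variables
print("proportion of variance by component:", np.round(prop_var, 2))
print("first-component (overall 'size') scores:", np.round(scores[:, 0], 2))
```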
It is worth dwelling for a moment on a number of contrasts between methods for analyzing a single (univariate) response that is "measured" on a continuous scale (column 2) and those for a corresponding response that is discrete (column 1).

1. Methods for analyzing a continuous response have been in existence for considerably longer (the principle of least squares for fitting a regression line dates back at least two centuries; the newest technique, analysis of covariance, is at least 50 years old).

2. These methods tend to choose parameters and judge the amount of variation explained by various factors using easily understood "distance" criteria such as least squares; in other words, they keep the analysis in the same scale or "metric" that the actual observations were measured on; by contrast, methods for analyzing a discrete response tend to measure "distance" and "fit" using a probability or "likelihood" scale (likelihood is defined as the probability, calculated after the fact, of observing the data values one did). Although the method of fitting parameters to maximize the likelihood is in no sense inferior (if anything it is generally superior from a technical standpoint), it is easier for readers to comprehend changes in R-squared than changes in a log-likelihood!

3. Regression equations for a continuous response are usually linear, involving additive terms, and can be fitted from simple summary statistics, whereas those for a discrete response are often nonlinear, and need to be fitted iteratively with several passes through the data.

4. Estimates from these nonlinear regressions tend to have skewed sampling distributions, giving rise to confidence intervals that are not symmetric. The odds ratio used in epidemiologic studies is a case in point. Fortunately, it is often possible to work in a scale (e.g. log) in which the confidence interval will be of a simpler, symmetric, shape and to change back to the desired scale at the finish, as illustrated in the brief sketch below.

As can be seen from Table 1, multiway contingency tables, logistic regression, and discriminant analysis all play dual functions: they can be used to analyze either a single response variable and several stimuli or several responses and a single stimulus. Indeed, as discussed below, this ability to reverse a "multiple response, single stimulus" situation and cast it into a more traditional and more workable "one response, multiple stimuli" regression framework is key to handling multiple response data.

As one proceeds to treat several response variables and several stimulus variables simultaneously, the level of complexity increases considerably: all but the few with n-dimensional vision are quickly lost. As a result, even though computer programs are available for them, the two "doubly-multivariate" techniques, multivariate regression and multivariate analysis of variance (Column 4, Table 1), are seldom used. Instead, investigators try first to construct a "univariate" response and then relate this to the several stimulus variables.

MULTIVARIATE ANALYSIS: PURPOSES

In this section I discuss the Why of multivariate techniques. Although there are many different techniques, they share a number of common aims and a common underlying philosophy. Of course, they also have many of the same pitfalls; I discuss some of these below. It is difficult to discuss multivariate techniques without also discussing the concept of statistical "models."
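Before turning to models and purposes, points 2-4 above can be made concrete with a small sketch. The data are hypothetical and statsmodels is simply one convenient modern implementation, not the software of the studies cited here; the point is only the contrast between a least-squares fit judged by R-squared and an iteratively fitted, likelihood-based logistic regression whose odds-ratio confidence interval is built symmetrically on the log scale and then exponentiated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Hypothetical data: a continuous exposure x, a continuous response y,
# and a binary response d (disease yes/no).
n = 200
x = rng.normal(size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)       # continuous response
p = 1.0 / (1.0 + np.exp(-(-0.5 + 0.8 * x)))             # true s-shaped (logit) curve
d = rng.binomial(1, p)                                   # binary response

X = sm.add_constant(x)

# Continuous response: least squares, closed-form fit, judged by R-squared.
ols = sm.OLS(y, X).fit()
print("least-squares slope:", round(ols.params[1], 2), " R-squared:", round(ols.rsquared, 2))

# Binary response: logistic regression, fitted iteratively by maximum likelihood,
# judged on the likelihood scale rather than by a distance criterion.
logit = sm.Logit(d, X).fit(disp=0)
print("log-odds slope:", round(logit.params[1], 2), " log-likelihood:", round(logit.llf, 1))

# The sampling distribution of the odds ratio exp(slope) is skewed; the usual
# remedy is a symmetric interval on the log scale, exponentiated at the finish.
b, se = logit.params[1], logit.bse[1]
lo, hi = np.exp(b - 1.96 * se), np.exp(b + 1.96 * se)
print("odds ratio per unit of x:", round(np.exp(b), 2), " 95% CI:", (round(lo, 2), round(hi, 2)))
```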
It sometimes helps to think of these models as comprising two parts, one that is deterministic (dealing with the expected structure, almost like a "law") and one that is stochastic (dealing with random variation). This first part will be of a more global nature, describing what should happen. It might describe how two chemical agents act together on a host or how a lung grows in volume as it grows in linear dimensions; it might be based on or summarize a psychological or sociological theory; or it might be a rough straight-line or curvilinear pattern seen in the data, and which one wants to follow up.

This "structural" part of the overall statistical model can be thought of as describing the systematic variations or pattern one would expect in a body of data. Although it is usually described in explicit mathematical equations with coefficients, powers, and the like, it does not have to be so precise. For example, the model might be: "the dose response relationship has no threshold," or "the underlying curve is expected to be concave," or "the risk of cancer will vary with age and be different in exposed and nonexposed groups, but the risk of cancer among the exposed relative to that among the nonexposed will remain the same over all ages."

The other part of the model, which some would regard as the probabilistic element, deals with the deviation of the observed data from the postulated pattern. It is often difficult, however, to separate the two parts of the overall model, since it is not clear where prior knowledge (pattern) ends and ignorance (unexplained variation) begins, i.e. whether aberrations are observed because the postulated pattern is a poor one (lack of fit) or because of some other reason.

Although this separation into systematic and random components, i.e. into signal and noise, is often used for responses that are recorded on a continuous scale, it is done much less frequently for binary responses. One learns very early in linear regression to think of both the systematic (the straight line) and the random (the scatter of the individual points from the line). In a binary regression, one still thinks of a systematic line (possibly "s-shaped" such as a probit or logit curve) but seldom stops to think about the noise about this curve. Part of the reason for not doing so is that the curve is fitted using likelihood, rather than distance, as the metric and part is that the variation is binary, not continuous.

The virtue of this "systematic plus random" paradigm has been recently illustrated in the Generalised Linear Interactive Modelling (GLIM) computer program (6): the program "generalizes" to a wide variety of continuous and binary response regressions by using different probabilistic models (Gaussian, Binomial, Poisson, etc.) and different "link functions" for changing the systematic portion of the model from straight line to s-shaped and so on. In fact, as GLIM makes apparent, there is a "distance" minimization intrinsic to the method of Maximum Likelihood.

With this preamble, I now go on to discuss, via examples where possible, the main aims and uses of multivariate statistical techniques and models. We see four main purposes:

1. to summarize, to smooth out, to see patterns
2. to make comparisons fair, to compare like with like
3. to make comparisons clear, to remove noise
4. to study many factors at once, to explain variation.
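Before turning to these purposes, here is a minimal sketch of the "systematic plus random" idea just described, again with hypothetical data and with statsmodels standing in for GLIM: the same straight-line systematic part is combined with Gaussian, Binomial, or Poisson random parts, the link function doing the bending of the line where needed.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(0, 10, size=n)
X = sm.add_constant(x)                                # the systematic part: a + b*x

# Three kinds of random (probabilistic) part around the same systematic part.
y_gauss = 1.0 + 0.3 * x + rng.normal(scale=1.0, size=n)          # continuous
y_binary = rng.binomial(1, 1 / (1 + np.exp(-(0.3 * x - 1.5))))   # yes/no
y_count = rng.poisson(np.exp(0.2 * x - 0.5))                     # counts

# Identity link + Gaussian variation reproduces ordinary least squares.
gaussian_fit = sm.GLM(y_gauss, X, family=sm.families.Gaussian()).fit()
# Logit link + Binomial variation gives the s-shaped logistic regression curve.
binomial_fit = sm.GLM(y_binary, X, family=sm.families.Binomial()).fit()
# Log link + Poisson variation is the usual model for counts and rates.
poisson_fit = sm.GLM(y_count, X, family=sm.families.Poisson()).fit()

for name, fit in [("Gaussian/identity", gaussian_fit),
                  ("Binomial/logit", binomial_fit),
                  ("Poisson/log", poisson_fit)]:
    print(name, "slope =", round(fit.params[1], 3))
```

The only point of the sketch is that swapping the random part (the family) or the bending of the systematic part (the link) changes the model without changing the basic regression machinery.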
Purpose 1: To Smooth Out, to See the Forest From the Trees

How might one investigate whether and in what way breast cancer incidence rates have changed over time, using the available incidence data from 1935 to 1980 collected by the Connecticut tumor registry? This is an example of a single target variable, binary in nature (cancer or not), and the influence of two "stimulus" variables, age and year of birth. Suppose we know the numbers of cancers in each of nine five-year periods from 1935 to 1980 for each of 12 five-year age groups, along with numbers at risk in each of these 9 X 12 = 108 "cells."

As a first step, one could plot the 108 observed age-specific incidence rates against age and use lines of different colors to connect together the data points to form age-specific incidence curves for the different birth cohorts. Some of these plots, derived from the data published in Reference (60), are given in Figure 1 (left); they show that although there seem to be cohort effects, it is difficult to measure them very precisely from these "raw" data points.

Most would believe that the jagged pattern of straight-line segments has no special meaning, and would think of it only as noise that is obscuring the "real" underlying pattern. They would prefer instead a series of "smoother" incidence plots, one for each birth cohort. These systematic "curves" could be produced by smoothing each one by eye, but doing so would ignore two considerations: first, the rates are calculated from numerators and denominators of varying stability (something the eye looking at a data point cannot see) and, second, if rates vary smoothly across age, they probably also do so across cohorts. Thus, one would need to smooth in two directions at once. This could be done by postulating a single "parent" plot, consisting of 12 points (left unsmoothed to begin with) and specifying that the plots for the separate cohorts are to be obtained by multiplying the parent plots by separate proportionality factors. Admittedly, the task is too complicated to perform manually, but that is hardly an obstacle.

This "model-fitting" serves a number of purposes.

1. It produces more realistic plots, and uses many fewer numbers or "parameters" to do so (for the entire dataset, there would be 20 cohort parameters and 12 age parameters).

2. It draws the eye away from the randomness (which should be binomial or Poisson around each fitted point) and toward the pattern, in the same way that an image becomes clearer the further away one stands from its rough grain. The raw plots generated from the earliest and latest cohorts are based on fewer data points (age groups) and are the most difficult to judge, whereas the corresponding synthetic plots are generated from parameters that were estimated from the entire data set. This concept of borrowing strength from neighboring data points is a central one in multivariate analysis.

To some, the idea that it takes 20 + 12 = 32 numbers to describe 20 plots is still unappealing. Surely, they might argue, the parent plot (12 parameters) is not in reality so complicated that it could not be described by a truly smooth, two or three parameter curve or possibly by separate curve segments for pre- and post-menopause. Likewise, they would consider it quite likely that the 20 proportionality factors by which this incidence curve changes from cohort to cohort themselves form a smoothly changing series that could be described by many fewer parameters.
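In modern terms, the multiplicative "parent curve times cohort factor" model described above is a Poisson regression with a log link: the logarithm of each cell's expected count is the sum of an age effect, a cohort effect, and the logarithm of the person-years at risk. A minimal sketch on a small, entirely hypothetical table (much smaller than, and not taken from, the Connecticut registry data) might look like this:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical incidence data: 4 age groups x 3 birth cohorts,
# with case counts and person-years at risk in each cell.
ages = ["40-44", "45-49", "50-54", "55-59"]
cohorts = ["1900", "1910", "1920"]
cells = [(a, c) for c in cohorts for a in ages]
cases = np.array([12, 20, 30, 38,   15, 26, 40, 49,   21, 34, 52, 60])
pyears = np.array([40e3, 38e3, 36e3, 33e3,
                   42e3, 40e3, 37e3, 35e3,
                   45e3, 43e3, 40e3, 38e3])

df = pd.DataFrame(cells, columns=["age", "cohort"])
# One indicator column per age group and per cohort (dropping one of each as the
# reference), so this fit has (4 - 1) age + (3 - 1) cohort parameters + 1 intercept.
X = pd.get_dummies(df, columns=["age", "cohort"], drop_first=True).astype(float)
X = sm.add_constant(X)

# log(expected cases) = log(person-years) + age effect + cohort effect
fit = sm.GLM(cases, X, family=sm.families.Poisson(), offset=np.log(pyears)).fit()

# Exponentiated cohort coefficients are the proportionality factors by which the
# "parent" age curve is multiplied for each later cohort.
print(np.round(np.exp(fit.params), 2))
print("residual deviance:", round(fit.deviance, 1), "on", int(fit.df_resid), "df")
```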
Others would argue that one should "leave well enough alone" and that any further smoothing or modeling might do more harm than good. In this example, with the relatively large amount of data, the additional reduction might indeed be unnecessary; however, had the data been scarcer, it is likely that the further smoothing would have been required.

There are two more serious objections to the approach just described. First, for any one cohort, the entire parent curve is multiplied through by the same value. This does not allow for cohort effects that are age-specific, e.g. changes in the age at which women in different cohorts completed their first full-term pregnancy might affect the risk of premenopausal breast cancer differently than they would the risk of postmenopausal cancer. This is an example of what statisticians call an interaction: an effect of one factor (age) that is not constant across different values or levels of another (year of birth). Second, the actual goodness of fit of the smoothed curves to the raw data points needs to be evaluated. Before it is, any other expected or suspected patterns can be built into the fitted curves (provided that there are not so many assumptions and exceptions that one ends up with almost as many parameters as data points) and their "fit" tested by examining whether in fact the fitted curves come closer to the raw data points than before, and whether the discrepancies (residuals) are more or less haphazard and unexplainable. See (51) for a nice account of the use of regression models in studying regional variations in cardiovascular mortality.

As already mentioned, the assumption of smoothness and of orderly patterns of change is a central one in multivariate analysis. It stems from the belief (or maybe just the hope) that nature is basically straightforward, and that if there are no good biologic or other reasons to the contrary, relationships tend to be linear rather than quadratic, quadratic rather than cubic, etc. [For a description of this principle of "Occam's Razor," see Ref. (54).] In the breast cancer example just described, however, the changes in some possible risk factors have been "man-made" and more sudden, e.g. world wars, shifts in childbearing habits, oral contraceptives, etc, and it may indeed be some sudden changes in incidence (as it was with liver cancer) that alert us to newly introduced causative (or protective) agents.

Purpose 2: To Make Comparisons Fair

The majority of analytic studies involving humans are of an observational, rather than experimental, nature. As a result, when one compares responses of one group with those of another, the fundamental scientific principle of holding all other factors constant or equal may be violated. Consequently, differences (or nondifferences) in responses may be caused by differences (imbalances) in factors that cannot be controlled experimentally, rather than by the basic variable (groups) under study. Such variables, referred to as "confounding," "disturbing," or "extraneous" by various authors, can, if ignored, have insidious effects. For example, male and female applicants had similar acceptance rates in each of the various faculties at Berkeley, yet the crude overall (schoolwide) acceptance rate for females was considerably lower (9) because females were more likely to apply to those faculties for which the acceptance rates were lower. This artifact is referred to as Simpson's Paradox, and is always a possibility in observational studies.
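The reversal is easy to reproduce with made-up numbers (illustrative only, not the actual Berkeley figures): acceptance rates are identical for men and women within each faculty, yet the crude rates differ sharply because the sexes apply to the faculties in different proportions.

```python
# Hypothetical admissions data: (applicants, accepted) by faculty and sex.
data = {
    "Faculty A": {"men": (800, 640), "women": (200, 160)},   # 80% accepted for both sexes
    "Faculty B": {"men": (200, 40),  "women": (800, 160)},   # 20% accepted for both sexes
}

# Crude (schoolwide) acceptance rates: 68% for men versus 32% for women.
for sex in ("men", "women"):
    applied = sum(data[f][sex][0] for f in data)
    accepted = sum(data[f][sex][1] for f in data)
    print(f"crude acceptance rate, {sex}: {accepted / applied:.0%}")

# Faculty-specific rates: identical for the two sexes within each faculty.
for f in data:
    for sex in ("men", "women"):
        applied, accepted = data[f][sex]
        print(f"{f}, {sex}: {accepted / applied:.0%}")
```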
Although standardization for imbalances (e.g. in age or sex), used to put comparisons of rates on a fair footing, is one of the oldest epidemiologic tools, it is sometimes ignored. A particularly distressing example is the recent controversy in the US and Britain regarding possible cancer-causing effects of water fluoridation, based on findings that cancer rates had increased more in cities that had been fluoridated than in those that had not. As subsequent articles pointed out, these effects disappear if differences in the demographic structure of the two groups of cities are taken into account. [See Refs. (19, 20) for some recent British investigations and a guide to the earlier US studies.] One of the benefits (didactically speaking) was the helpful illustration of two methods of standardization (41).

Standardization was also used recently in a slightly different context (31). It showed that, although the crude infant mortality rate is much higher in Massachusetts than in Sweden, if infant mortality rates in the two areas were standardized for birthweight, Massachusetts would actually have a slightly lower one. The point of the analysis was not to explain away or hide the differences in mortality rates, but rather to show that it is an advantage in birthweight, and not the superiority of Swedish hospital care, that gives Swedish infants a survival advantage. Although the country of birth seems as if it is the main study variable and birthweight simply a "nuisance factor," in reality it is birthweight that matters and country that does not. Luckily, as the accompanying editorial pointed out, of the two variables, birthweight (and through it, presumably the infant mortality rate) is the modifiable one.

To many, the term multivariate analysis has come to mean a statistical model that uses regression-type equations and distributional assumptions to link observed values of a response variable to values of various explanatory variables. Up to this point, the discussion in this section has centered around yes/no responses and explanatory variables that were either naturally discrete (sex, race, country, faculty) or forced to be discrete (age group, birthweight group). These types of data lend themselves to such straightforward tabulation and computation of standardized rates (a technique known as a stratified analysis) that one might rightly ask what is "multivariate" about the method other than the fact that it involves three or more variables.

The answer is that by averaging results over a number of cells (strata), analysis techniques such as that of Mantel-Haenszel (used to combine data from several 2 X 2 tables into a single summary) do, at least implicitly, assume that all tables are measuring a common odds ratio. If the underlying odds ratios are not the same in each table, then the single odds ratio produced by the Mantel-Haenszel technique measures a weighted average of these separate ratios, and since the weighting is related to the relative sizes of the separate tables, the average will be somewhat arbitrary. The same is true of rates that are computed with reference to some standard population: they depend on the assumed mix of categories in the model population. This emphasizes a central issue in all multivariate analyses: One cannot adjust or standardize a comparison without making certain assumptions.
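Here is a minimal sketch of the Mantel-Haenszel summary just described, using hypothetical 2 x 2 tables for two strata: the pooled estimate is a weighted average of the stratum odds ratios, and it is a sensible summary only under the common-odds-ratio assumption discussed above.

```python
import numpy as np

# Each stratum is a 2 x 2 table [[a, b], [c, d]]:
# rows = exposed / unexposed, columns = cases / non-cases. Hypothetical counts;
# the strata could be two age groups, the second with more exposure and more risk.
strata = [
    np.array([[50, 50], [25, 75]]),      # stratum 1: OR = (50*75)/(50*25) = 3.0
    np.array([[150, 50], [50, 50]]),     # stratum 2: OR = (150*50)/(50*50) = 3.0
]

def mantel_haenszel_or(tables):
    """Pooled odds ratio sum(a*d/n) / sum(b*c/n), assuming a common OR across strata."""
    num = sum(t[0, 0] * t[1, 1] / t.sum() for t in tables)
    den = sum(t[0, 1] * t[1, 0] / t.sum() for t in tables)
    return num / den

# Collapsing the strata ignores the stratifier, which here is associated with
# both exposure and risk, so the crude ratio is pulled away from the common value.
crude = sum(strata)
crude_or = (crude[0, 0] * crude[1, 1]) / (crude[0, 1] * crude[1, 0])

print("stratum odds ratios:",
      [round((t[0, 0] * t[1, 1]) / (t[0, 1] * t[1, 0]), 2) for t in strata])
print("crude (collapsed) odds ratio:", round(crude_or, 2))
print("Mantel-Haenszel pooled odds ratio:", round(mantel_haenszel_or(strata), 2))
```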
Probably the best way to view statistical models is as "a series of approximations to the truth": one can realize that the assumptions (model) used to adjust a comparison may not be entirely correct but proceed as best one can, or one can forego any adjustment because one did not realize the need or was afraid to make assumptions. It is a choice between the results being approximately correct and being precisely wrong!

To end this section, I discuss briefly situations in which the response variable is continuous rather than discrete (I shall discuss more complicated methods for standardizing rates, below), and address issues of matching and of adjustment by regression.

In some experimental studies, it is possible to compare responses to two or more maneuvers applied to the same individual. The advantage of having each subject serve as his own control is obvious: the comparison is immediately fair with respect to an infinity of variables that could otherwise theoretically bias it. When this is not possible, the next best thing, using balancing or randomization (or both), to equalize the two groups receiving the different maneuvers, is often difficult. This is especially true if the numbers in the two groups are so small that it is impossible to balance them adequately, or if the study is an observational one and the groups have already been formed.

For example, in a recent study (42) comparing the ventilatory function, as measured by forced expiratory volume (FEV), of workers who had worked in a vanadium factory for at least four months with that of an unexposed reference group, investigators matched the subjects for two variables known to influence lung function: age (to within two years) and cigarette smoking (to within five cigarettes daily). However, since the two groups differed by an average of 3.4 cm in height, a variable with a very strong relationship to FEV, some standardization or adjustment was required. The authors achieved this using the finding of Cole (17) that past age 20, the predicted FEV for a man of a certain age and height is approximately of the form

FEV = height^2 x (a + b x age)

Both members of each matched pair were already concordant for age and smoking; thus, if one simply divided each man's recorded FEV by his squared height, the resulting paired values could be taken as FEV's that were adjusted for one member being taller or shorter than the other. Since the effect was as though the pairs had been also matched for height, the comparison was carried out using a straightforward paired t-test on the differences in the pairs of adjusted FEV's.

Although the task will often be more difficult than in this elegant example, the principle generally remains the same: one calculates what each subject's response would be expected to be if all of the variables that distort or bias the comparison were held equal, say at the mean of each covariable. The term analysis of covariance (3, 4) has generally been applied to adjustments of a simple additive nature, but as we have just seen, if some other relationship more appropriately and more accurately describes the way in which the covariate(s) affect the response, and if it is easy to derive, it is certainly preferable. Usually this relationship between response and confounders is estimated "internally" from the data at hand, unless the study is small and some outside norms (e.g. weight and height charts, dental maturity curves) are deemed better.
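A minimal sketch of the adjustment just described, with hypothetical measurements rather than the data of the vanadium study: each man's FEV is divided by the square of his height, and the matched pairs are then compared with an ordinary paired t-test.

```python
import numpy as np
from scipy import stats

# Hypothetical matched pairs (exposed worker vs unexposed referent), already
# concordant for age and smoking. FEV in litres, height in metres.
fev_exposed  = np.array([3.1, 3.6, 2.9, 3.8, 3.3, 4.0])
ht_exposed   = np.array([1.74, 1.80, 1.69, 1.83, 1.76, 1.85])
fev_referent = np.array([3.4, 3.7, 3.2, 4.1, 3.5, 4.2])
ht_referent  = np.array([1.70, 1.77, 1.66, 1.79, 1.73, 1.81])

# Following Cole's finding that FEV is roughly proportional to height^2 at a
# given age, dividing FEV by height^2 puts taller and shorter pair members on
# the same footing.
adj_exposed = fev_exposed / ht_exposed**2
adj_referent = fev_referent / ht_referent**2

res = stats.ttest_rel(adj_exposed, adj_referent)
print("mean within-pair difference (adjusted):",
      round(np.mean(adj_exposed - adj_referent), 3))
print("paired t =", round(res.statistic, 2), " p =", round(res.pvalue, 3))
```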
Researchers generally feel safer using internal standardization; by doing so, they avoid problems of different measurement techniques, inappropriate reference samples, etc. In the vanadium study just cited, one could actually test Cole's FEV formula internally in the group of nonexposed workers. If the study did not have a pure unexposed group, and relied instead on the within-group variation in the amount of exposure, one would probably treat the exposure more as a continuous variable and use a multiple regression approach.
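A minimal sketch of that multiple regression approach, once more with hypothetical data and variable names: FEV is regressed on a continuous exposure index together with the covariables, so that the exposure coefficient estimates the exposure effect with age, height, and smoking held constant.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 80

# Hypothetical worker data: a continuous exposure index plus the usual covariables.
df = pd.DataFrame({
    "age": rng.uniform(25, 60, n),
    "height": rng.normal(1.75, 0.07, n),          # metres
    "cigs_per_day": rng.integers(0, 30, n),
    "exposure_years": rng.uniform(0, 15, n),       # within-group exposure variation
})
# Simulated response, roughly consistent with FEV ~ height^2 * (a + b*age), plus
# hypothetical smoking and exposure effects; for illustration only.
df["fev"] = (df["height"]**2 * (2.2 - 0.01 * df["age"])
             - 0.01 * df["cigs_per_day"] - 0.02 * df["exposure_years"]
             + rng.normal(scale=0.25, size=n))

X = sm.add_constant(df[["exposure_years", "age", "height", "cigs_per_day"]])
fit = sm.OLS(df["fev"], X).fit()

# The exposure_years coefficient is the estimated change in FEV per year of
# exposure, holding age, height, and smoking constant.
print(fit.params.round(3))
print(fit.conf_int().round(3))
```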