SELECTED MULTIVARIATE TECHNIQUES

In this section I discuss a number of multivariate techniques for analyzing discrete responses, techniques that have become popular in the last ten years.

Discriminant Analysis

Discriminant Analysis (4, 47) began as a method of predicting to which of several categories an individual belonged, using several pieces of information collected about him and similar information collected about past individuals known to belong to the various categories. It has come to have three main uses (see Table 1): (a) as a way of carrying out a multivariate t-test comparing two or more samples on several continuous-type responses simultaneously and as a means of controlling the false-positive results associated with separate analyses (33); (b) more in its original spirit, in screening, diagnosis and prognosis (32, 64); (c) as a form of multiple regression for categorical responses (43).

If Discriminant Analysis is used in the second way, to simply construct a one-dimensional score from many variables, and if the scores one obtains are used as though they were the result of a single test (25), few distributional assumptions are needed regarding either the discriminating variables (indicants) or the resulting scores. Further, if one has sufficient numbers of proven cases, one can use the empirical distributions of scores to construct score-specific predictions (25, 53). With fewer cases, one will need to fit some distribution to either the scores or the discriminating variables. The third use, to adjust for disturbing variables before comparing proportions, or to study the effects of several variables on the probability of a certain yes/no outcome, is best discussed in the context of multiple logistic regression.

Multiple Logistic Regression

Logit and probit curves (21) have been used for several years to study a binary response to a single stimulus variable. However, it was only in the early 1970s, after the publication of three signal articles (2, 62, 65) and a comprehensive monograph (21), that the "logistic model" began to be used for studying multiple stimuli. It was not until the 1980s that the technique was integrated into biostatistics textbooks (3) and took its place as the primary method for analyzing the relationship between a binary response and several discrete or continuous stimulus variables. It now stands in the same relation to binary response data as classical regression does to continuous response data.

To these descriptions of the "logic" of logistic regression, I add one point dealing with its historical evolution. If one works with the odds (rather than the probability) of a yes/no event in relation to a series of explanatory variables X1, X2, ..., the logistic model implies that the logarithm of this odds can be written as

    log(odds of yes/no) = B0 + B1 X1 + B2 X2 + ...

If one thinks of the right-hand side of the equation as a score S, then it will have different distributions in the "yes" and "no" groups, just as in a discriminant analysis. The first justification for the multiple logistic model was that if the Xs in the "yes" and "no" populations follow two multivariate normal distributions, then the Ss will have univariate normal distributions. Then, if these two univariate normal distributions have equal variances, one obtains the logistic curve (62). It is still not well recognized that although these conditions are indeed sufficient to produce the logistic relationship, they are not necessary.
First, one does not need multivariate normal Xs in order for the Ss to be approximately normal; if there are sufficiently many of them to add together, if they are reasonably uncorrelated, and if they do not have highly skewed distributions, the central limit theorem guarantees distributions of Ss that are close to normal. Second, one does not even need the Ss to have normal distributions: several other pairs of distributions of scores will also generate the logistic relationship. The interested reader can verify this for himself, using as an example the data in Table 1 of Reference (14), which shows two Poisson-like distributions with the score (number of symptoms) averaging 0.5 per individual in the "no" group and 2.7 in the "yes" group; a short numerical sketch of this point is given below. The important point is that even though logistic regression is now regarded as simply a convenient functional form for linking probabilities to explanatory variables, it does have some historical and statistical basis.

Epidemiologic studies, and their use of risk ratios (also called relative risks) to report comparisons from prospective (cohort) studies, have done much to popularize logistic regression (indeed, one could say that the technique began with the Framingham Study). Studies involving a binary response and multiple stimuli do not need to force the stimulus variables into the discrete categories required for a Mantel-Haenszel analysis but can use all the information in every variable: the coefficient for the main exposure of interest leads immediately to the odds ratio and the relative risk. In one recent study (34), the results were also presented as observed and expected numbers of cases, in much the same spirit as is done for comparisons of mortality rates. Logistic regression has also become quite popular for analyzing case-control studies, as a result of some very significant insights into the logical connections with corresponding methods for cohort studies (12, 13, 56). Furthermore, as computing becomes cheaper, it will probably largely replace the traditional two-group linear discriminant analysis.

It is a little more difficult to know how useful logistic regression will become for multicategory responses ("polytomous logistic regression"), since there are several ways one might contrast the categories (29). Recent work, performed in the context of trying to place patients into one of several diagnostic categories on the basis of a number of binary indicants (symptoms, findings, test results, etc.), suggests that some of these methods are at least feasible (A. Wijesinha, unpublished information).

The arguments of Dawid (23) add further theoretical justification for choosing a more robust prospective model, such as logistic regression, over a retrospective one, such as discriminant analysis. By "prospective" Dawid means predicting responses from the given indicants, and by "retrospective" he means predicting the distribution of indicants from knowledge of the response. In spite of these theoretical advantages, however, direct comparisons of various discrimination techniques have not always shown a definitive advantage for logistic regression (30, 61). As Fienberg (29) points out, however, there is a difference between using these competing methods for discrimination (where it is the overlapping part of the score distribution that contributes to misclassification rates) and using them to make accurate probability predictions or adjustments across the entire probability scale.
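Before leaving the logistic model, it may help to make the earlier point about score distributions concrete. The following sketch (mine, not the article's; the assumption of equal-sized "yes" and "no" groups is made purely for illustration) computes the log-odds of a "yes" at each value of a Poisson-distributed score, using the group means of 0.5 and 2.7 from the example in Reference (14), and shows that the relationship is exactly logistic even though neither score distribution is normal.

```python
# A minimal sketch (not from the article) of the point that normal score
# distributions are not needed for the logistic relationship to hold.
# Assumptions (illustrative only): the score S is Poisson with mean 0.5 in
# the "no" group and 2.7 in the "yes" group, as in Reference (14), and the
# two groups are of equal size (prior odds of 1).

import math

def poisson_pmf(s, lam):
    """P(S = s) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** s / math.factorial(s)

LAM_NO, LAM_YES = 0.5, 2.7   # group-specific mean scores (assumed)
PRIOR_ODDS = 1.0             # P(yes) / P(no); equal group sizes assumed

print(" s   log-odds(yes | S = s)")
for s in range(9):
    # Bayes' theorem: posterior odds = prior odds * likelihood ratio
    odds = PRIOR_ODDS * poisson_pmf(s, LAM_YES) / poisson_pmf(s, LAM_NO)
    print(f"{s:2d}   {math.log(odds):8.3f}")

# The printed log-odds rise by exactly log(2.7 / 0.5) = 1.686 per unit of s,
# i.e. the relationship is linear on the log-odds (logistic) scale even
# though neither score distribution is remotely normal.
```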
The fact that discriminant analysis can hold its own in the task for which it was first designed is no guarantee that it will be equally good for other purposes. Nevertheless, since it is inexpensive, it will probably continue to be used to screen for possible influential confounding variables before undertaking a logistic regression. A disadvantage of logistic regression is that results are often presented as odds or log-odds, or worse still, as unitless coefficients, rather than using the more familiar probabilities. To aid with these nonlinear concepts, it is often appropriate to translate the log-odds back into the more familiar probability scale. Recent articles that used graphical methods (36, 38) or expected numbers of events (34) to describe the fitted models have been especially helpful in this regard.

Log-Linear Models for Multiway Tables

If the stimulus variables can all be considered categorical, binary response data can also be assembled into multiway contingency tables and analyzed using multiplicative models (the same kind used to compute an expected cell entry in the simple 2 x 2 table), which become additive when transformed to a log scale. The logic behind these models, and how they are fitted (almost always by computer iteration), is well described in recent textbooks (3, 10, 29). The attractiveness of log-linear models for multiway tables lies in their parallels with classical analysis of variance models, in their use as a way of standardizing comparisons of rates in complex data sets, and in the ease with which interactions and confounding variables can be identified. They agree with logistic models if one fits as many parameters as there are cells.

The fits to the breast cancer incidence data discussed above are examples of a log-linear approach: the simplest curves involved points that were products of an average age-specific curve and different proportionality factors for the different cohorts. The best-fitting parameters (32 in the first "model" considered) could be fitted by a variety of techniques, such as logistic regression of the 108 numerators and denominators on 32 dummy variables, or a 20 by 12 by 2 contingency table analysis (with a number of cells missing, because some cohorts were too young or because cancers occurring early in life in the earliest cohorts were not in the registry). Some drawbacks to analyzing a binary response by a contingency-table (general log-linear), rather than a regression, approach include the fact that it tends to treat the response variable the same way as the stimulus variables, that it worries about reproducing the interrelationships among the stimulus variables, and that variables that are not categorical have to be made so.

Regression Methods for Life-Table Analysis

Although the life-table (used in the broad sense for techniques that analyze the time until events happen) has long been an essential epidemiologic tool, it is only in the last decade that it has been adapted into a multivariate method (22, 46). Like most of the other methods described in this section, it is log-linear, with the log of the time-specific "mortality" rate (hazard) linked to the "average" hazard and to the explanatory variables through a linear regression. The main differences from logistic regression are that the "average" hazard is not a single quantity but a function of time, and that it is estimated nonparametrically. In the simplest case, the relationship between the hazard and the explanatory variables is assumed to remain constant over time.
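A minimal sketch (mine, with made-up numbers, not taken from the article) may help fix ideas: under this model the time-specific hazard for an individual is the "average" (baseline) hazard, which is free to change with time, multiplied by exp(B1 X1 + B2 X2 + ...), so that the ratio of the hazards for two covariate patterns stays the same at every point in time.

```python
# A minimal sketch (illustrative values only) of the proportional-hazards
# form described above: hazard(t | X) = baseline_hazard(t) * exp(B1*X1 + B2*X2).
# The baseline hazard and the coefficients below are assumptions, chosen
# only to illustrate the structure of the model.

import math

def baseline_hazard(t):
    """An arbitrary 'average' hazard that changes over time."""
    return 0.02 + 0.01 * math.sin(t / 2.0)

B1, B2 = 0.8, -0.5   # assumed regression coefficients (log hazard ratios)

def hazard(t, x1, x2):
    """Time-specific hazard for an individual with covariates x1, x2."""
    return baseline_hazard(t) * math.exp(B1 * x1 + B2 * x2)

# Compare an 'exposed' pattern (x1=1, x2=0) with an 'unexposed' one (x1=0, x2=0):
print(" t    hazard ratio (exposed / unexposed)")
for t in [1, 5, 10, 20]:
    ratio = hazard(t, 1, 0) / hazard(t, 0, 0)
    print(f"{t:2d}    {ratio:.3f}")   # always exp(B1) = 2.226, whatever the value of t

# In the simplest (proportional-hazards) case this ratio is constant over time;
# allowing it to change with time is exactly the relaxation discussed next.
```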
This constancy does not always seem to hold (45, 55), and statistical tests based on this "proportional hazards" model (58) can then be quite misleading. Fortunately, some work has emerged (44, 57), and more is under way, to produce diagnostic tests for checking the appropriateness of the assumed model and for suggesting when the effects of variables should be allowed to vary over time.

POSSIBLE PITFALLS IN MULTIVARIATE ANALYSIS

This section deals with potential risks in the use of multivariate analyses. I do not discuss the risks of specific techniques, details of which will be found in the appropriate textbooks, but rather the issues that cut across techniques and that arise simply because data are multivariate. Indeed, the main message is that the more multivariate the data, the greater the opportunities for problems.

Adding Noise

Although I stress above that including other variables in the analysis of a comparative study can sharpen a comparison, it can also dull it, especially if the user allows a stepwise regression to decide which of many other variables are important. The gain or loss in precision will depend on how strongly these other variables influence the response being studied. For example, including the last digit of each individual's telephone number in a multiple regression will waste one degree of freedom, or the equivalent of one individual. Worse still, if the average value of this variable is not equal in the groups being compared (and in any one study with small groups, it almost certainly will not be), any "adjustments" to the responses on the basis of this variable will actually add unwanted variation. Although users try to guard against such occurrences by first testing whether the slope of the observed relationship is real rather than random, they often use a lax criterion (e.g. a p-value less than, say, 0.20). This, together with the often large numbers of "possibly explanatory" variables "offered" to a regression, adds to the chances of decreasing rather than increasing the precision of a comparison. One way to avoid this artifact of chance is first to split one's data set into two or more smaller sets and to retain only those variables that are influential in each subset.

Overoptimism Regarding Future Performance

The performance of discriminant functions or prediction equations constructed from a data set is often judged by "resimulation," that is, by seeing how well the system "would have done" if it had been used to classify the individuals in the data set. The results are generally overoptimistic, for two reasons. First, because the weights were chosen on the very basis of doing well in this data set, they may well have "chased," or been fooled by, any data patterns that were peculiar to that data set. The random variation in a new data set is unlikely to match the random peculiarities of the "training" data set; as a result, a system that knows only a finite sample, but thinks of it as a universe, will be surprised a little more (16). Second, if one has enough candidate predictors to choose from, one is bound to find some coincidences; similarly, if one builds an equation with enough variables, one will also get an irreproducibly good fit. There are a number of techniques for obtaining less optimistically biased estimates of future misclassification rates without actually doing a prospective test (27). However, they do not apply to the second bias mentioned above; in this latter situation, one needs to evaluate the system on a separate data set.
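The following small simulation (mine, not from the article; the sample size, the number of candidate predictors, and the least-squares scoring are all assumptions chosen purely for illustration) shows this overoptimism in its purest form: even when every candidate predictor is pure noise, a score fitted to one data set appears to classify that same data set well above chance, yet does no better than chance on a separate data set generated by the identical process.

```python
# A minimal simulation (mine, not the article's) of the overoptimism that
# resimulation produces: with many candidate predictors that are pure noise,
# a linear score fitted by least squares "classifies" its own training data
# well above chance, but does no better than chance on a fresh data set.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30                      # assumed sample size and number of noise predictors

def make_data():
    X = rng.normal(size=(n, p))     # predictors unrelated to the outcome
    y = rng.integers(0, 2, size=n)  # binary outcome, also pure noise
    return X, y

def accuracy(weights, X, y):
    score = X @ weights
    predicted = (score > np.median(score)).astype(int)
    return (predicted == y).mean()

X_train, y_train = make_data()
# Least-squares weights for a linear, discriminant-style score
weights, *_ = np.linalg.lstsq(X_train, y_train - 0.5, rcond=None)

X_test, y_test = make_data()        # a separate data set from the same (null) process
print("apparent (resimulation) accuracy:", round(accuracy(weights, X_train, y_train), 2))
print("accuracy on a separate data set :", round(accuracy(weights, X_test, y_test), 2))
# Typical output: roughly 0.7-0.8 apparent accuracy versus roughly 0.5 on new data,
# even though nothing real is being predicted.
```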
A number of studies that claimed high prediction accuracy solely on the basis of resimulation have since "regressed toward the mean" (8, 24, 49, 63). Others have recognized this danger and have included the validation as an integral part of the task (53); one has even subjected the prediction system, which incidentally was constructed by logistic regression, to a comparative trial (52). There has been speculation that there is some "natural law" by which, no matter how many variables are available for prediction, only four or five will finally remain in any stepwise regression (18). This claim would need to be examined more carefully, especially with regard to the influence of typical sample sizes. It does emphasize one point, namely that prediction of binary outcomes is a difficult task, given the considerable nonreducible uncertainty inherent in an all-or-nothing event. A method of measuring the attainable discrimination in a data set, and of deciding whether the search for predictors might be worth the effort, is given in (32).

One Model for All

A recent example points up a serious inadequacy in a common approach to statistical predictions. The study asked whether two different types of gallstones could be distinguished on the basis of the features seen in a radiograph (26). Univariate analyses revealed that, regardless of any other features, stones that appeared to be buoyant were invariably of one type; those that were not buoyant were sometimes of one type, sometimes the other. In spite of this, buoyancy ranked only third in the linear discriminant analysis that tried to predict the variation in types. This is clearly a situation in which buoyant cases could have been classified immediately, removed from the data set, and discriminant analysis applied to the remaining cases. The unconditional "one model for all" approach is simplistic and possibly even misleading. Technically, the discriminant model could be made conditional through the use of interaction terms, provided one could anticipate which ones to include. An alternative, and more natural, approach, which first partitions subjects on the most important variable, then partitions each of these subgroups separately, and so on in a branching fashion, is provided by recursive partitioning (also called Automatic Interaction Detection), a recent nonparametric classification system for use with larger data sets (25, 37). For smaller ones, the "kernel method" (1) seems to hold some promise.

Explaining Away a Difference

In the dental caries survey mentioned above, one would probably collect information on the frequency of visits to a dentist, and one might be tempted to take this variable into account in a multiple regression when studying the effects of other risk factors on caries. If more caries result in more visits, then including the number of visits as an "explanatory" variable will lessen the observed impact of the other (real) risk factors: it will be one of the first variables to enter the regression equation and will thus "explain away" whatever variance might have been more appropriately accounted for by the risk factors being studied. Similar misinterpretations can arise if one includes as an explanatory variable one that is intermediate in the stimulus-response chain, as for example if one allowed for the amounts of medication given in a study comparing the lengths of stay following an operation performed in two different ways.
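A small simulation (mine, not from the article; the variable names, effect sizes, and linear model are assumptions chosen only to illustrate the mechanism) shows how sharply a genuine effect can be attenuated when a consequence of the response is offered to the regression as an "explanatory" variable.

```python
# A minimal simulation (mine, not from the article) of how including a variable
# that is a consequence of the response can "explain away" a real risk factor.
# All variable names and effect sizes here are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(1)
n = 2000

risk_factor = rng.normal(size=n)                  # the exposure of real interest
caries = 2.0 * risk_factor + rng.normal(size=n)   # true effect of 2.0 on the response
visits = 1.5 * caries + rng.normal(size=n)        # dental visits driven by the caries themselves

def fitted_coefficients(*columns):
    """Ordinary least-squares coefficients for caries (after an intercept column)."""
    X = np.column_stack([np.ones(n), *columns])
    coef, *_ = np.linalg.lstsq(X, caries, rcond=None)
    return coef[1:]

print("risk factor alone        :", fitted_coefficients(risk_factor).round(2))
print("risk factor plus 'visits':", fitted_coefficients(risk_factor, visits).round(2))
# The first fit recovers the true coefficient (about 2.0); in the second,
# 'visits' absorbs most of the variance and the risk factor's coefficient
# shrinks from about 2.0 to roughly 0.6, even though visits play no causal
# role in producing caries.
```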
Although it probably draws the correct conclusion, a recent study (39) shows just how easy it is to adjust away a difference, especially if other factors are not held constant. The authors state that the "data are in agreement with the hypothesis" that differences in weight, rather than in pO2 (the partial pressure of oxygen), explain most if not all of the observed differences in blood pressure between children of the same age living at different altitudes. What is alarming is that the data might also be in agreement with a similarly worded hypothesis stated in terms of family income, education, or any other variable that may be associated in a noncausal way with blood pressure, and on which high-altitude children score lower than the comparison group.

CONCLUDING REMARKS

Investigation in the health sciences will continue to be of a multivariate nature. The statistical tools for dealing with the data generated by these studies are now largely in place; the challenge and the obligation will be to use them prudently (7, 59). Even though a number of lines of enquiry have become decidedly more complex in the past few decades (witness, for example, the current thinking on cholesterol and heart disease), by and large, questions still tend to be posed one dimension at a time. The same remains true in multivariate analysis, where even though the computations may sound high-dimensional, the statistical tests are univariate in spirit.

ACKNOWLEDGMENTS

I would like to thank my colleagues for their help with this article.

LITERATURE CITED

1. Aitchison, J., Aitken, C. G. G. 1976. Multivariate binary discrimination by the kernel method. Biometrika 63:413-20
2. Anderson, J. A. 1972. Separate sample logistic discrimination. Biometrika 59:19-35
3. Anderson, S., Auquier, A., Hauck, W. W., Oakes, D., Vandaele, W., Weisberg, H. I. 1980. Statistical Methods for Comparative Studies. New York: Wiley. 289 pp.
4. Armitage, P. 1971. Statistical Methods in Medical Research. Oxford/Edinburgh: Blackwell. 504 pp.
5. Armstrong, J. S. 1967. Derivation of theory by means of factor analysis or Tom Swift and his electric factor analysis machine. Am. Stat. 21:17-21
6. Baker, R. J., Nelder, J. A. 1978. Manual for the GLIM System of Generalized Linear Interactive Modeling. Oxford: Numerical Algorithms Group
7. Barrett-Connor, E. 1979. Infectious and chronic disease epidemiology: Separate and unequal? Am. J. Epidemiol. 109:245-49
8. Bell, R. S., Loop, J. W. 1971. The utility and futility of radiographic skull examination for trauma. N. Engl. J. Med. 284:236-39
9. Bickel, P. J., Hammel, E. A., O'Connell, J. W. 1975. Sex bias in graduate admissions: Data from Berkeley. Science 187:398-404
10. Bishop, Y. M. M., Fienberg, S. E., Holland, P. W. 1976. Discrete Multivariate Analysis: Theory and Practice. Cambridge: MIT Press. 557 pp.
11. Boag, P. T., Grant, P. R. 1981. Intense natural selection in a population of Darwin's finches (Geospizinae) in the Galapagos. Science 214:82-84
12. Breslow, N. E. 1982. Design and analysis of case-control studies. Ann. Rev. Public Health 3:29-54
13. Breslow, N. E., Day, N. E. 1980. Statistical Methods in Cancer Research I. The Analysis of Case-Control Studies. Lyon: Intl. Agency Res. Cancer. 338 pp.
14. Carpenter, R. G., Gardner, A., Pursall, E., McWeeny, P. M., Emery, J. L. 1979. Identification of some infants at immediate risk of dying unexpectedly and justifying intensive study. Lancet 2:343-46
15. Clark, D. W. 1981. A vocabulary for preventive and community medicine. In Preventive and Community Medicine, ed. D. W. Clark, B. MacMahon, pp. 3-15. Boston: Little, Brown. 794 pp. 2nd ed.
16. Cochran, W. G., Hopkins, C. E. 1961. Some classification problems with multivariate qualitative data. Biometrics 17:10-32
17. Cole, T. J. 1975. Linear and proportional regression models in the prediction of ventilatory function. J. R. Stat. Soc. A 138:297-337
18. Coles, L. S., Brown, B. W., Engelhard, C., Halpern, J., Fries, J. F. 1980. Determining the most valuable clinical variables: A stepwise multiple logistic regression program. Meth. Inform. Med. 19:42-49
19. Cook-Mozaffari, P., Bulusu, L., Doll, R. 1981. Fluoridation of water supplies and cancer mortality. I. A search for an effect in the UK on risk of death from cancer. J. Epidemiol. Community Health 35:227-32
20. Cook-Mozaffari, P., Doll, R. 1981. Fluoridation of water supplies and cancer mortality. II. Mortality trends after fluoridation. J. Epidemiol. Community Health 35:233-38
21. Cox, D. R. 1970. The Analysis of Binary Data. London: Methuen. 142 pp.
22. Cox, D. R. 1972. Regression models and life tables (with discussion). J. R. Stat. Soc. B 34:187-202
23. Dawid, A. P. 1976. Properties of diagnostic data distributions. Biometrics 32:647-58
24. DeSmet, A. A., Fryback, D. G., Thornbury, J. R. 1979. A second look at the utility of radiographic skull examination for trauma. Am. J. Roentgen. 132:95-99
25. Diehr, P., Wood, R. W., Barr, V., Wolcott, B., Slay, L., Tompkins, R. K. 1981. Acute headache: Presenting symptoms and diagnostic rules to identify patients with tension and migraine headache. J. Chron. Dis. 34:147-58
26. Dolgin, S. M., Schwartz, J. S., Kressel, H. Y., Soloway, R. D., Miller, W. T., Trotman, B., Soloway, A. S., Good, L. I. 1981. Identification of patients with cholesterol or pigment gallstones by discriminant analysis of radiologic features. N. Engl. J. Med. 304:808-11
27. Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Ann. Stat. 7:1-26
28. Feinstein, A. R. 1977. Clinical Biostatistics. St. Louis: Mosby. 468 pp.
29. Fienberg, S. E. 1980. The Analysis of Cross-Classified Categorical Data. Cambridge: MIT Press. 198 pp. 2nd ed.
30. Gardner, M. J., Barker, D. J. P. 1975. A case study in techniques of allocation. Biometrics 31:931-42
31. Guyer, B., Wallach, L. A., Rosen, S. L. 1982. Birth-weight-standardized neonatal mortality rates and the prevention of low birth weight: How does Massachusetts compare with Sweden? N. Engl. J. Med. 306:1230-33
32. Hanley, J. A., McNeil, B. J. 1982. Maximum attainable discrimination and the utilization of radiologic examinations. J. Chron. Dis. 35:601-11
33. Harris, R. J. 1975. A Primer of Multivariate Statistics. New York: Academic. 332 pp.
34. Heinonen, O. P., Slone, D., Monson, R. R., Hook, E. B., Shapiro, S. 1977. Cardiovascular birth defects and antenatal exposure to female sex hormones. N. Engl. J. Med. 296:67-70
35. Henry, R. C., Hidy, G. M. 1979. Multivariate analysis of particulate sulfate and other air quality variables by principal components. Pt. I. Annual data from Los Angeles and New York. Atmos. Environ. 13:1581-96
36. Higgins, M. W., Keller, J. B., Becher, M., Howatt, W., Landis, J. R., et al. 1982. An index of risk for obstructive airways disease. Am. Rev. Respir. Dis. 125:144-51
37. Hooton, T. M., Haley, R. W., Culver, D. H., Morgan, W. M. 1981. The joint associations of multiple risk factors with the occurrence of nosocomial infection. Am. J. Med. 70:960-70
38. Horning, S. J., Hoppe, R. T., Kaplan, H. S., Rosenberg, S. A. 1981. Female reproductive potential after treatment for Hodgkin's disease. N. Engl. J. Med. 304:1377-82
39. Jongbloed, L. S., Hofman, A. 1983. Altitude and blood pressure in children. J. Chron. Dis. In press
40. Kaplan, R. M., Bush, J. W., Berry, C. C. 1976. Health status: Types of validity and the index of well-being. Health Serv. Res. 478-505
41. Kinlen, L., Doll, R. 1981. Fluoridation of water supplies and cancer mortality. III. A re-examination of mortality in cities in the USA. J. Epidemiol. Community Health 35:239-44
42. Kiviluoto, M. 1980. Observations on the lungs of vanadium workers. Br. J. Indust. Med. 37:363-66
43. Kleinbaum, D. G., Kupper, L. L. 1978. Applied Regression Analysis and Other Multivariable Methods. North Scituate, Mass: Duxbury. 556 pp.