SELECTED MULTIVARIATE TECHNIQUES

In this section I discuss a number of multivariate techniques for analyzing discrete responses, techniques that have become popular in the last ten years.

Discriminant Analysis

Discriminant Analysis (4, 47) began as a method of predicting to which of several categories an individual belonged, using several pieces of information collected about him and similar information collected about past individuals known to belong to the various categories. It has come to have three main uses (see Table 1): (a) as a way of carrying out a multivariate t-test comparing two or more samples on several continuous-type responses simultaneously and as a means of controlling the false-positive results associated with separate analyses (33); (b) more in its original spirit, in screening, diagnosis and prognosis (32, 64); (c) as a form of multiple regression for categorical responses (43).

If Discriminant Analysis is used in the second way, to simply construct a one-dimensional score from many variables, and if the scores one obtains are used as though they were the result of a single test (25), few distributional assumptions are needed regarding either the discriminating variables (indicants) or the resulting scores. Further, if one has sufficient numbers of proven cases, one can use the empirical distributions of scores to construct score-specific predictions (25, 53). With fewer cases, one will need to fit some distribution to either the scores or the discriminating variables. The third use, to adjust for disturbing variables before comparing proportions, or to study the effects of several variables on the probability of a certain yes/no outcome, is best discussed in the context of multiple logistic regression.

Multiple Logistic Regression

Logit and probit curves (21) have been used for several years to study a binary response to a single stimulus variable. However, it was only in the early 1970s, after the publication of three signal articles (2, 62, 65) and a comprehensive monograph (21), that the "logistic model" began to be used for studying multiple stimuli. It was not until the 1980s that the technique was integrated into biostatistics textbooks (3) and took its place as the primary method for analyzing the relationship between a binary response and several discrete or continuous stimulus variables. It now stands in the same relation to binary response data as classical regression does to continuous response data.

To these descriptions of the "logic" of logistic regression, I add one point dealing with its historical evolution. If one works with the odds (rather than the probability) of a yes/no event in relation to a series of explanatory variables X1, X2, ..., the logistic model implies that the logarithm of this odds can be written as

    log(odds of yes/no) = B0 + B1 X1 + B2 X2 + ...

If one thinks of the right-hand side of the equation as a score S, then it will have different distributions in the "yes" and "no" groups, just as in a discriminant analysis. The first justification for the multiple logistic model was that if the Xs in the "yes" and "no" populations follow two multivariate normal distributions, then the Ss will have univariate normal distributions. Then, if these two univariate normal distributions have equal variances, one obtains the logistic curve (62). It is still not well recognized that although these conditions are indeed sufficient to produce the logistic relationship, they are not necessary.
First, one does not need multivariate normal Xs in order for the Ss to be approximately normal; if there are sufficiently many of them to add together, if they are reasonably uncorrelated, and if they do not have highly skewed distributions, the central limit theorem guarantees distributions of Ss that are close to normal. Second, one does not even need the Ss to have normal distributions: several other pairs of distributions of scores will also generate the logistic relationship. The interested reader can verify this for himself, using as an example the data in Table 1 of Reference (14), which shows two Poisson-like distributions with the score (number of symptoms) averaging 0.5 per individual in the "no" group and 2.7 in the "yes" group; a short numerical sketch of this point is given below. The important point is that even though logistic regression is now regarded as simply a convenient functional form for linking probabilities to explanatory variables, it does have some historical and statistical basis.

Epidemiologic studies, and their use of risk ratios (also called relative risks) to report comparisons from prospective (cohort) studies, have done much to popularize logistic regression (indeed, one could say that the technique began with the Framingham Study). Studies involving a binary response and multiple stimuli do not need to force the stimulus variables into the discrete categories required for a Mantel-Haenszel analysis but can use all the information in every variable: the coefficient for the main exposure of interest leads immediately to the odds ratio and the relative risk. In one recent study (34), the results were also presented as observed and expected numbers of cases, in much the same spirit as is done for comparisons of mortality rates. Logistic regression has also become quite popular for analyzing case-control studies, as a result of some very significant insights into the logical connections with corresponding methods for cohort studies (12, 13, 56). Furthermore, as computing becomes cheaper, it will probably largely replace the traditional two-group linear discriminant analysis.

It is a little more difficult to know how useful logistic regression will become for multicategory responses ("polytomous logistic regression"), since there are several ways one might contrast the categories (29). Recent work, performed in the context of trying to place patients into one of several diagnostic categories on the basis of a number of binary indicants (symptoms, findings, test results, etc.), suggests that some of these methods are at least feasible (A. Wijesinha, unpublished information).

The arguments of Dawid (23) add further theoretical justification for choosing a more robust prospective model, such as logistic regression, over a retrospective one, such as discriminant analysis. By "prospective" Dawid means predicting responses from the given indicants, and by "retrospective" he means predicting the distribution of indicants from knowledge of the response. In spite of these theoretical advantages, however, direct comparisons of various discrimination techniques have not always shown a definitive advantage for logistic regression (30, 61). As Fienberg (29) points out, however, there is a difference between using these competing methods for discrimination (where it is the overlapping part of the score distribution that contributes to misclassification rates) and using them to make accurate probability predictions or adjustments across the entire probability scale.
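Before leaving the logistic model, it may help to make the earlier point about score distributions concrete. The following sketch (mine, not the article's; the assumption of equal-sized "yes" and "no" groups is made purely for illustration) computes the log-odds of a "yes" at each value of a Poisson-distributed score, using the group means of 0.5 and 2.7 from the example in Reference (14), and shows that the relationship is exactly logistic even though neither score distribution is normal.

```python
# A minimal sketch (not from the article) of the point that normal score
# distributions are not needed for the logistic relationship to hold.
# Assumptions (illustrative only): the score S is Poisson with mean 0.5 in
# the "no" group and 2.7 in the "yes" group, as in Reference (14), and the
# two groups are of equal size (prior odds of 1).

import math

def poisson_pmf(s, lam):
    """P(S = s) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** s / math.factorial(s)

LAM_NO, LAM_YES = 0.5, 2.7   # group-specific mean scores (assumed)
PRIOR_ODDS = 1.0             # P(yes) / P(no); equal group sizes assumed

print(" s   log-odds(yes | S = s)")
for s in range(9):
    # Bayes' theorem: posterior odds = prior odds * likelihood ratio
    odds = PRIOR_ODDS * poisson_pmf(s, LAM_YES) / poisson_pmf(s, LAM_NO)
    print(f"{s:2d}   {math.log(odds):8.3f}")

# The printed log-odds rise by exactly log(2.7 / 0.5) = 1.686 per unit of s,
# i.e. the relationship is linear on the log-odds (logistic) scale even
# though neither score distribution is remotely normal.
```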
The fact that discriminant analysis can hold its own in the task for which it was first designed is no guarantee that it will be equally good for other purposes. Nevertheless, since it is inexpensive, it will probably continue to be used to screen for possible influential confounding variables before undertaking a logistic regression. A disadvantage of logistic regression is that results are often presented as odds or log-odds, or worse still, as unitless coefficients, rather than using the more familiar probabilities. To aid with these nonlinear concepts, it is often appropriate to translate the log-odds back into the more familiar probability scale. Recent articles that used graphical methods (36, 38) or expected numbers of events (34) to describe the fitted models have been especially helpful in this regard.

Log-Linear Models for Multiway Tables

If the stimulus variables can all be considered categorical, binary response data can also be assembled into multiway contingency tables and analyzed using multiplicative models (the same kind used to compute an expected cell entry in the simple 2 x 2 table), which become additive when transformed to a log scale. The logic behind these models, and how they are fitted (almost always by computer iteration), is well described in recent textbooks (3, 10, 29). The attractiveness of log-linear models for multiway tables lies in their parallels with classical analysis of variance models, in their use as a way of standardizing comparisons of rates in complex data sets, and in the ease with which interactions and confounding variables can be identified. They agree with logistic models if one fits as many parameters as there are cells.

The fits to the breast cancer incidence data discussed above are examples of a log-linear approach: the simplest curves involved points that were products of an average age-specific curve and different proportionality factors for the different cohorts. The best-fitting parameters (32 in the first "model" considered) could be fitted by a variety of techniques, such as logistic regression of the 108 numerators and denominators on 32 dummy variables, or a 20 by 12 by 2 contingency table analysis (with a number of cells missing, because some cohorts were too young or because cancers occurring early in life in the earliest cohorts were not in the registry). Some drawbacks to analyzing a binary response by a contingency-table (general log-linear), rather than a regression, approach include the fact that it tends to treat the response variable the same way as the stimulus variables, that it worries about reproducing the interrelationships among the stimulus variables, and that variables that are not categorical have to be made so.

Regression Methods for Life-Table Analysis

Although the life-table (used in the broad sense for techniques that analyze the time until events happen) has long been an essential epidemiologic tool, it is only in the last decade that it has been adapted into a multivariate method (22, 46). Like most of the other methods described in this section, it is log-linear, with the log of the time-specific "mortality" rate (hazard) linked to the "average" hazard and to the explanatory variables through a linear regression. The main differences from logistic regression are that the "average" hazard is not a single quantity but a function of time, and that it is estimated nonparametrically. In the simplest case, the relationship between the hazard and the explanatory variables is assumed to remain constant over time.
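A minimal sketch (mine, with made-up numbers, not taken from the article) may help fix ideas: under this model the time-specific hazard for an individual is the "average" (baseline) hazard, which is free to change with time, multiplied by exp(B1 X1 + B2 X2 + ...), so that the ratio of the hazards for two covariate patterns stays the same at every point in time.

```python
# A minimal sketch (illustrative values only) of the proportional-hazards
# form described above: hazard(t | X) = baseline_hazard(t) * exp(B1*X1 + B2*X2).
# The baseline hazard and the coefficients below are assumptions, chosen
# only to illustrate the structure of the model.

import math

def baseline_hazard(t):
    """An arbitrary 'average' hazard that changes over time."""
    return 0.02 + 0.01 * math.sin(t / 2.0)

B1, B2 = 0.8, -0.5   # assumed regression coefficients (log hazard ratios)

def hazard(t, x1, x2):
    """Time-specific hazard for an individual with covariates x1, x2."""
    return baseline_hazard(t) * math.exp(B1 * x1 + B2 * x2)

# Compare an 'exposed' pattern (x1=1, x2=0) with an 'unexposed' one (x1=0, x2=0):
print(" t    hazard ratio (exposed / unexposed)")
for t in [1, 5, 10, 20]:
    ratio = hazard(t, 1, 0) / hazard(t, 0, 0)
    print(f"{t:2d}    {ratio:.3f}")   # always exp(B1) = 2.226, whatever the value of t

# In the simplest (proportional-hazards) case this ratio is constant over time;
# allowing it to change with time is exactly the relaxation discussed next.
```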
This constancy does not always seem to hold (45, 55), and statistical tests based on this "proportional hazards" model (58) can then be quite misleading. Fortunately, some work has emerged (44, 57), and more is under way, to produce diagnostic tests for checking the appropriateness of the assumed model and for suggesting when the effects of variables should be allowed to vary over time.

POSSIBLE PITFALLS IN MULTIVARIATE ANALYSIS

This section deals with potential risks in the use of multivariate analyses. I do not discuss the risks of specific techniques, details of which will be found in the appropriate textbooks, but rather the issues that cut across techniques and that arise simply because data are multivariate. Indeed, the main message is that the more multivariate the data, the greater the opportunities for problems.

Adding Noise

Although I stress above that including other variables in the analysis of a comparative study can sharpen a comparison, it can also dull it, especially if the user allows a stepwise regression to decide which of many other variables are important. The gain or loss in precision will depend on how strongly these other variables influence the response being studied. For example, including the last digit of each individual's telephone number in a multiple regression will waste one degree of freedom, or the equivalent of one individual. Worse still, if the average value of this variable is not equal in the groups being compared (and in any one study with small groups, it almost certainly will not be), any "adjustments" to the responses on the basis of this variable will actually add unwanted variation. Although users try to guard against such occurrences by first testing whether the slope of the observed relationship is real rather than random, they often use a lax criterion (e.g. a p-value less than, say, 0.20). This, together with the often large numbers of "possibly explanatory" variables "offered" to a regression, adds to the chances of decreasing rather than increasing the precision of a comparison. One way to avoid this artifact of chance is first to split one's data set into two or more smaller sets and to retain only those variables that are influential in each subset.

Overoptimism Regarding Future Performance

The performance of discriminant functions or prediction equations constructed from a data set is often judged by "resimulation," that is, by seeing how well the system "would have done" if it had been used to classify the individuals in the data set. The results are generally overoptimistic, for two reasons. First, because the weights were chosen on the very basis of doing well in this data set, they may well have "chased," or been fooled by, any data patterns that were peculiar to that data set. The random variation in a new data set is unlikely to match the random peculiarities of the "training" data set; as a result, a system that knows only a finite sample, but thinks of it as a universe, will be surprised a little more (16). Second, if one has enough candidate predictors to choose from, one is bound to find some coincidences; similarly, if one builds an equation with enough variables, one will also get an irreproducibly good fit. There are a number of techniques for obtaining less optimistically biased estimates of future misclassification rates without actually doing a prospective test (27). However, they do not apply to the second bias mentioned above; in this latter situation, one needs to evaluate the system on a separate data set.
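The following small simulation (mine, not from the article; the sample size, the number of candidate predictors, and the least-squares scoring are all assumptions chosen purely for illustration) shows this overoptimism in its purest form: even when every candidate predictor is pure noise, a score fitted to one data set appears to classify that same data set well above chance, yet does no better than chance on a separate data set generated by the identical process.

```python
# A minimal simulation (mine, not the article's) of the overoptimism that
# resimulation produces: with many candidate predictors that are pure noise,
# a linear score fitted by least squares "classifies" its own training data
# well above chance, but does no better than chance on a fresh data set.

import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 30                      # assumed sample size and number of noise predictors

def make_data():
    X = rng.normal(size=(n, p))     # predictors unrelated to the outcome
    y = rng.integers(0, 2, size=n)  # binary outcome, also pure noise
    return X, y

def accuracy(weights, X, y):
    score = X @ weights
    predicted = (score > np.median(score)).astype(int)
    return (predicted == y).mean()

X_train, y_train = make_data()
# Least-squares weights for a linear, discriminant-style score
weights, *_ = np.linalg.lstsq(X_train, y_train - 0.5, rcond=None)

X_test, y_test = make_data()        # a separate data set from the same (null) process
print("apparent (resimulation) accuracy:", round(accuracy(weights, X_train, y_train), 2))
print("accuracy on a separate data set :", round(accuracy(weights, X_test, y_test), 2))
# Typical output: roughly 0.7-0.8 apparent accuracy versus roughly 0.5 on new data,
# even though nothing real is being predicted.
```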
A number of studies that claimed high prediction accuracy solely on the basis of resimulation have since "regressed toward the mean" (8, 24, 49, 63). Others have recognized this danger and have included the validation as an integral part of the task (53); one has even subjected the prediction system, which incidentally was constructed by logistic regression, to a comparative trial (52). There has been speculation that there is some "natural law" by which, no matter how many variables are available for prediction, only four or five will finally remain in any stepwise regression (18). This claim would need to be examined more carefully, especially with regard to the influence of typical sample sizes. It does emphasize one point, namely that prediction of binary outcomes is a difficult task, given the considerable nonreducible uncertainty inherent in an all-or-nothing event. A method of measuring the attainable discrimination in a data set, and of deciding whether the search for predictors might be worth the effort, is given in (32).

One Model for All

A recent example points up a serious inadequacy in a common approach to statistical predictions. The study asked whether two different types of gallstones could be distinguished on the basis of the features seen in a radiograph (26). Univariate analyses revealed that, regardless of any other features, stones that appeared to be buoyant were invariably of one type; those that were not buoyant were sometimes of one type, sometimes the other. In spite of this, buoyancy ranked only third in the linear discriminant analysis that tried to predict the variation in types. This is clearly a situation in which buoyant cases could have been classified immediately, removed from the data set, and discriminant analysis applied to the remaining cases. The unconditional "one model for all" approach is simplistic and possibly even misleading. Technically, the discriminant model could be made conditional through the use of interaction terms, provided one could anticipate which ones to include. An alternative, and more natural, approach, which first partitions subjects on the most important variable, then partitions each of these subgroups separately, and so on in a branching fashion, is provided by recursive partitioning (also called Automatic Interaction Detection), a recent nonparametric classification system for use with larger data sets (25, 37). For smaller ones, the "kernel method" (1) seems to hold some promise.

Explaining Away a Difference

In the dental caries survey mentioned above, one would probably collect information on the frequency of visits to a dentist, and one might be tempted to take this variable into account in a multiple regression when studying the effects of other risk factors on caries. If more caries result in more visits, then including the number of visits as an "explanatory" variable will lessen the observed impact of the other (real) risk factors: it will be one of the first variables to enter the regression equation and will thus "explain away" whatever variance might have been more appropriately accounted for by the risk factors being studied. Similar misinterpretations can arise if one includes as an explanatory variable one that is intermediate in the stimulus-response chain, as for example if one allowed for the amounts of medication given in a study comparing the lengths of stay following an operation performed in two different ways.
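A small simulation (mine, not from the article; the variable names, effect sizes, and linear model are assumptions chosen only to illustrate the mechanism) shows how sharply a genuine effect can be attenuated when a consequence of the response is offered to the regression as an "explanatory" variable.

```python
# A minimal simulation (mine, not from the article) of how including a variable
# that is a consequence of the response can "explain away" a real risk factor.
# All variable names and effect sizes here are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(1)
n = 2000

risk_factor = rng.normal(size=n)                  # the exposure of real interest
caries = 2.0 * risk_factor + rng.normal(size=n)   # true effect of 2.0 on the response
visits = 1.5 * caries + rng.normal(size=n)        # dental visits driven by the caries themselves

def fitted_coefficients(*columns):
    """Ordinary least-squares coefficients for caries (after an intercept column)."""
    X = np.column_stack([np.ones(n), *columns])
    coef, *_ = np.linalg.lstsq(X, caries, rcond=None)
    return coef[1:]

print("risk factor alone        :", fitted_coefficients(risk_factor).round(2))
print("risk factor plus 'visits':", fitted_coefficients(risk_factor, visits).round(2))
# The first fit recovers the true coefficient (about 2.0); in the second,
# 'visits' absorbs most of the variance and the risk factor's coefficient
# shrinks from about 2.0 to roughly 0.6, even though visits play no causal
# role in producing caries.
```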
Although it probably draws the correct conclusion, a recent study (39) shows just how easy it is to adjust away a difference, especially if other factors are not held constant. The authors state that the "data are in agreement with the hypothesis" that differences in weight, rather than in pO2 (the partial pressure of oxygen), explain most if not all of the observed differences in blood pressure between children of the same age living at different altitudes. What is alarming is that the data might also be in agreement with a similarly worded hypothesis stated in terms of family income, education, or any other variable that may be associated in a noncausal way with blood pressure, and on which high-altitude children score lower than the comparison group.

CONCLUDING REMARKS

Investigation in the health sciences will continue to be of a multivariate nature. The statistical tools for dealing with the data generated by these studies are now largely in place; the challenge and the obligation will be to use them prudently (7, 59). Even though a number of lines of enquiry have become decidedly more complex in the past few decades (witness, for example, the current thinking on cholesterol and heart disease), by and large, questions still tend to be posed one dimension at a time. The same remains true in multivariate analysis, where even though the computations may sound high-dimensional, the statistical tests are univariate in spirit.

ACKNOWLEDGMENTS

I would like to thank my colleagues for their help with this article.

LITERATURE CITED

1. Aitchison, J., Aitken, C. G. G. 1976. Multivariate binary discrimination by the kernel method. Biometrika 63:413-20
2. Anderson, J. A. 1972. Separate sample logistic discrimination. Biometrika 59:19-35
3. Anderson, S., Auquier, A., Hauck, W. W., Oakes, D., Vandaele, W., Weisberg, H. I. 1980. Statistical Methods for Comparative Studies. New York: Wiley. 289 pp.
4. Armitage, P. 1971. Statistical Methods in Medical Research. Oxford/Edinburgh: Blackwell. 504 pp.
5. Armstrong, J. S. 1967. Derivation of theory by means of factor analysis or Tom Swift and his electric factor analysis machine. Am. Stat. 21:17-21
6. Baker, R. J., Nelder, J. A. 1978. Manual for the GLIM System of Generalized Linear Interactive Modeling. Oxford: Numerical Algorithms Group
7. Barrett-Connor, E. 1979. Infectious and chronic disease epidemiology: Separate and unequal? Am. J. Epidemiol. 109:245-49
8. Bell, R. S., Loop, J. W. 1971. The utility and futility of radiographic skull examination for trauma. N. Engl. J. Med. 284:236-39
9. Bickel, P. J., Hammel, E. A., O'Connell, J. W. 1975. Sex bias in graduate admissions: Data from Berkeley. Science 187:398-404
10. Bishop, Y. M. M., Fienberg, S. E., Holland, P. W. 1976. Discrete Multivariate Analysis: Theory and Practice. Cambridge: MIT Press. 557 pp.
11. Boag, P. T., Grant, P. R. 1981. Intense natural selection in a population of Darwin's finches (Geospizinae) in the Galapagos. Science 214:82-84
12. Breslow, N. E. 1982. Design and analysis of case-control studies. Ann. Rev. Public Health 3:29-54
13. Breslow, N. E., Day, N. E. 1980. Statistical Methods in Cancer Research I. The Analysis of Case-Control Studies. Lyon: Intl. Agency Res. Cancer. 338 pp.
14. Carpenter, R. G., Gardner, A., Pursall, E., McWeeny, P. M., Emery, J. L. 1979. Identification of some infants at immediate risk of dying unexpectedly and justifying intensive study. Lancet 2:343-46
15. Clark, D. W. 1981. A vocabulary for preventive and community medicine. In Preventive and Community Medicine, ed. D. W. Clark, B. MacMahon, pp. 3-15. Boston: Little, Brown. 794 pp. 2nd ed.
16. Cochran, W. G., Hopkins, C. E. 1961. Some classification problems with multivariate qualitative data. Biometrics 17:10-32
17. Cole, T. J. 1975. Linear and proportional regression models in the prediction of ventilatory function. J. R. Stat. Soc. A 138:297-337
18. Coles, L. S., Brown, B. W., Engelhard, C., Halpern, J., Fries, J. F. 1980. Determining the most valuable clinical variables: A stepwise multiple logistic regression program. Meth. Inform. Med. 19:42-49
19. Cook-Mozaffari, P., Bulusu, L., Doll, R. 1981. Fluoridation of water supplies and cancer mortality. I. A search for an effect in the UK on risk of death from cancer. J. Epidemiol. Community Health 35:227-32
20. Cook-Mozaffari, P., Doll, R. 1981. Fluoridation of water supplies and cancer mortality. II. Mortality trends after fluoridation. J. Epidemiol. Community Health 35:233-38
21. Cox, D. R. 1970. The Analysis of Binary Data. London: Methuen. 142 pp.
22. Cox, D. R. 1972. Regression models and life tables (with discussion). J. R. Stat. Soc. B 34:187-202
23. Dawid, A. P. 1976. Properties of diagnostic data distributions. Biometrics 32:647-58
24. DeSmet, A. A., Fryback, D. G., Thornbury, J. R. 1979. A second look at the utility of radiographic skull examination for trauma. Am. J. Roentgen. 132:95-99
25. Diehr, P., Wood, R. W., Barr, V., Wolcott, B., Slay, L., Tompkins, R. K. 1981. Acute headache: Presenting symptoms and diagnostic rules to identify patients with tension and migraine headache. J. Chron. Dis. 34:147-58
26. Dolgin, S. M., Schwartz, J. S., Kressel, H. Y., Soloway, R. D., Miller, W. T., Trotman, B., Soloway, A. S., Good, L. I. 1981. Identification of patients with cholesterol or pigment gallstones by discriminant analysis of radiologic features. N. Engl. J. Med. 304:808-11
27. Efron, B. 1979. Bootstrap methods: Another look at the jackknife. Ann. Stat. 7:1-26
28. Feinstein, A. R. 1977. Clinical Biostatistics. St. Louis: Mosby. 468 pp.
29. Fienberg, S. E. 1980. The Analysis of Cross-Classified Categorical Data. Cambridge: MIT Press. 198 pp. 2nd ed.
30. Gardner, M. J., Barker, D. J. P. 1975. A case study in techniques of allocation. Biometrics 31:931-42
31. Guyer, B., Wallach, L. A., Rosen, S. L. 1982. Birth-weight-standardized neonatal mortality rates and the prevention of low birth weight: How does Massachusetts compare with Sweden? N. Engl. J. Med. 306:1230-33
32. Hanley, J. A., McNeil, B. J. 1982. Maximum attainable discrimination and the utilization of radiologic examinations. J. Chron. Dis. 35:601-11
33. Harris, R. J. 1975. A Primer of Multivariate Statistics. New York: Academic. 332 pp.
34. Heinonen, O. P., Slone, D., Monson, R. R., Hook, E. B., Shapiro, S. 1977. Cardiovascular birth defects and antenatal exposure to female sex hormones. N. Engl. J. Med. 296:67-70
35. Henry, R. C., Hidy, G. M. 1979. Multivariate analysis of particulate sulfate and other air quality variables by principal components. Pt. I. Annual data from Los Angeles and New York. Atmos. Environ. 13:1581-96
36. Higgins, M. W., Keller, J. B., Becher, M., Howatt, W., Landis, J. R., et al. 1982. An index of risk for obstructive airways disease. Am. Rev. Respir. Dis. 125:144-51
37. Hooton, T. M., Haley, R. W., Culver, D. H., Morgan, W. M. 1981. The joint associations of multiple risk factors with the occurrence of nosocomial infection. Am. J. Med. 70:960-70
38. Horning, S. J., Hoppe, R. T., Kaplan, H. S., Rosenberg, S. A. 1981. Female reproductive potential after treatment for Hodgkin's disease. N. Engl. J. Med. 304:1377-82
39. Jongbloed, L. S., Hofman, A. 1983. Altitude and blood pressure in children. J. Chron. Dis. In press
40. Kaplan, R. M., Bush, J. W., Berry, C. C. 1976. Health status: Types of validity and the index of well-being. Health Serv. Res. 478-505
41. Kinlen, L., Doll, R. 1981. Fluoridation of water supplies and cancer mortality. III. A re-examination of mortality in cities in the USA. J. Epidemiol. Community Health 35:239-44
42. Kiviluoto, M. 1980. Observations on the lungs of vanadium workers. Br. J. Indust. Med. 37:363-66
43. Kleinbaum, D. G., Kupper, L. L. 1978. Applied Regression Analysis and Other Multivariable Methods. North Scituate, Mass: Duxbury. 556 pp.