NOTES ON KKMN CHAPTER 2     jh 1999.05.29

2-1
===

The main reason for making these distinctions is the implication for the method of analysis in Table 2-1 on page 12 of the text. Note that the methods of analysis segregate according to the nature of the response ("dependent") variable, and not according to the nature of the determinants ("independent" variables).

Note also with respect to Table 2-1 that

  (1) logistic regression can handle polytomous responses, not just dichotomous ones.

  (2) Poisson regression involves responses measured as COUNTS (0, 1, 2, ....), not just any discrete scale.

See the previously cited article on appropriate uses of multivariate analysis for a more complete version of Table 2-1.

This is not to say that the level of measurement of the determinants is irrelevant. If they have more than two levels, categorical variables need special representation as "independent" variables.

2-1-1
=====

I would have started with section 2-1-3, since it doesn't make as much sense to me to talk of "discrete" or "continuous" unless I can put the values on a numerical scale, i.e. Figures 2-1 and 2-2 assume a numerical scale. One cannot even discuss the "discreteness" OR "continuous-ness" of a nominal variable like blood group or gender or language or ethnic group.

By the way, one thinks of gender as a nominal variable with two categories. On the other hand, at what level of "preciseness of measurement" would one classify the variable "sex"?

2-1-2
=====

A variable can have one role in one analysis or study and another role in a different study.

The "modifier" role of a variable is not mentioned.

2-1-3
=====

See earlier comments on 2-1-1.

Interval variables which are measured on a ratio scale are treated the same way statistically as interval variables which are not.

The seventh paragraph (ratio-scale variables often involve ...) requires a number of comments.
First, the variations may not be due to measurement errors; they may be perfectly legitimate biological variations -- within the same unit or between units.

Second, this "non-normal" pattern of variation (that "violates" an important assumption of linear regression) is a NON-ISSUE if the variable is used as a determinant. After all, we "tolerate" BINARY regressor ("independent") variables (such as gender, coded 0 and 1) that don't even have a continuous scale, let alone an "equal variance" or "Gaussian" distribution.

Third, even if the variable in question is the "dependent" or "response" or "outcome" variable, the non-Gaussian and non-homogeneous variation may not be critical -- it depends on

  (1) sample size, and the central limit theorem

  (2) whether the focus is on the MEAN response or on individual responses and

  (3) whether one is speaking of the variation IRRESPECTIVE of which "X" value it is associated with, or the variation of observations within a "cell" where all of the "X" variables have the same value.

You could imagine that heights of ADULTS would certainly have a non-Gaussian distribution, while heights of adult MALES and heights of adult FEMALES might well have close-to-Gaussian distributions.

Fourth, the terms "Gaussian" and "non-Gaussian" are preferred to "normal" and "non-normal". It is perfectly "normal" (in the French sense of the word) that heights of adult males -- in an "epidemiologically clean" (i.e. homogeneous) population -- might show a Gaussian pattern of variation, whereas the accompanying weights -- being more "elective" and self-determined -- would not.

2-2
===

Don't get too fussed about this.

2-3. Footnote to Table 2-1 ("control" variable).
=================================================

You will see this terminology used a lot in other disciplines. The way it is described here, it implies a "confounding variable" that, unless taken into account, "distorts" or "biases" or confounds the primary relationship of interest.
But it could also be that the question was, in an RCT, whether physical activity has short term effects on blood pressure. One could arrange it so that the contrasted exercise-level groups were "balanced" with respect to (had the same average) age and gender. In this instance, age and gender could have other roles, such as serving to reduce the variability of the responses, or to indicate subgroups in whom the exercise - no exercise contrast produces different answers (i.e. age and gender would be modifiers of the exercise - BP relationship).

Table 2-2.
==========

It is interesting to see the phrases used to describe the purpose of the analysis, e.g.

  "the relationship between" Y and X1, X2, etc;
  to what extent the X's "are related to" the probability that Y is 1;
  the "relationship between" Y (rate) and the independent variables;
  to compare outcomes, "adjusting for" some Xs;
  whether responses in one racial group are higher than in another, "after controlling for" age ...;
  whether categories of X (race) have an effect on the difference score;
  "describe the relationship between" Y and X.

While one wishes to be cautious about causal inference from these (all non-experimental) studies, one could be a bit more directional in one's description. If one speaks of "how A is related to B", or of "the relationship between A and B", one doesn't immediately know [except from substantive knowledge] whether the "direction" is "A->B" or "B->A".

If you want to avoid giving the impression that your choice of words is overly "causal", you can still use words that show the DIRECTION of the arrow. For example, you could use the word "determinant" for X, i.e. "X as a determinant of Y" is strictly descriptive. It says levels of Y are different at ("predictable from") different levels of X. For example, in the cross-sectional Busselton anthropometric study, or in the study of Quebec millers and miners of asbestos born between 1890 and 1910, one can see a relationship "BETWEEN height and age".
So one can speak of AGE being a "DETERMINANT OF" HEIGHT, just as one can say "number of publications determines salary". The meaning is in a strict mathematical sense only.

Just as you wouldn't arbitrarily choose whether the Y variable should go on the vertical or horizontal axis, phrases like "between Y and X" (or "between X and Y"), which give no direction, should not be used to describe Y vis-a-vis X.

Likewise, if you are willing to be a bit more daring, why not write the role of "maternal smoking in the etiology of low birth weight" rather than the neutral "smoking and low birth weight" or "low birth weight and smoking".
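
As a rough illustration of the third comment on 2-1-3 (heights of all ADULTS versus heights within each sex), here is a minimal Python sketch. The means (176 and 162 cm), the common SD (6.5 cm), the sample size and the seed are hypothetical values chosen for illustration only; they are not data from KKMN or from any study. Excess kurtosis is used here as one simple index of departure from a Gaussian shape (it is about 0 for Gaussian data); other indices would serve equally well.

```python
import random
import statistics


def excess_kurtosis(values):
    """Sample excess kurtosis (4th central moment / variance^2, minus 3)."""
    mean = statistics.fmean(values)
    n = len(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / m2 ** 2 - 3.0


def simulate_heights(n_per_group=20000, seed=1999):
    """Draw Gaussian heights (cm) within each sex.

    The means and SD below are ASSUMED values for illustration, not real data.
    """
    rng = random.Random(seed)
    males = [rng.gauss(176.0, 6.5) for _ in range(n_per_group)]
    females = [rng.gauss(162.0, 6.5) for _ in range(n_per_group)]
    return males, females


males, females = simulate_heights()
adults = males + females  # variation IRRESPECTIVE of the "X" (sex) value

for label, sample in (("males", males), ("females", females), ("all adults", adults)):
    print(f"{label:>10}: SD = {statistics.pstdev(sample):5.2f} cm, "
          f"excess kurtosis = {excess_kurtosis(sample):+.2f}")
```

With these assumed values, the within-sex samples should show excess kurtosis near 0, while the pooled sample should be flatter than a Gaussian (negative excess kurtosis) and have a larger SD: the within-"cell" variation and the overall variation are different things.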