Course 613: Assignments

McGill University, Department of Epidemiology, Biostatistics and Occupational Health

EPIB 613: Introduction to Statistical Software (Fall 2006)

Assignment 3, due October 30

Working in teams of two or three...


This exercise investigates different definitions of Body Mass Index (BMI).

BACKGROUND: With weight measured in kilograms and height in metres, BMI is usually defined as weight divided by the SQUARE of height, i.e., BMI = Wt / (Height*Height), or BMI = Wt / (Height**2) using, as SAS and several other programming languages do, the symbol ** for 'raised to the power of'. [NB: Excel uses ^ to denote this.]

What's special about the power of 2? Why not a power of 1, i.e., Weight/height?

Why not 3, i.e., Weight/(height**3)? Why not 2.5, i.e., Weight/(height**2.5)?

One of the statistical aims of a transformation of weight and height to BMI is that BMI be statistically less correlated with height, thereby separating weight and height into two more useful components, height and BMI. For example, in predicting lung function (e.g. FEV1), it makes more sense to use height and BMI than height and weight, since weight has 2 components in it -- it is partly height and partly BMI. Presumably, one would choose the power which brings the correlation closest to zero.

The task in this project is to investigate the influence of the power of height used in the ratio, and to see if the pattern of correlations with power is stable over different settings (datasets).

DATA: To do this, use 2 of the 6 datasets on the 678 webpage:
[username is c678 and password is H***J*44]

- Children aged 11-16 Alberta 1985 (under 'Datasets')
- 18 year olds in Berkeley longitudinal study, born 1928/29 (under 'Datasets')
- Dataset on bodyfat -- 252 men (see documentation) (under 'Datasets')
- Pulse Rates before and after Exercise -- Australian undergraduates in 1990's (under 'Projects')
- Miss America dataset 1921-2000 (under 'Resources')
- Playboy dataset 1929-2000 (under 'Resources')

METHODS: First create each of the two SAS datasets, and if height and weight are not already in metres and kg, convert them to these units. Drop any irrelevant variables. Inside each dataset, create a variable giving the source of the data (we will combine the two -- and eventually all six -- datasets, so we need to be able to tell which one each observation came from).

Combine the two datasets, i.e. 'stack' them one above the other in a single dataset. Print out some excerpts.
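The assignment itself is to be done in SAS; purely to illustrate the logic of the preparation steps above (convert units, drop irrelevant variables, tag the source, stack), here is a minimal Python sketch. The field names (weight_lb, height_in, height_cm, and so on) and the data values are made up for illustration and are not taken from the course datasets.

```python
# Hypothetical records: field names and values are invented for illustration.
LB_TO_KG = 0.45359237   # one pound in kilograms (exact by definition)
IN_TO_M = 0.0254        # one inch in metres (exact by definition)


def to_metric(record, source):
    """Return a new record in kg and metres, tagged with its source.

    Only the variables needed later are kept, so irrelevant ones
    (such as 'age' below) are dropped.
    """
    return {
        "source": source,
        "weight_kg": record["weight_lb"] * LB_TO_KG,
        "height_m": record["height_in"] * IN_TO_M,
    }


bodyfat_raw = [{"weight_lb": 154.25, "height_in": 67.75, "age": 23}]
alberta_raw = [{"weight_kg": 45.0, "height_cm": 152.0}]

bodyfat = [to_metric(r, "bodyfat") for r in bodyfat_raw]
alberta = [
    {
        "source": "alberta",
        "weight_kg": r["weight_kg"],
        "height_m": r["height_cm"] / 100,
    }
    for r in alberta_raw
]

# 'Stacking': the rows of one dataset placed under the rows of the other.
combined = bodyfat + alberta

for row in combined:
    print(row)
```

Every stacked row carries the same three variables, so later steps can treat the combined dataset uniformly and still split it by source.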

For each subject in the combined dataset, create 5 versions of 'BMI' using the powers 1, 1.5, 2, 2.5 and 3.

Calculate the correlation between the 'BMI' obtained with each of these powers, and height. Do this separately for the observations from the two different sources (the BY statement should help here).
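As a rough illustration of the calculation just described (again in Python rather than SAS, and only as a sketch), the following computes weight/height**power for each of the five powers and correlates each version with height, separately by source. The rows, the source labels "A" and "B", and the helper names are hypothetical.

```python
from math import sqrt

POWERS = (1, 1.5, 2, 2.5, 3)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlations_by_source(rows):
    """Map each source to {power: corr(weight / height**power, height)}."""
    by_source = {}
    for row in rows:
        by_source.setdefault(row["source"], []).append(row)
    out = {}
    for source, group in by_source.items():
        heights = [r["height_m"] for r in group]
        out[source] = {
            p: pearson(
                [r["weight_kg"] / r["height_m"] ** p for r in group],
                heights,
            )
            for p in POWERS
        }
    return out


# Made-up rows, for illustration only.
rows = [
    {"source": "A", "weight_kg": 50.0, "height_m": 1.50},
    {"source": "A", "weight_kg": 62.0, "height_m": 1.62},
    {"source": "A", "weight_kg": 70.0, "height_m": 1.75},
    {"source": "A", "weight_kg": 68.0, "height_m": 1.80},
    {"source": "B", "weight_kg": 40.0, "height_m": 1.45},
    {"source": "B", "weight_kg": 48.0, "height_m": 1.55},
    {"source": "B", "weight_kg": 60.0, "height_m": 1.60},
    {"source": "B", "weight_kg": 58.0, "height_m": 1.70},
]

result = correlations_by_source(rows)
for source, corrs in result.items():
    for power, r in corrs.items():
        print(source, power, round(r, 3))
```

Grouping the rows by source before correlating plays the role that the BY statement plays in SAS.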

Report your CONCLUSIONS.





The objective of this exercise is to examine the relation between heights of parents and heights of their (adult) children, using recently 'uncovered' data from the Galton archives. You are asked to assess if Galton's way of dealing with the fact that heights of males and females are quite different produces sharper correlations than we would obtain using 'modern' methods of dealing with this fact. As side issues, you are also asked to see whether the data suggest that stature plays "a sensible part in marriage selection" and to comment on the correlations of the heights in the 4 {father,son}, {father,daughter}, {mother,son} and {mother,daughter} pairings.

BACKGROUND: Galton 'transmuted' female heights into their 'male-equivalents' by multiplying them by 1.08, and then used a single combined 'uni-sex' dataset of 900-something offspring and their parents. While some modern-day analysts would simply calculate separate correlations for the male and female offspring (and then average the two correlations, as in a meta-analysis), most would use the combined dataset but 'partial out' the male-female differences using a multivariable analysis procedure. The various multivariable procedures in effect create a unisex dataset by adding a fixed number of inches to each female's height (or, equivalently, in the words of one of our female PhD students, by 'cutting the men down to size'). JH was impressed by the more elegant 'proportional scaling' in the 'multiplicative model' used by Galton, compared with the 'just use the additive models most readily available in the software' attitude that is common today. In 2001, he located the raw (untransmuted) data that allows us to compare the two approaches.
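To make the contrast between the two scalings concrete, here is a small Python sketch (not SAS, and not Galton's data: the heights below are invented). It shows that multiplying by 1.08 stretches the spread of the female heights along with their mean, whereas adding a fixed number of inches shifts the mean and leaves the SD unchanged. The function names are hypothetical.

```python
from statistics import mean, stdev

GALTON_FACTOR = 1.08  # Galton's multiplier for female heights


def transmute_multiplicative(female_heights, factor=GALTON_FACTOR):
    """Galton's proportional scaling: multiply each female height."""
    return [h * factor for h in female_heights]


def transmute_additive(female_heights, male_heights):
    """Additive scaling: add the observed male-female mean difference."""
    shift = mean(male_heights) - mean(female_heights)
    return [h + shift for h in female_heights]


# Invented heights in inches, for illustration only.
mothers = [61.0, 63.5, 64.0, 66.0]
fathers = [66.0, 68.5, 69.0, 72.0]

mult = transmute_multiplicative(mothers)
add = transmute_additive(mothers, fathers)

print("SD fathers         :", round(stdev(fathers), 3))
print("SD mothers (raw)   :", round(stdev(mothers), 3))
print("SD mothers (x 1.08):", round(stdev(mult), 3))
print("SD mothers (+shift):", round(stdev(add), 3))
```

Whichever transmuted SD lands closer to the fathers' SD in the real data is an empirical question; the sketch only shows how each scaling acts on the mean and the spread.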

DATA: For the purposes of this exercise, the data [see ] are in two separate files:

- the heights# of 205 sets of parents (parents.txt) with families numbered 1-135, 136A, 136-204
- the heights# of their 900-something* children (offspring.txt) with families numbered as above

* The data on eight families are deliberately omitted, to entice the scholar in you to get into the habit of looking at (and even double checking) the original data.
Since we are more interested in the computing part in this course, and because time is short, ignore this invitation to inspect the data -- we already had a look at them in class. In practice, we often add in 'missing data' later, as there are always some problem cases, or lab tests that have to be repeated, or values that need to be checked, or subjects who didn't get measured at the same time as others, etc. JH's habit is to make the additions in the 'source' file (.txt or .xls or whatever) and re-run the entire SAS DATA step(s) to create the updated SAS dataset (temporary or permanent). If the existing SAS dataset is already large, and took a lot of time to create, you might consider creating a small dataset with the new observations, and then stacking (using SET) the new one under the existing one -- in a new file. SAS has fancier ways too, and others may do things differently!

# If your connection is too slow to view the photo of the first page of the Notebook, the title reads

(add 60 inches to every entry in the Table)


1. Categorize each father's height into one of 3 subgroups (shortest 1/4, middle 1/2, tallest 1/4). Do likewise for mothers. Then, as Galton did [Table III], obtain the 2-way frequency distribution and assess whether "we may regard the married folk as picked out of the general population at haphazard".

2. Calculate the variance Var[F] and Var[M] of the fathers' [F] and mothers' [M] heights respectively. Then create a new variable consisting of the sum of F and M, and calculate Var[F+M]. Comment. Galton called this a "shrewder" test than the "ruder" one he used in 1. ( statistic-keyword VAR in PROC MEANS)

3. When Galton first analyzed these data in 1885-1886, he and Pearson hadn't yet invented the CORRelation coefficient. Calculate this coefficient and see how it compares with your impressions in 1 and 2.

4. Create two versions of the transmuted mothers' heights, one using Galton's multiplicative scaling and one using the modern-day (lazy-person's, blackbox?) additive scaling [for the latter, use the observed difference in the average heights of fathers and mothers, which you can get by e.g., running PROC MEANS on the offspring dataset, either BY gender, or using gender as a CLASS variable]. In which version of the transmuted mothers' heights is their SD more similar to the SD of the fathers? (statistic-keyword STD in PROC MEANS)

5. Create the two corresponding versions of what Galton called the 'mid-parent' (i.e., the average of the height of the father and the height of the transmuted mother).
Take mid-point to mean the half-way point (so in this case the average of the two).

6. Create the corresponding two versions (additive and multiplicative scaling) of the offspring heights (note that sons' heights remain 'as is'). Address again, but now for daughters vs sons, the question raised at the end of 4.

7. Merge the parental and offspring datasets created in steps 4 and 6, taking care to have the correct parents matched with each offspring (this is called a 1:many merge).

8. Using the versions based on 1.08, round the offspring and mid-parent heights to the nearest inch (or use the FLOOR function to just keep the integer part of the mid-parent height -- you need not be as fussy as Galton was about the groupings of the mid-parent heights), and obtain a 2-way frequency distribution similar to that obtained by Galton [Table I]. Note that, opposite to what we might do today, Galton put the parents on the vertical, and the offspring on the horizontal axis. (The MOD, INT, FLOOR, CEIL and ROUND functions can help you map observations into 'bins'; we will later see a way to do so using loops.)

9. Galton called the offspring in the same row of his table a 'filial array'. Find the median height for each filial array, and plot it, as Galton did, against the midpoint of the interval containing their midparent -- you should have one datapoint for each array*. Put the mid-parent values on the vertical, and the offspring on the horizontal axis. By eye, estimate the slope of the line of best fit to the datapoints. Mark your fitted line by 'manually' inserting two markers at the opposite corners of the plot. Does the slope of your fitted line agree with Galton's summary of the degree of "regression to mediocrity"? [Plate IX]
*Note that Galton used datapoints for just 9 filial arrays, choosing to omit those in the bottom and top rows (those with the very shortest and the very tallest parents) because the data in these arrays were sparse. (By using the binned parental height in the CLASS statement in PROC MEANS or PROC UNIVARIATE, directing the output to a new SAS dataset, and applying PROC PLOT to this new dataset, you can avoid having to do the plotting manually. See more on this in the FAQ.)

10. Plot the individual unisex offspring heights (daughters additively transmuted) versus the mid-parent height (mothers transmuted). OVERLAY on it, with a different plotting symbol, the corresponding plot involving the multiplicatively transmuted offspring values (on the parent-axis, stay with Galton's definition of a midparent). (see FAQ) Compare the two, and have a look at Galton's fitted ellipse, corresponding to a bivariate normal distribution [Plate X] {here, again, we would be more likely to plot the parents' heights on the horizontal, and the offspring heights on the vertical axis}.

11. For each of the following 'offspring vs. mid-parent' correlations, use the 'mid-parent' obtained using Galton's multiplicative method. Calculate (a) the 2 correlations for the 2 unisex versions of the offspring data (b) the sex-specific correlations (i.e., daughters and sons separately) and (c) the single parent-offspring correlation, based on all offspring combined, and their untransmuted heights, ignoring the sex of the offspring. Comment on the correlations obtained, and on the instances where there are big disparities between them. [ a PLOT, with separate plotting symbols for sons and daughters, might help in the case of (c) ]

12. Calculate the 4 correlations (i) father,son (ii) father,daughter, (iii) mother,son and (iv) mother,daughter. Comment on the pattern, and on why you think it turned out this way.
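For orientation only, the data-handling logic of steps 7 to 9 (a 1:many merge of parents with offspring, binning the mid-parent height with a floor function, and taking the median of each filial array) can be sketched in Python as below. This is not SAS and not a model answer; the records, field names and helper functions are invented for illustration. Family identifiers are kept as text because of labels such as 136A.

```python
from collections import defaultdict
from math import floor
from statistics import median

FACTOR = 1.08  # Galton's multiplier for female heights

# Invented records (heights in inches), keyed by family identifier.
parents = {
    "1": {"father": 70.0, "mother": 64.0},
    "136A": {"father": 68.5, "mother": 62.0},
}
offspring = [
    {"family": "1", "sex": "M", "height": 71.0},
    {"family": "1", "sex": "F", "height": 65.0},
    {"family": "136A", "sex": "M", "height": 69.0},
    {"family": "136A", "sex": "F", "height": 63.0},
]


def midparent(p, factor=FACTOR):
    """Average of the father's height and the transmuted mother's height."""
    return (p["father"] + factor * p["mother"]) / 2


def merge_one_to_many(parents, offspring, factor=FACTOR):
    """Attach the correct parents' mid-parent height to every offspring row."""
    merged = []
    for child in offspring:
        p = parents[child["family"]]  # a KeyError flags an unmatched family
        scale = factor if child["sex"] == "F" else 1.0  # sons stay 'as is'
        merged.append({
            **child,
            "unisex_height": child["height"] * scale,
            "midparent": midparent(p, factor),
        })
    return merged


def filial_medians(merged):
    """Median unisex offspring height per 1-inch mid-parent bin."""
    arrays = defaultdict(list)
    for row in merged:
        arrays[floor(row["midparent"])].append(row["unisex_height"])
    return {b: median(v) for b, v in sorted(arrays.items())}


merged = merge_one_to_many(parents, offspring)
medians = filial_medians(merged)
for b, m in medians.items():
    # Bin b covers [b, b+1), so its midpoint is b + 0.5.
    print(b + 0.5, round(m, 2))
```

Looking up each offspring row's parents by family identifier is what makes this a 1:many match: one parental record serves every child in that family.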


Put all of the program steps and output into a single .txt file. JH will use a mono-spaced font such as Courier to view it -- that way the alignment should be ok. Interleave DATA and PROC statements with output and conclusions, and use helpful titles (produced by SAS, but to your specifications) over top of each output. Get SAS to set up the output so that there are no more than 65 horizontal characters per line -- that way, lines won't wrap around even when the font used to view your file is increased. Show relevant excerpts rather than entire listings of datafiles. Annotate liberally. Submit the text file electronically (i.e., by email) to JH by 9 am on Monday October 30.

(updated Oct 4, 2006)