Q1


This exercise investigates different definitions of
Body Mass Index (BMI).
BACKGROUND: With weight measured in Kilograms, and height in metres, BMI is usually
defined as weight divided by the SQUARE of height, i.e., BMI = Wt / (Height*Height),
or BMI = Wt/(height**2) using, as SAS and several other programming languages do,
the symbol ** for 'raised to the power of'. [ NB: Excel uses ^ to denote this ]
What's special about the power of 2? Why not a power of 1, i.e., Weight/height?
Why not 3, i.e., Weight/(height**3)? Why not 2.5, i.e., Weight/(height**2.5)?
One of the statistical aims of a transformation of weight and height to BMI is that
BMI be statistically less correlated with height, thereby separating weight and height
into two more useful components, height and BMI. For example, in predicting lung function
(e.g. FEV1), it makes more sense to use height and BMI than height and weight, since
weight has 2 components in it: it is partly height and partly BMI. Presumably,
one would choose the power that brings the correlation closest to zero.
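A rough way to see why the power matters, worked on the log scale (an assumption made here only to keep the algebra simple; the exercise itself uses the raw scale, where the correlation need not vanish at exactly the same power). Writing W for weight, H for height and p for the power:

```latex
\[
\mathrm{BMI}_p = \frac{W}{H^{p}}
\quad\Longrightarrow\quad
\log \mathrm{BMI}_p = \log W - p\,\log H
\]
\[
\mathrm{Cov}\!\left(\log \mathrm{BMI}_p,\ \log H\right)
  = \mathrm{Cov}\!\left(\log W,\ \log H\right) - p\,\mathrm{Var}\!\left(\log H\right)
\]
\[
\mathrm{Cov}\!\left(\log \mathrm{BMI}_p,\ \log H\right) = 0
\quad\Longleftrightarrow\quad
p = \frac{\mathrm{Cov}(\log W,\ \log H)}{\mathrm{Var}(\log H)}
\]
```

On this scale, the power that removes the correlation is the slope of the least-squares regression of log W on log H, so there is no reason for it to be the same in every dataset.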
The task in this project is to investigate the influence of the power of height used
in the ratio, and to see if the pattern of correlations with power is stable over
different settings (datasets).
DATA: To do this, use 2 of the 6 datasets on the 678 webpage:
[username is c678 and p w is H***J*44 ]
 Children aged 11-16, Alberta 1985 (under 'Datasets')
 18 year olds in Berkeley longitudinal study, born 1928/29 (under 'Datasets')
 Dataset on bodyfat: 252 men (see documentation) (under 'Datasets')
 Pulse Rates before and after Exercise: Australian undergraduates in 1990's (under 'Projects')
 Miss America dataset 1921-2000 (under 'Resources')
 Playboy dataset 1929-2000 (under 'Resources')
METHODS: First create each of the two SAS datasets, and if height and weight are
not already in metres and Kg, convert them to these units. Drop any irrelevant variables.
Inside each dataset, create a variable giving the source of the data (we will merge
the two, and eventually all six, datasets, so we need to be able to tell which
one each observation came from).
Combine the two datasets, i.e. 'stack' them one above the other in a single dataset.
Print out some excerpts.
For each subject in the combined dataset, create 5 versions of 'BMI' using
the powers 1, 1.5, 2, 2.5 and 3.
Calculate the correlation between the 'BMI' obtained with each of these powers, and
height. Do this separately for the observations from the two different sources (the
BY statement should help here).
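The arithmetic of these steps can be sketched outside SAS, for example to cross-check your output. The following is a minimal Python sketch, not the SAS solution: the records are invented toy values (not taken from any of the six datasets), and the names `records`, `bmi`, `pearson` and `correlations_by_source` are hypothetical.

```python
import math

# Invented toy records: (source, height in metres, weight in kg).
# These values are for illustration only; they are NOT course data.
records = [
    ("A", 1.50, 45.0), ("A", 1.60, 52.0), ("A", 1.70, 60.0), ("A", 1.80, 70.0),
    ("B", 1.65, 70.0), ("B", 1.70, 62.0), ("B", 1.75, 80.0), ("B", 1.85, 75.0),
]
POWERS = (1, 1.5, 2, 2.5, 3)


def bmi(weight, height, power):
    """Weight divided by height raised to the given power."""
    return weight / height ** power


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlations_by_source(rows, powers=POWERS):
    """For each source, correlate each version of 'BMI' with height."""
    by_source = {}
    for source, height, weight in rows:
        by_source.setdefault(source, []).append((height, weight))
    result = {}
    for source, pairs in by_source.items():
        heights = [h for h, _ in pairs]
        result[source] = {
            p: pearson([bmi(w, h, p) for h, w in pairs], heights)
            for p in powers
        }
    return result


table = correlations_by_source(records)
for source, corr in table.items():
    print(source, {p: round(r, 3) for p, r in corr.items()})
```

Grouping by source before computing the correlations plays the role that the BY statement plays in SAS.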
Report your CONCLUSIONS. 
Q2


The objective of this exercise is to examine the
relation between heights of parents and heights of their (adult) children, using
recently 'uncovered' data from the Galton archives. You are asked to assess if Galton's
way of dealing with the fact that heights of males and females are quite different
produces sharper correlations than we would obtain using 'modern' methods of dealing
with this fact. As side issues, you are also asked to see whether the data suggest
that stature plays "a sensible part in marriage selection" and to comment
on the correlations of the heights in the 4 {father,son}, {father,daughter}, {mother,son}
and {mother,daughter} pairings.
BACKGROUND: Galton 'transmuted' female heights into their 'male-equivalents' by multiplying
them by 1.08, and then used a single combined 'unisex' dataset of 900-something
offspring and their parents. While some modern-day analysts would simply calculate
separate correlations for the male and female offspring (and then average the two
correlations, as in a meta-analysis), most would use the combined dataset but 'partial
out' the male-female differences using a multivariable analysis procedure. The various
multivariable procedures in effect create a unisex dataset by adding a fixed number
of inches to each female's height (or, equivalently, in the words of one of our female
PhD students, by 'cutting the men down to size'). JH was impressed by the more elegant
'proportional scaling' in the 'multiplicative model' used by Galton, compared with
the 'just use the additive models most readily available in the software' attitude
that is common today. In 2001, he located the raw (untransmuted) data that allow
us to compare the two approaches.
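To make the contrast between the two scalings concrete, here is a minimal Python sketch (not SAS, and not part of the required solution). The heights are invented toy values in inches, and the function names are hypothetical; only the factor 1.08 comes from the text above. The additive shift is taken, as an assumption, to be the difference between the toy fathers' and mothers' mean heights.

```python
from statistics import mean, stdev

GALTON_FACTOR = 1.08  # multiplier quoted in the text

# Invented toy heights in inches; NOT the Galton data.
fathers = [66.0, 68.5, 70.0, 72.0, 69.5]
mothers = [61.0, 63.5, 65.0, 66.5, 64.0]


def transmute_multiplicative(heights, factor=GALTON_FACTOR):
    """Galton-style proportional scaling: multiply each height by a factor."""
    return [h * factor for h in heights]


def transmute_additive(heights, shift):
    """Additive scaling: add the same number of inches to each height."""
    return [h + shift for h in heights]


# Assumed shift: observed difference between the two group means.
shift = mean(fathers) - mean(mothers)

mothers_mult = transmute_multiplicative(mothers)
mothers_add = transmute_additive(mothers, shift)

print("SD fathers            :", round(stdev(fathers), 3))
print("SD mothers (raw)      :", round(stdev(mothers), 3))
print("SD mothers (x 1.08)   :", round(stdev(mothers_mult), 3))
print("SD mothers (+ shift)  :", round(stdev(mothers_add), 3))
```

The design point is that adding a constant moves the mean but leaves the SD unchanged, whereas multiplying by 1.08 stretches both the mean and the SD by that factor.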
DATA: For the purposes of this exercise, the data [see http://www.epi.mcgill.ca/hanley/galton ]
are in two separate files:
 the heights# of 205 sets of parents ( parents.txt ) with families numbered 1-135, 136A, 136-204
 the heights# of their 900-something* children ( offspring.txt ) with families numbered as above
* The data on eight families are deliberately omitted, to entice the scholar in you
to get into the habit of looking at (and even double checking) the original data.
Since here we are more interested in the computing part in this course, and because
time is short, ignore this invitation to inspect the data; we already had a look at
them in class. In practice, we often
add in 'missing data' later, as there are always some problem cases, or lab tests
that have to be repeated, or values that need to be checked, or subjects who didn't
get measured at the same time as others, etc. JH's habit is to make the additions
in the 'source' file (.txt or .xls or whatever) and rerun the entire SAS DATA step(s)
to create the updated SAS dataset (temporary or permanent). If the existing SAS dataset
is already large, and took a lot of time to create, you might consider creating a
small dataset with the new observations, and then stacking (using SET) the new one
under the existing one, in a new file. SAS has fancier ways too, and others may
do things differently!
# If your connection is too slow to view the photo of the first page of the Notebook,
the title reads
FAMILY HEIGHTS
(add 60 inches to every entry in the Table)
METHODS/RESULTS/COMMENTS:
1. Categorize each father's height into one of 3 subgroups (shortest 1/4, middle
1/2, tallest 1/4). Do likewise for mothers. Then, as Galton did [ Table III ], obtain
the 2-way frequency distribution and assess whether "we may regard the married
folk as picked out of the general population at haphazard".
2. Calculate the variance Var[F] and Var[M] of the fathers' [F] and mothers' [M]
heights respectively. Then create a new variable consisting of the sum of F and M,
and calculate Var[F+M]. Comment. Galton called this a "shrewder" test than
the "ruder" one he used in 1. ( statistic-keyword VAR in PROC MEANS)
3. When Galton first analyzed these data in 1885-1886, Galton and Pearson hadn't yet
invented the CORRelation coefficient. Calculate this coefficient and see how
it compares with your impressions in 1 and 2.
4. Create two versions of the transmuted mother's heights, one using Galton's and
one using the modern-day (lazy-person's, black-box?) additive scaling [for the latter,
use the observed difference in the average heights of fathers and mothers, which
you can get by e.g., running PROC MEANS on the offspring dataset, either BY gender,
or using gender as a CLASS variable]. In which version of the transmuted mothers' heights is
their SD more similar to the SD of the fathers? ( statistic-keyword STD in PROC MEANS)
5. Create the two corresponding versions of what Galton called the 'midparent' (i.e.,
the average of the height of the father and the height of the transmuted mother).
Take midpoint to mean the halfway point (so in this case the average of the two).
6. Create the corresponding two versions (additive and multiplicative scaling) of
the offspring heights (note that sons' heights remain 'as is'). Address again, but
now for daughters vs sons, the question raised at the end of 4.
7. Merge the parental and offspring datasets created in steps 4 and 6, taking care
to have the correct parents matched with each offspring (this is called a 1:many
merge).
8. Using the versions based on 1.08, round the offspring and midparent heights to
the nearest inch (or use the FLOOR function to just keep the integer part of the
midparent height; you need not be as fussy as Galton was about the groupings
of the midparent heights), and obtain a 2-way frequency distribution similar to
that obtained by Galton [ Table I ]. Note that, opposite to what we might do
today, Galton put the parents on the vertical, and the offspring on the horizontal
axis. ( The MOD, INT, FLOOR, CEIL and ROUND functions can help you map observations
into 'bins'; we will later see a way to do so using loops)
9. Galton called the offspring in the same row of his table a 'filial array'. Find
the median height for each filial array, and plot it, as Galton did, against the
midpoint of the interval containing their midparent; you should have one datapoint
for each array*. Put the midparent values on the vertical, and the offspring on
the horizontal axis. By eye, estimate the slope of the line of best fit to the datapoints.
Mark your fitted line by 'manually' inserting two markers at the opposite corners
of the plot. Does the slope of your fitted line agree with Galton's summary of the
degree of "regression to mediocrity"? [ Plate IX ]
*Note that Galton used datapoints for just 9 filial arrays, choosing to omit those in the
bottom and top rows (those with the very shortest and the very tallest parents) because
the data in these arrays were sparse. ( By using the binned parental height in
the CLASS statement in PROC MEANS or PROC UNIVARIATE, directing the output to a new
SAS dataset, and applying PROC PLOT to this new dataset, you can avoid having to
do the plotting manually. See more on this in the FAQ. )
10. Plot the individual unisex offspring heights (daughters additively transmuted)
versus the midparent height (mothers transmuted). OVERLAY on it, with a different
plotting symbol, the corresponding plot involving the multiplicatively transmuted
offspring values (on the parent-axis, stay with Galton's definition of a midparent).
(see FAQ)
Compare the two, and have a look at Galton's fitted ellipse, corresponding to a bivariate
normal distribution [ Plate X ] {here, again, we would be more likely
to plot the parents' heights on the horizontal, and the offspring heights on the
vertical axis}.
11. For each of the following 'offspring vs. midparent' correlations, use the 'midparent'
obtained using Galton's multiplicative method. Calculate (a) the 2 correlations for
the 2 unisex versions of the offspring data (b) the sex-specific correlations (i.e.,
daughters and sons separately) and (c) the single parent-offspring correlation, based
on all offspring combined, and their untransmuted heights, ignoring the sex of the
offspring. Comment on the correlations obtained, and on the instances where there
are big disparities between them. [ a PLOT, with separate plotting symbols for
sons and daughters, might help in the case of (c) ]
12. Calculate the 4 correlations (i) father,son (ii) father,daughter, (iii) mother,son
and (iv) mother,daughter. Comment on the pattern, and on why you think it turned
out this way.
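The data handling in steps 5 to 9 (midparent, 1:many merge, binning, one median per filial array) can be sketched as follows. This is a minimal Python illustration, not the SAS solution: the families and heights are invented toy values, the dictionary layout and function names are hypothetical, and only the 1.08 factor comes from the text. Family identifiers are kept as strings on the assumption that labels such as 136A rule out a purely numeric key.

```python
import math
from statistics import median

FACTOR = 1.08  # Galton's multiplier for female heights

# Invented toy data in inches; NOT the Galton data.
parents = {
    "1": {"father": 70.0, "mother": 64.0},
    "2": {"father": 66.5, "mother": 63.0},
    "136A": {"father": 68.0, "mother": 60.0},
}
offspring = [
    {"family": "1", "sex": "M", "height": 71.0},
    {"family": "1", "sex": "F", "height": 65.0},
    {"family": "2", "sex": "M", "height": 67.5},
    {"family": "2", "sex": "F", "height": 62.0},
    {"family": "136A", "sex": "F", "height": 63.0},
]


def merge_one_to_many(parent_table, children):
    """Attach each child's own parents (a 1:many merge on family id)."""
    merged = []
    for child in children:
        p = parent_table[child["family"]]  # KeyError flags an unmatched child
        midparent = (p["father"] + FACTOR * p["mother"]) / 2
        if child["sex"] == "F":
            unisex = child["height"] * FACTOR  # daughters are transmuted
        else:
            unisex = child["height"]  # sons remain 'as is'
        merged.append({**child, **p, "midparent": midparent, "unisex": unisex})
    return merged


def bin_floor(x):
    """Keep the integer part, as the FLOOR function would."""
    return math.floor(x)


def bin_nearest(x):
    """Round to the nearest inch, with halves rounded up."""
    return math.floor(x + 0.5)


def filial_medians(rows):
    """Median unisex offspring height within each midparent bin."""
    arrays = {}
    for row in rows:
        arrays.setdefault(bin_floor(row["midparent"]), []).append(row["unisex"])
    return {b: median(values) for b, values in sorted(arrays.items())}


merged = merge_one_to_many(parents, offspring)
medians = filial_medians(merged)
for midparent_bin, med in medians.items():
    print(midparent_bin, round(med, 2))
```

Looking up the parents by family id for every child is what guarantees that siblings share the same parental values. `bin_nearest` is written with `floor(x + 0.5)` because Python's built-in `round` sends exact halves to the nearest even integer.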
