Oct 20 [2005] See revised entry re midpoint
for q 5 below
Oct 19, 2005
We have a few questions regarding the assignment.
1- In question 1, we found the correlations to be very different between Alberta
and Berkeley. In other words, the smallest correlation coefficient was with the 1.5
power BMI for the Berkeley data and the 2.0 power BMI for the Alberta data. Our programming
seems sound. Is this caused by the size of the sample we used for the Berkeley dataset?
I expect it is the difference in age of the two samples .. if I'm not mistaken,
Alberta subjects are still children, and not fully grown. You may also find different
powers for models, ballerinas, and football players.
3- In question 2.2, we feel that the heights of fathers and mothers are by nature
independent, (?? People self-select, and maybe tend to choose people near their
own height.. so most of us expected some (positive) correlation) and therefore
the sum of individual variances should be equal to the variance of the sum of these
2 variables?
Indeed, if there were no correlation (a little weaker than independence), then
yes the variances should simply add.. I think that you will find that the variance
of the sum is a bit bigger than the sum of the variances, so there is some small
amount of positive covariance (correlation).
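As a reminder of the standard identity behind this answer (F and M are symbols chosen here for the fathers' and mothers' heights; they are not names from the assignment):

```latex
\operatorname{Var}(F + M) = \operatorname{Var}(F) + \operatorname{Var}(M) + 2\,\operatorname{Cov}(F, M)
```

So the variance of the sum exceeds the sum of the variances exactly when the covariance is positive, and the two are equal when the covariance is zero.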
I think the answer you get is close to no correlation… or is it the result that
should decide whether the variables should be called independent or not? You can
let the data tell you.. (By the way, you could have data with a strange dependent
pattern, but where the correlation is zero). The correlation coefficient only picks
up straight line relations .. if you had a U shape, or inverted U shape, relation,
the Pearson correlation could be very close to zero.
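A minimal sketch of the U-shape point, using made-up toy data (the dataset name ushape and the variables x and y are hypothetical, not part of the assignment). Here y is completely determined by x, yet the x values are symmetric about zero, so the Pearson correlation works out to zero:

```sas
/* Toy data (hypothetical): y is an exact function of x, a U shape */
data ushape;
   do x = -3 to 3;
      y = x**2;
      output;
   end;
run;

/* Pearson correlation between x and y; by symmetry it is zero */
proc corr data=ushape pearson;
   var x y;
run;
```

A scatter plot of y against x would show the dependence that the correlation coefficient misses.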
4- In question 2.6, we were wondering whether we should ignore the fact that there
are many missing heights in the offspring dataset (not necessarily missing, but rather,
they are often qualitative data rather than numbers)?
YES.. and if you read them in as numbers, SAS will set them to missing..
I suppose we could give a value to “medium” but that is going too far for our purposes..
the main point is that data often have some missing values.. c'est la vie.
5- In question 2.9, does midpoint mean (65.5+64.5)/2=65 ???
Revised answer on Oct 20
: if you chop the fractional part of a midparent of 68.2 or one of 68.8, you have
68. So the integer 68 represents the interval 68-69, of which 68.5 is the mid-point.
You can place a mid_parent of 68.2 or 68.8 at the midpoint 68.5 by creating a variable
such as
midpoint = FLOOR(mid_parent) + 0.5;
We have the mid-point of the INTERVAL 68-69, AND the mid-PARENT-- two different concepts.
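A small check of the FLOOR idea on a few made-up values (the dataset name check and the typed-in values are for illustration only; in the assignment the statement would go in the data step that builds your own dataset):

```sas
/* Toy check (hypothetical data): place each mid_parent at the midpoint of its 1-inch interval */
data check;
   input mid_parent;
   midpoint = FLOOR(mid_parent) + 0.5;   /* 68.2 and 68.8 both go to 68.5 */
   datalines;
68.2
68.8
69.0
;
run;

proc print data=check;
run;
```

Note that 69.0 falls in the next interval, 69-70, and so is placed at 69.5.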
How do we use proc univariate here? Do we apply class to the binned parental height
YES
or does class bin the parental height for us?
NO; CLASS takes whatever variable you tell it is the class variable, and does the
statistics for the subgroups with different values of that variable. Think of the
statement "CLASS specified_variable" as
"do subgroup analyses, dividing subjects based on specified_variable ..."
See the example in class on "Exploring Data" in the UCLA series:
With the class statement, we get the descriptive statistics broken down by prgtype
proc means data='c:\sas\hs0'
n mean median std var;
class prgtype;
var read math science write;
run;
We can also use CLASS with PROC UNIVARIATE.
You want the medians for the observations in each 'bin' (row in Galton's table),
so it will be as follows. [You need to have already created a variable, called say
'mid_parent_height_integer', with the FLOOR function,
eg mid_parent_height_integer = FLOOR(mid_parent_height); ]
proc UNIVARIATE data='c:\sas\hs0'; /* point data= at your own dataset with these variables */
class mid_parent_height_integer;
var offspring_height ;
run;
The way to save these stats for plotting in the next q. is to use an OUTPUT statement
in PROC UNIVARIATE, so you can make a new dataset with each value of mid_parent_height_integer
and the median of the heights of the offspring in that bin.
PROC UNIVARIATE ... ;
CLASS mid_parent_height_integer;
VAR offspring_height ;
OUTPUT OUT=SAS-data-set
keyword = name ; /* keyword here is MEDIAN; name is whatever you want to call it */
RUN;
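Filled in, the template might look like the sketch below. The dataset names galton_binned and bin_medians and the variable name median_offspring_height are hypothetical; substitute your own. NOPRINT just suppresses the printed output, since only the saved dataset is wanted here.

```sas
/* Sketch with hypothetical dataset names: one median per mid-parent bin */
proc univariate data=galton_binned noprint;
   class mid_parent_height_integer;
   var offspring_height;
   output out=bin_medians median=median_offspring_height;
run;

/* bin_medians has one row per value of mid_parent_height_integer */
proc print data=bin_medians;
run;
```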
“10. Plot the individual unisex offspring heights
(daughters additively transmuted) versus the mid-parent height (mothers transmuted).
OVERLAY on it, with a different plotting symbol, the corresponding plot involving
the additively transmuted offspring values (on the parent-axis, stay with
Galton's definition of a midparent).”
Are these two the same? YES YOU ARE RIGHT… THEY ARE!
My mistake… for the overlay, I meant to ask that you plot the multiplicative
version of the offspring. Thanks for spotting this and telling me..
So it is something like..
PROC PLOT data = … ;
PLOT offspring_height_add * midparent
offspring_height_mult * midparent / OVERLAY;
RUN;
Also, do you mean additive or multiplicative in this case?
I meant one of each..