513-613: EPIB-613: Statistical Software (Fall 2006)

McGill University, Department of Epidemiology and Biostatistics
EPIB-613: Statistical Software (Fall 2006)

Frequently Asked Questions (FAQs)

last updated: March 21, 2008

MARCH 2008:

1. You might want to view (and print!) the 1- and 4-page REFERENCE CARDS for R (on main page).
2. The "LITTLE SAS BOOK: a primer", by Lora D. Delwiche and Susan J. Slaughter

The 3rd edition (2003) of this EXCELLENT book is available electronically to McGill students and staff (as many other books are!) via the eBook service that our libraries subscribe to.

see http://www.books24x7.com/marc.asp?bookid=12222

If you go to the online library catalog and search for sas, you will find several other SAS and non-SAS books; to read them, you register with the eBook people (as long as you are at mcgill.ca or connected via VPN) and sign in.

If you are comfortable with the interactive format in which the contents are presented via eBooks, just use this format to find what you are looking for. jh wanted to view the material on the bus or metro using hard copy, and so he bought the book, and also scanned the first six chapters and put them in the 613 website (main page).

This book is for those who want to read and see examples of what sas can do well.. not necessarily to learn sas at this point.. but to be aware of the capabilities, and to turn to seriously when -- and if -- the data for the thesis start to need this kind of data management and analysis.

The UCLA site deals with and illustrates a few of the commonly used functions.

The SAS Online Documentation has a complete listing, along with the details for each function.

    -> Base SAS
        -> SAS Language Reference: Dictionary            
            -> Dictionary of Language Elements
                -> Functions and CALL Routines
                    -> SAS Functions, by category

Question / Suggestion

Oct 20 [2005] See revised entry re midpoint for q 5 below

Oct 19, 2005

We have a few questions regarding the assignment.

1- In question 1, we found the correlations to be very different between Alberta and Berkeley. In other words, the smallest correlation coefficient was with the 1.5 power BMI for the Berkeley data and with the 2.0 power BMI for the Alberta data. Our programming seems sound. Is this caused by the size of the sample we used for the Berkeley dataset?

I expect it is the difference in age of the two samples .. if I'm not mistaken, Alberta subjects are still children, and not fully grown. You may also find different powers for models, ballerinas and football players.
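To see what "the power with the smallest correlation" means, here is a small sketch in Python rather than SAS (only so that it is self-contained). It assumes the correlation in question is the one between the index weight/height^power and height itself; the four subjects are invented so that a power of 2 removes the link with height, and are not from the Alberta or Berkeley data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def index_height_correlation(weights, heights, power):
    """Correlation between weight / height**power and height."""
    index = [w / h ** power for w, h in zip(weights, heights)]
    return pearson(index, heights)

# Invented data: weight = scale * height**2, with 'scale' unrelated to height.
heights = [1.5, 1.6, 1.7, 1.8]
scale = [20.0, 22.0, 22.0, 20.0]
weights = [k * h ** 2 for k, h in zip(scale, heights)]

correlations = {p: index_height_correlation(weights, heights, p)
                for p in (1.5, 2.0, 2.5)}
# The "best" power is the one whose index is least correlated with height.
best_power = min(correlations, key=lambda p: abs(correlations[p]))
print(best_power)
```

With real data, a sample of children and a sample of adults could well point to different powers, which is the point made in the reply above.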

3- In question 2.2, we feel that the heights of fathers and mothers are by nature independent, (?? People self-select, and maybe tend to choose people near their own height.. so most of us expected some (positive) correlation) and therefore the sum of individual variances should be equal to the variance of the sum of these 2 variables?

Indeed, if there were no correlation (a little weaker than independence), then yes the variances should simply add.. I think that you will find that the variance of the sum is a bit bigger than the sum of the variances, so there is some small amount of positive covariance (correlation).
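The reasoning rests on the identity Var(F + M) = Var(F) + Var(M) + 2 Cov(F, M), so any excess of the variance of the sum over the sum of the variances is twice the covariance. A quick check of this, sketched in Python with invented heights (not Galton's data):

```python
def mean(values):
    return sum(values) / len(values)

def covariance(x, y):
    """Sample covariance (divisor n - 1)."""
    mean_x, mean_y = mean(x), mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)

def variance(x):
    """Sample variance, i.e. the covariance of x with itself."""
    return covariance(x, x)

# Invented heights (inches) for five couples.
fathers = [68.0, 70.0, 72.0, 66.0, 69.0]
mothers = [63.0, 64.0, 66.0, 62.0, 65.0]
sums = [f + m for f, m in zip(fathers, mothers)]

var_of_sum = variance(sums)
sum_of_vars = variance(fathers) + variance(mothers)
excess = var_of_sum - sum_of_vars  # equals 2 * covariance(fathers, mothers)
print(var_of_sum, sum_of_vars, excess)
```

In these made-up numbers the taller fathers tend to be paired with the taller mothers, so the excess is positive.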

I think the answer you get is close to no correlation… or is it the result that should decide whether the variables should be called independent or not?

You can let the data tell you.. (By the way, you could have data with a strange dependent pattern, but where the correlation is zero). The correlation coefficient only picks up straight line relations .. if you had a U shape, or inverted U shape, relation, the Pearson correlation could be very close to zero.
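A tiny illustration of the U-shape remark, sketched in Python with made-up points: y is completely determined by x (y = x squared), yet the Pearson correlation is zero.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up, perfectly dependent data with a symmetric U shape.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v ** 2 for v in x]

r = pearson(x, y)
print(r)
```

So zero correlation does not by itself establish independence; a plot of the data is the safer check.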

4- In question 2.6, we were wondering whether we should ignore the fact that there are many missing heights in the offspring dataset (not necessarily missing, but rather, they are often qualitative data rather than numbers)?

YES.. and if you read them in as numbers, SAS will put them to missing.. I suppose we could give a value to “medium” but it is going too far for our purposes.. the main point is that data often have some missing values.. c'est la vie.
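The same idea, sketched in Python as an analogue of what SAS does when a numeric field contains text: anything that cannot be read as a number becomes a missing value. The function name and the raw values below are hypothetical.

```python
def read_height(raw):
    """Return the height as a number, or None (missing) if it is not numeric."""
    try:
        return float(raw)
    except ValueError:
        return None

# Hypothetical raw entries, some qualitative rather than numeric.
raw_heights = ["70.5", "medium", "tall", "66"]
heights = [read_height(value) for value in raw_heights]

# Later calculations simply use the observations that are not missing.
observed = [h for h in heights if h is not None]
print(heights, len(observed))
```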

5- In question 2.9, does midpoint mean (65.5+64.5)/2=65 ???
Revised answer on Oct 20 : if you chop the fractional part of a midparent of 68.2 or one of 68.8, you have 68. So the integer 68 represents the interval 68-69, of which 68.5 is the mid-point.
You can place a mid_parent of 68.2 or 68.8 at the midpoint 68.5 by creating a variable such as
midpoint = FLOOR(mid_parent) + 0.5;

We have the mid-point of the INTERVAL 68-69, AND the mid-PARENT-- two different concepts.
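The arithmetic for the interval midpoint, sketched in Python (math.floor plays the role of the SAS FLOOR function; the function name is just for illustration):

```python
import math

def interval_midpoint(mid_parent):
    """Midpoint of the 1-inch interval that contains mid_parent."""
    return math.floor(mid_parent) + 0.5

for value in (68.2, 68.8):
    print(value, interval_midpoint(value))
```

Both 68.2 and 68.8 land on 68.5, the midpoint of the interval 68-69.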

How do we use proc univariate here? Do we apply class to the binned parental height? YES

or does class bin the parental height for us?

NO; CLASS takes whatever variable you tell it is the class variable, and does the statistics for the subgroups with different values of that variable. Think of the statement "CLASS specified_variable" as "do subgroup analyses, dividing subjects based on specified_variable ..."

See example in class on "Exploring Data" in UCLA series.

With the class statement, we get the descriptive statistics broken down by prgtype

proc means data='c:\sas\hs0' n mean median std var;
class prgtype;
var read math science write;
run;

We can also use CLASS with PROC UNIVARIATE.

You want the medians for the observations in each 'bin' (row in Galton's table). For this you need to have already created a variable, called say mid_parent_height_integer, with the FLOOR function,

eg mid_parent_height_integer = FLOOR(mid_parent_height);

so it will be

proc UNIVARIATE data='c:\sas\hs0';
class mid_parent_height_integer;
var offspring_height ;

The way to save these stats for plotting in the next question is to use an OUTPUT statement in PROC UNIVARIATE, so that you make a new dataset with each value of mid_parent_height_integer and the median of the heights of the offspring in that bin.


CLASS mid_parent_height_integer;
VAR offspring_height ;
OUTPUT OUT=SAS-data-set keyword=name ;

The keyword here is MEDIAN, and name is whatever you want to call the new variable.
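To make clear what that output dataset would contain, here is a sketch in Python of the same "one row per bin, with the median offspring height" table. It is not the SAS procedure itself, and the heights are invented.

```python
import math
from collections import defaultdict
from statistics import median

def medians_by_bin(mid_parent_heights, offspring_heights):
    """Median offspring height for each integer mid-parent bin."""
    bins = defaultdict(list)
    for mid_parent, offspring in zip(mid_parent_heights, offspring_heights):
        # Same binning as mid_parent_height_integer = FLOOR(mid_parent_height)
        bins[math.floor(mid_parent)].append(offspring)
    return {b: median(values) for b, values in sorted(bins.items())}

# Invented mid-parent and offspring heights (inches).
mid_parent = [68.2, 68.8, 69.1, 69.9, 69.5]
offspring = [67.0, 69.0, 70.0, 68.0, 71.0]

table = medians_by_bin(mid_parent, offspring)
print(table)
```

Each key is a value of the class variable and each value is the median for that subgroup, which is what you would then plot.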

“10. Plot the individual unisex offspring heights (daughters additively transmuted) versus the mid-parent height (mothers transmuted). OVERLAY on it, with a different plotting symbol, the corresponding plot involving the additively transmuted offspring values (on the parent-axis, stay with Galton's definition of a midparent).”

Are these two the same? YES, YOU ARE RIGHT… THEY ARE!

My mistake… for the overlay, I meant to ask that you plot the multiplicative version of the offspring. Thanks for spotting this and telling me..

So it is something like..

PROC PLOT data = … ;
 Plot   offspring_height_add  * midparent 
        offspring_height_mult * midparent / OVERLAY;

Also, do you mean additive or multiplicative in this case?

I meant one of each..
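A sketch in Python of the two ways of putting daughters on the male scale, for anyone unsure of the difference. The factor 1.08 is the multiplier Galton used for female heights; the additive shift of 5 inches is only a placeholder (in practice you would estimate it from the data), and the function names are made up for illustration.

```python
def transmute_multiplicative(height, factor=1.08):
    """Scale a female height up by a constant factor (Galton used 1.08)."""
    return height * factor

def transmute_additive(height, shift):
    """Shift a female height up by a constant number of inches."""
    return height + shift

def galton_midparent(father, mother):
    """Galton's mid-parent: average of father and 1.08 times mother."""
    return (father + transmute_multiplicative(mother)) / 2

daughter = 64.0
print(transmute_additive(daughter, shift=5.0))   # placeholder shift
print(transmute_multiplicative(daughter))
print(galton_midparent(70.0, 65.0))
```

The overlay then shows the additive and the multiplicative offspring values against the same mid-parent axis.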

On 9/19/05 3:50 PM, ... wrote:

Hi Dr. Hanley,
I just have a quick question. For the assignment, question 2, when you refer to “the frequency distributions of 'age at death' in the hypothetical cohort, and in the 2001 population (for comparison purposes, the two distributions should be superimposed)”, are you referring to 2001 Canadian males as the hypothetical cohort, compared with the entire 2001 Canadian population (males and females)?

Reply: I should have been more specific. Since the point is to compare theoretical and actual, use males for both.
Sent: Mon 9/12/2005 6:26 PM
To: James Hanley, Dr.
Subject: EPIB-613 Assignment 1 (question)

Hello Professor,

We have successfully completed part 1 of the assignment but part 2 has
been problematic. We can't seem to figure out what we have done wrong.
Would you mind taking a look at it?

Reply: Mon 9/12/2005 10:20 PM

The calculations need to be modified to reflect the fact that for the Canadian table, you are going one year at a time throughout. The American table went for 1 year, then 4 then every 5 thereafter..

look at the definitions of  q and L .. I expect that the 4's and 5's should all be 1's

If 1 yr at a time, then q should be very close to M, whereas if 5 years at a time,  q should be about 5 times M.

L should also reflect the person-years lived in a 1 year interval.
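As a rough check on your formulas, here is a sketch in Python of one common way to go from M to q and L. It assumes deaths are spread evenly over the interval, which may differ in detail from the definitions in the official technical notes; the function names and the numbers are just for illustration.

```python
def death_probability(rate, width):
    """q for an interval of 'width' years, given the death rate M per person-year.

    Assumes deaths are spread evenly over the interval.
    """
    return width * rate / (1 + width * rate / 2)

def person_years(survivors_at_start, deaths, width):
    """L: person-years lived in the interval, under the same assumption."""
    return width * (survivors_at_start - deaths / 2)

M = 0.01  # illustrative death rate, not taken from either life table
q_one_year = death_probability(M, width=1)
q_five_years = death_probability(M, width=5)
print(q_one_year, q_five_years)
print(person_years(100000, 1000, width=1))
```

With a width of 1, q comes out very close to M, and with a width of 5 it is a little under 5 times M. If your q for the Canadian table is around 5 times M, a leftover 5 in the formula is the likely culprit.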

For explanations of the elements, see the lifetable material on the 681 web site
(use c681 as u ser na me and HanlJa44 as p a s s w o r d)

To save a step, and because the link to course 681 may not always work, I have put the key references (Bradford Hill, Selvin, and the details, and technical notes from the official publication on the US lifetables for 2000 and earlier) directly on the 613 website. If you are not able to get into c613 directly, I find that I can get there from some of my other course sites. By the way, I will soon be putting a 'lock' on the c613 website, just like the others.