Course 678 Web Page

621 'learn SAS project' january 2003

The purpose of the project is to gain experience with SAS.

The application is to investigate different definitions of Body Mass Index (BMI).

BACKGROUND

With weight measured in Kilograms, and height in metres, BMI is usually defined as weight divided by the SQUARE of height, i.e.,

BMI = Wt / (Height*Height)

or

BMI = Wt/(height**2) using, as SAS and several other programming languages do, the symbol ** for 'raised to the power of'. [ sorry about my earlier mention of ^ as power.. I was thinking of Excel)

What's special about the power of 2?

Why not a power of 1 i.e., Weight/height?

Why not 3, i.e., Weight/*(height**3) ?

Why not 2.5 i.e. Weight/(height**2.5)?

One of the statistical aims of a transformation of weight to BMI is that BMI be statistically less correlated with height, thereby separating height and height into two more useful components height and BMI. For example in predicting lung function (e.g. FEV1), it makes more sense to use height and BMI than height and weight, since weight has 2 components in it -- it is partly height and partly BMI. Presumably, one would choose the power which minimizes the correlation.

The task in this project is to investigate the influence of the power of height used in the ratio, and to see if the pattern of correlations with power is stable over different settings (datasets).

DATA

To do this, you can obtain your own datasets or use the 6 datasets on the 678 page:

- Children aged 11-16 Alberta 1985 (under 'Datasets')

- 18 year olds in Berkeley longitudinal study, born 1928/29 (under 'Datasets')

- Dataset on bodyfat -- 252 men (see documentation) (under 'Projects' or 'Datasets')

- Pulse Rates before and after Exercise -- Australian undergraduates in 1990's (under 'Projects')

- Miss America dataset 1921-2000 (under 'Resources')

- Playboy dataset 1929-2000 (under 'Resources')

'DELIVERABLE' from each pair of students,
by END OF 621 CLASS on Friday January 24.

1.

A graph, showing, on the x-axis, the powers 1.0 to 3.0 in steps of 0.25 [ or finer if you want to be compulsive ] and on the y-axis the correlation between the 'BMI' obtained with each of these powers, and height. The correlations from each of the 6 datasets should be shown with a different symbol or colour. Graph should be made using SAS [hint: (manually if you can't do so otherwise) make a separate SAS dataset with 54 observations .. i.e. 9 (power,correlation) datapoints for each of 6 datasets].

2.

A one paragraph summary of your findings/conclusions.

3.

Listing of SAS programs used and logs produced (as an appendix).

At least 1 of the 6 datasets should be created by 'importing', 1 read in as 'LINES' directly in the program, 1 created using an INFILE statement that points to an external 'ASCII' file, and 1 created as a permanent dataset.

Question from/ answers to one student...

I am unclear about what it is I am supposed to graph for the SAS project. Is it the correlation between height and BMI (on the y-axis), versus the power of the height (on the x-axis)? YES

If so, how do we go about programming it into SAS?

There are fancier ways that you might want to use later on so as to avoid a lot of duplication and dangerous manual extraction.. much better to automate everything that to go cutting and pasting .. but for now the way indicated on the right will get the job done without getting into too much SAS gymnastics ..

However, you might want to use of the following little bit of organization even now..

to avoid having 6 SEPARATE PROGRAMS

(Imagine if at the end you discover have made the same logic mistake, and you now have to go back and change it in 6 places.. !)
Somewhat tedious but..

how about extracting the 9 (correlation,power) pairs and making them into 9 rows (observations) and 2 columns (variables) ..

do same for next dataset etc until you have a file with 54 rows and 2 columns ie 'stack' the pairs under each other

(actually, put a third column i.e., a variable showing which dataset each correlation is from, to be used a s a plotting symbol..)

then read them these into a new SAS dataset.

eg

DATA corrlns;

INPUT source power corr;
LINES; (or bring the points in from excel or a notepad file)

54 lines of data [ 3 items per line]

;

PROC PLOT DATA=corrlns;
PLOT corr*power = source;

RUN;

DATA dset1; KEEP dset_id ht wt;
INFILE ...;
INPUT ... ;
dset_id=1;
ht = whatever.. ;
wt = whatever ;
RUN;

DATA dset2; KEEP dset_id ht wt;
INFILE ...;
INPUT ... ;
dset_id=2;
ht = ;
wt = ;
RUN;

etc..

Then stack them all together into one dataset...

DATA all;

SET dset1 dset2 dset3 dset4 dset5 dset5 dset6;

compute all the bmi's;

RUN;

PROC SORT DATA = all; BY dset_id;
RUN;

PROC CORR DATA = all; BY dset_id;
var ht;
WITH bmi100 bmi125 ... ;
RUN;

Think of 'SET' as 'READ from'

When you list existing sas datasets in a SET statement, SAS will read in the records from the 1st one until it gets to the end, then continue from the top to the end of the next one, and so on until it reaches the end of the last one.

As with creating any dataset in SAS, one can act on the variables being brought in e.g. to create new ones from them, or to use one or more of them to decide whether to include or exclude the newly created record.

Up to now, I performed a correlation between the different BMI's with height (in metres). (i.e. var bmi1 bmi2 ... ht_m). Is this correct?

What's the next step? YES

See above

I have seen another group's graph, and they have achieved a negative slope (straight line). However in class, you mentionned that it was a quadratic relationship. Therefore, are all datasets supposed to show a parabola, instead of a straight line?

BY THE WAY... don't believe every correlation you see.. it might be worth first doing a quick plot of height vs weight (or PROC UNIVARIATE) to see if the source data points make sense...

If you find wild values, and you cannot fix them, you can exclude them from correlation (or any other!) calculations using the WHERE statement inside the PROC

e.g.,

PROC CORR data = ... ;

WHERE (values are reasonable** );

Var height with bmi100 bmi125 .. etc ;

RUN;

** for example...

using HEIGHT > xx.x and HEIGHT < x.xx You are correct.. they will probably go from positive to zero to negative...

When I drew the plot on the board in class, I was thinking of positive and negative correlations in the same way, ie I was thinking of their absolute values. If we think about the correlation and its sign, we should be looking at where it is close to zero (positive or negative)

I havent given this exercise before, so I dont know the answer .. that is why I asked the question .. how robust and universal is this power of 2 ..

If interested in this issue, you could search the web for Quetelet, the 19th century scientist for whom the BMI index is named

The efforts to come up with a simple prediction equation for body surface area (BSA) are also of interest.

http://www.halls.md/body-surface-area/bsa.htm

The purpose of the project is to gain experience with SAS.

The application is to investigate different definitions of Body Mass Index (BMI).

BACKGROUND

With weight measured in Kilograms, and height in metres, BMI is usually defined as weight divided by the SQUARE of height, i.e.,

BMI = Wt / (Height*Height)

or

BMI = Wt/(height**2) using, as SAS and several other programming languages do, the symbol ** for 'raised to the power of'. [ sorry about my earlier mention of ^ as power.. I was thinking of Excel)

What's special about the power of 2?

Why not a power of 1 i.e., Weight/height?

Why not 3, i.e., Weight/*(height**3) ?

Why not 2.5 i.e. Weight/(height**2.5)?

The task in this project is to investigate the influence of the power of height used in the ratio, and to see if the pattern of correlations with power is stable over different settings (datasets).

DATA

To do this, you can obtain your own datasets or use the 6 datasets on the 678 page: - Children aged 11-16 Alberta 1985 (under 'Datasets')

- 18 year olds in Berkeley longitudinal study, born 1928/29 (under 'Datasets')

- Dataset on bodyfat -- 252 men (see documentation) (under 'Projects' or 'Datasets')

- Pulse Rates before and after Exercise -- Australian undergraduates in 1990's (under 'Projects')

- Miss America dataset 1921-2000 (under 'Resources')

- Playboy dataset 1929-2000 (under 'Resources')

'DELIVERABLE' from each pair of students, by END OF 621 CLASS on Friday January 24.

1.

2.

A one paragraph summary of your findings/conclusions.

3.

Listing of SAS programs used and logs produced (as an appendix).

At least 1 of the 6 datasets should be created by 'importing', 1 read in as 'LINES' directly in the program, 1 created using an INFILE statement that points to an external 'ASCII' file, and 1 created as a permanent dataset.

Question from/ answers to one student...

how about extracting the 9 (correlation,power) pairs and making them into 9 rows (observations) and 2 columns (variables) ..

do same for next dataset etc until you have a file with 54 rows and 2 columns ie 'stack' the pairs under each other (actually, put a third column i.e., a variable showing which dataset each correlation is from, to be used a s a plotting symbol..)

then read them these into a new SAS dataset.

eg

DATA corrlns;

INPUT source power corr; LINES; (or bring the points in from excel or a notepad file)

54 lines of data [ 3 items per line]

;

PROC PLOT DATA=corrlns; PLOT corr*power = source;

RUN;

BMI = Wt/(height2) using, as SAS and several other programming languages do, the symbol for 'raised to the power of'. [ sorry about my earlier mention of ^ as power.. I was thinking of Excel)

To do this, you can obtain your own datasets or use the 6 datasets on the 678 page:

- Children aged 11-16 Alberta 1985 (under 'Datasets')

'DELIVERABLE' from each pair of students,
by END OF 621 CLASS on Friday January 24.

do same for next dataset etc until you have a file with 54 rows and 2 columns ie 'stack' the pairs under each other

(actually, put a third column i.e., a variable showing which dataset each correlation is from, to be used a s a plotting symbol..)

INPUT source power corr;
LINES; (or bring the points in from excel or a notepad file)

PROC PLOT DATA=corrlns;
PLOT corr*power = source;