Course 613.. Assignments

McGill University, Department of Epidemiology, Biostatistics and Occupational Health

EPIB 613: Introduction to Statistical Software (Fall 2006)

Assignment 2, due October 2, 2006

Working in teams of two or three...


Create 5 datasets from the questionnaire data on temperature perceptions etc.

(i) by importing directly from the Excel file, using the import wizard in SAS; or via the 'insheet' command in Stata (applied to .csv version of Excel file);

(ii) by first removing the first row (of variable names) and exporting the Excel file into a 'comma-separated-values" (.csv) text file, then ...

reading the data in this .csv file via the INFILE and INPUT statements in a SAS DATA step,


or via the 'infile' command in Stata

(iii) by reading the data in the text file temps_1.txt into

the SAS dataset via the INFILE and INPUT statements. Notice that the 'missing' values use the SAS representation (.) for missing values.

or the Stata dataset using the 'infile' command

(iv) by reading the data in the text file temps_2.txt via [in SAS] the INFILE and INPUT statements in a DATA step or [in Stata] the 'infix' command.

Here you will need to be careful, since 'free-format' will not work correctly (it is worth trying free format with this file, just to see what goes wrong!). When using the INFILE method, you can control some of the damage by using the 'MISSOVER' option in the INFILE statement: this keeps the INPUT statement from continuing on into the next data line in order to find the (in our example) 11 values implied by the variable list. JH uses this 'defensive' option in ALL of his INFILE statements.

(v) by cutting and pasting the contents of the text file temps_2.txt directly into the SAS or Stata program - in SASthe lines of data go immediately after the DATALINES statement, and there needs to be a line containing a semicolon to indicate the end of the data stream. In Stata, the lines of data go immediately after the infile or infix statement, and there needs to be a line containing the word 'end' to indicate the end of the data stream

This Cut and Paste Method is NOT RECOMMENDED when the number of observations is large, as it is too all too easy to inadvertently alter the data, and the SAS/Stata porogram becomes quite long and unwieldy. It is Good Data Management Practice to separate the program statements from the data.

[Run [in SAS] PROC MEANS [in Stata] the 'describe' command, on the numerical variables, and [in SAS] PROC FREQ or [in Stata] the 'tabulate' command, on the non-numerical variables, to check that the 5 datasets you created contain the same information. Also, get in the habit of viewing or printing several observations and checking the entries against the 'source'.

When using (i), have SAS show you the SAS statements generated by the wizard. Store these, and the DATA steps for (ii) to (v) in a single SAS program file (with suffix .sas).

Annotate liberally using comments:

in SAS, either begin with * ; or enclose with /* ... */

in Stata ..begin the line with * or place the comment between /* and */ delimiters or begin the comment with // or begin the comment with ///




  Use one of these 5 datasets, and the appropriate [in SAS, PROCs (see Exploring Data under UCLA SAS Class Notes 2.0)], or [in Stata, the list comamnd, and the analyses from the Statistics menu] to

(i) list the names and characteristics of the variables

(ii) list the first 5 observations in the dataset

(iii) list the id # and the responses just to q3, w5 and q6, for all respondents, with respondents in the order: female MDs, male MDs, female non-MDs, male non-MDs. Indicate the [sub-]statement that is required to reverse this order.

(iii) create a 2-way frequency table, showing the frequencies of respondents in each of the 2 (MD nonMD) x 2 (male female) = 4 'cells' (one defintion of an epidemiologist is 'an MD broken down by age and sex'). Turn off all the extra printed output, so that the table just has the cell frequencies and the row and column totals.

(iv) compare the mean and median attitude to exams in MDs vs. non-MDs (hint: in SAS, the CLASS statement may help). Get SAS/Stata to limit the output to just the 'n', the min, the max, the mean and the median for each subgroup. And try to also get it to limit the number of decimal places of output (in SAS the MAXDEC option is implememnted in some procedures, but as far as JH can determine not in all)

(v) compare the mean temperature perceptions (q6) of male and female respondents

(vi) [in SAS] create a low-res ('typewriter' resolution) scatterplot of the responses to q5 (vertical axis) vs. q4 (horizonatal axis), using a plotting symbol that shows whether the responsdent is a male or a female. If we have not covered how to show this '3rd dimension', look at the ONLINE Documentation file {the guide for most of the procedures covered in this set of exercises is in the Base SAS Procedures Guide; other procedures are in sthe more advanced 'STAT' module}. You can specify the variable whose values are to mark each point on the plot. See PLOT statement in PROC PLOT, and the example with variables
height weight and gender.
[in Stata] use the (automatically hi-res) graphics capabilities available from the 'Graphics' menu

[if SAS] Put all of the programs for Q1, and all of these program steps and output for Q2 in a single .txt file (JH will use a mono-spaced font such as Courier to view it -- that way the alignment should be OK), with PROC statements interleaved with output, and a helpful 2-line title (produced by SAS, but to your specifications) over top of each output. Get SAS to set up the output so that there are no more that 65 horizontal characters per line (that way, lines won't wrap-around when JH views the material).

[if Stata] paste the results and graphics into Word.

NOTE: To be fair to SAS, it CAN produce decent (and even some publication-quality) graphics. See

Then submit the text file electronically (i.e., by email) to JH by 9 am on Monday October 2.

(updated Sept 18, 2006)