McGill University, Department
of Epidemiology, Biostatistics and Occupational Health
EPIB 613: Introduction to Statistical Software (Fall 2006)
Assignment 2, due October 2, 2006
Working in teams of two or three...
Q1
Create 5 datasets from the questionnaire data on temperature
perceptions etc.
(i) by importing directly from the Excel file, using the import wizard in SAS; or
via the 'insheet' command in Stata (applied to .csv version of Excel file);
(ii) by first removing the first row (of variable names) and exporting the Excel
file into a 'comma-separated-values' (.csv) text file, then ...
reading the data in this .csv file via the INFILE and INPUT statements in a SAS DATA
step,
[SAS]
INFILE 'path' DELIMITER=',';
INPUT ID MALE $ MD $ EXAM TEMPOUTC TEMPINC TEMPOUTF TEMPINF TEMPFEEL TIME PLACE $;
or via the 'infile' command in Stata;
(iii) by reading the data in the text file temps_1.txt into
a SAS dataset via the INFILE and INPUT statements, or into a Stata dataset using
the 'infile' command. Notice that the missing values in this file are coded using
the SAS representation (.) for missing values.
(iv) by reading the data in the text file temps_2.txt via [in SAS] the INFILE and
INPUT statements in a DATA step or [in Stata] the 'infix' command.
Here you will need to be careful, since 'free-format' will not work correctly
(it is worth trying free format with this file, just to see what goes wrong!). When
using the INFILE method, you can control some of the damage by using the 'MISSOVER'
option in the INFILE statement: this keeps the INPUT statement from continuing on
into the next data line in order to find the (in our example) 11 values implied by
the variable list. JH uses this 'defensive' option in ALL of his INFILE statements.
(v) by cutting and pasting the contents of the text file temps_2.txt directly into
the SAS or Stata program - in SAS the lines of data go immediately after the DATALINES
statement, and there needs to be a line containing a semicolon to indicate the end
of the data stream. In Stata, the lines of data go immediately after the infile or
infix statement, and there needs to be a line containing the word 'end' to indicate
the end of the data stream.
This Cut and Paste Method is NOT RECOMMENDED when the number of observations is large,
as it is all too easy to inadvertently alter the data, and the SAS/Stata program
becomes quite long and unwieldy. It is Good Data Management Practice to separate
the program statements from the data.
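As a rough illustration of the in-program (DATALINES) approach combined with the 'defensive' MISSOVER option, here is a minimal SAS sketch. It is not JH's solution: the dataset name temps_demo and the three comma-separated data lines are invented for illustration and are not taken from temps_2.txt or the Excel file. The DSD option is an addition that is not mentioned above.

```sas
/* Sketch only: temps_demo and the data lines below are hypothetical,
   NOT the course data. */
DATA temps_demo;
  /* DSD: two consecutive commas are read as a missing value.
     MISSOVER: if a line runs out of values, the remaining variables
     are set to missing instead of being read from the next line. */
  INFILE DATALINES DELIMITER=',' DSD MISSOVER;
  INPUT ID MALE $ MD $ EXAM TEMPOUTC TEMPINC TEMPOUTF TEMPINF TEMPFEEL TIME PLACE $;
  DATALINES;
1,Y,N,3,15,21,59,70,2,9,home
2,N,Y,,12,20,54,68,3,10,work
3,Y,Y,4,14
;
RUN;

/* Print the result to see how the short third line was handled */
PROC PRINT DATA=temps_demo;
RUN;
```

Removing MISSOVER and re-running is a quick way to see what happens when INPUT is allowed to continue on to the next data line.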
Run [in SAS] PROC MEANS or [in Stata] the 'describe' command on the numerical variables,
and [in SAS] PROC FREQ or [in Stata] the 'tabulate' command on the non-numerical
variables, to check that the 5 datasets you created contain the same information.
Also, get in the habit of viewing or printing several observations and checking the
entries against the 'source'.
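A minimal SAS sketch of these checks follows. It assumes nothing about the course data: SASHELP.CLASS, a sample dataset shipped with SAS, is used purely as a stand-in, and its variable names (Age, Height, Weight, Sex) would be replaced by those of your own datasets.

```sas
/* Stand-in dataset: SASHELP.CLASS is a SAS sample dataset, not the course data */

/* Summary statistics for the numerical variables */
PROC MEANS DATA=sashelp.class;
  VAR Age Height Weight;
RUN;

/* Frequency counts for a non-numerical (character) variable */
PROC FREQ DATA=sashelp.class;
  TABLES Sex;
RUN;

/* Print the first few observations to compare against the source */
PROC PRINT DATA=sashelp.class (OBS=5);
RUN;
```

Running the same three steps on each of the 5 datasets and comparing the output side by side is one simple way to confirm they agree.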
When using (i), have SAS show you the SAS statements generated by the wizard. Store
these, and the DATA steps for (ii) to (v) in a single SAS program file (with suffix
.sas).
Annotate liberally using comments:
in SAS, either begin the comment with * and end it with ; or enclose it between /* and */
in Stata, begin the line with * or place the comment between /* and */ delimiters,
or begin the comment with // or begin the comment with ///
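The two SAS comment styles can be sketched as follows. The DATA _NULL_ step is only a placeholder so that the fragment runs on its own.

```sas
* A comment statement starts with an asterisk and ends at the semicolon;

/* A delimited comment can span
   several lines and may itself contain semicolons; */

DATA _NULL_;                 /* a delimited comment can also follow a statement */
  PUT 'Comments do not affect what this step writes to the log';
RUN;
```

The /* ... */ form is the safer choice for temporarily 'commenting out' a block of statements, since the semicolons inside the block do not end the comment.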
Q2
Use one of these 5 datasets, and the appropriate [in
SAS, PROCs (see Exploring Data under UCLA SAS Class Notes 2.0)], or [in Stata, the
list command, and the analyses from the Statistics menu] to
(i) list the names and characteristics of the variables
(ii) list the first 5 observations in the dataset
(iii) list the id # and the responses just to q3, q5 and q6, for all respondents,
with respondents in the order: female MDs, male MDs, female non-MDs, male non-MDs.
Indicate the [sub-]statement that is required to reverse this order.
(iv) create a 2-way frequency table, showing the frequencies of respondents in each
of the 2 (MD, non-MD) x 2 (male, female) = 4 'cells' (one definition of an epidemiologist
is 'an MD broken down by age and sex'). Turn off all the extra printed output, so
that the table just has the cell frequencies and the row and column totals.
(v) compare the mean and median attitude to exams in MDs vs. non-MDs (hint: in SAS,
the CLASS statement may help). Get SAS/Stata to limit the output to just the 'n',
the min, the max, the mean and the median for each subgroup. Also try to get
it to limit the number of decimal places of output (in SAS the MAXDEC option is implemented
in some procedures but, as far as JH can determine, not in all).
(vi) compare the mean temperature perceptions (q6) of male and female respondents
(vii) [in SAS] create a low-res ('typewriter' resolution) scatterplot of the responses
to q5 (vertical axis) vs. q4 (horizontal axis), using a plotting symbol that shows
whether the respondent is a male or a female. If we have not covered how to show
this '3rd dimension', look at the ONLINE Documentation file {the guide for most of
the procedures covered in this set of exercises is in the Base SAS Procedures Guide;
other procedures are in the more advanced 'STAT' module}. You can specify the variable
whose values are to mark each point on the plot. See the PLOT statement in PROC PLOT,
and the example with variables height, weight and gender.
[in Stata] use the (automatically hi-res) graphics capabilities available from the
'Graphics' menu
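For the SAS route, the general shape of the frequency-table, subgroup-summary and scatterplot steps can be sketched as below. This is an illustration on the SASHELP.CLASS sample dataset (variables Sex, Age, Height, Weight), not on the questionnaire data and not a model answer; the variable choices are assumptions made only so that the code runs as written.

```sas
/* Stand-in dataset: SASHELP.CLASS is a SAS sample dataset, not the course data */

/* 2-way table with only cell frequencies and row/column totals:
   NOPERCENT, NOROW and NOCOL suppress the three kinds of percentages */
PROC FREQ DATA=sashelp.class;
  TABLES Sex*Age / NOPERCENT NOROW NOCOL;
RUN;

/* Subgroup summaries limited to n, min, max, mean and median,
   with output rounded to 1 decimal place via MAXDEC= */
PROC MEANS DATA=sashelp.class N MIN MAX MEAN MEDIAN MAXDEC=1;
  CLASS Sex;
  VAR Height;
RUN;

/* Typewriter-resolution scatterplot: vertical*horizontal=marker variable.
   The first character of each Sex value is used as the plotting symbol. */
PROC PLOT DATA=sashelp.class;
  PLOT Height*Weight=Sex;
RUN;
QUIT;
```

In each step, the statistic keywords and table options are what trim the default output; omitting them shows how much extra SAS prints otherwise.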
[if SAS] Put all of the programs for Q1, and all of these program steps and
output for Q2 in a single .txt file (JH will use a mono-spaced font such as
Courier to view it -- that way the alignment should be OK), with PROC statements
interleaved with output, and a helpful 2-line title (produced by SAS, but to your
specifications) over top of each output. Get SAS to set up the output so that there
are no more than 65 horizontal characters per line (that way, lines won't wrap around
when JH views the material).
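One way to get a 2-line title and a 65-character line width in SAS listing output is sketched below. The title wording is a made-up example, and SASHELP.CLASS again stands in for your own dataset.

```sas
/* Limit listing output to 65 characters per line */
OPTIONS LINESIZE=65;

/* Hypothetical title text: replace with your own wording */
TITLE1 'EPIB 613 Assignment 2: example output';
TITLE2 'First 5 observations of the stand-in dataset';

PROC PRINT DATA=sashelp.class (OBS=5);
RUN;

/* A TITLE statement with no text clears all titles,
   so they do not carry over to later steps */
TITLE;
```

Titles stay in effect until they are changed or cleared, so each PROC step that needs its own heading should be preceded by its own TITLE1 and TITLE2 statements.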
[if Stata] paste the results and graphics into Word.
NOTE: To be fair to SAS, it CAN produce decent (and even some publication-quality)
graphics. See http://www.ats.ucla.edu/stat/sas/topics/graphics.htm
Then submit the text file electronically (i.e., by
email) to JH by 9 am on Monday October 2.
(updated Sept 18, 2006)