Course 678 Web Pages

Some Notes on Using SAS via SAS EDITOR

(j.h. and a.n. 97.06.07)

See also notes prepared by Marielle Olivier, available on shelf in computer laboratory

SAS is organized around 3 windows

PROGRAM EDITOR
LOG
OUTPUT

Typical sequence is to

prepare program commands in the PROGRAM window, save them, then submit them for batch processing
examine the 'log' or 'report' displayed in the LOG window. Errors are highlighted. The most common are
- semicolons (;), used to signify the end of a statement, are missing
- the names of variables, procedures or options are misspelled
examine output displayed in OUTPUT window (if program was successful)
(if not) fix up any errors and submit the program again.

Before resubmitting, you will wish to 'clear' the LOG window; you may also wish to clear the OUTPUT window, so that output from previous submissions in the same session does not accumulate and confuse.

A SAS Program

A SAS 'program' (at least for a beginning user, working on a small dataset) is likely to consist of the following:

DATA step
RUN (to close the DATA step... not essential but helpful)
PROC (short for PROCEDURE)
RUN (again optional... but cannot hurt)
(Maybe another) PROC
etc
RUN (a statement to process the above requests; one RUN statement is essential)

You can save the 'program' for future use/modification. Do so from the file menu when in the PROGRAM window. Many users use the suffix '.sas' to designate a file containing SAS statements and requests for procedures to be run.

The DATA step -- overview

There are a number of ways to set up the data for use in one or more PROCS.

have the 'raw' data stored as text in a separate ASCII file, and have SAS read it in and 'parse' it into 'observations' using the INPUT statement; RECOMMENDED!!
have the 'raw' data listed as text in the program itself, and have SAS read it in and 'parse' it into 'observations' using the INPUT statement;

When you run the DATA step, SAS sets up (internally, not visible to you) a binary file containing the names of the variables, together with the data. This datafile (we call it a SAS dataset) is then accessible to all PROCs for the remainder of the session. By default, it is a 'scratch' file that will disappear at the end of the session when you exit SAS; if it is a big file that takes a lot of time to create, and that you will be going back to many times for further analyses, you might want to make it 'permanent'. There is no need to do so in this course. Provided you keep the '*.sas' file (and the raw data file, if you keep the data in a separate ASCII file), you can always recreate the SAS dataset just by going back to your file containing the DATA step at another time and rerunning it.
have the data already saved in a permanent SAS dataset. This way, you bypass the INPUT statement (since the variable names are already stored with the permanent dataset) and use the SET statement to read the observations in from the permanent dataset. We will not need this way of doing things for this course, but you would probably want to do so for your thesis.
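
As a rough sketch of this last way (not needed for this course): a permanent dataset is normally created by giving the dataset a two-level name, whose first part is a library reference set up with a LIBNAME statement. The library name 'mylib', the directory and the raw data file 'heights.dat' below are made-up names for illustration only; substitute your own.

```sas
* mylib and the directory are hypothetical names ;
LIBNAME mylib 'c:\thesis\sasdata';

DATA mylib.heights;                  * two-level name = permanent dataset ;
  INFILE 'heights.dat';              * hypothetical raw data file ;
  INPUT age gender height weight;
RUN;

* ... in a later session ... ;
LIBNAME mylib 'c:\thesis\sasdata';

DATA a;
  SET mylib.heights;                 * no INPUT needed, names are already stored ;
RUN;
```

The second DATA step makes a scratch copy called 'a'; a PROC could also read the permanent dataset directly, e.g. PROC MEANS DATA=mylib.heights;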

The DATA step -- in more detail

The step begins with the reserved word DATA

This tells SAS that you wish to set up -- permanently or just for this session -- a SAS dataset.

In SAS parlance, a dataset is a number of 'observations' ... we might call them 'cases' or 'subjects'. Each 'observation' consists of the same fixed number of variables; so you might think of the dataset as a 'rectangular' file of so many cases ('observations' or 'rows') and so many variables per observation ('columns').

Following the word DATA you supply a name for the dataset you are asking SAS to create. To save typing, and since I seldom have to use the name later in the program, I typically call the dataset 'a'; you might want to get fancier e.g.
- DATA alcohol;
  
  or...
- DATA heights;
  
  etc.
Remember to put a semicolon (;) after the name.
Next (if you have the raw data in a separate file) ... the INFILE statement.

This is just a pointer to the file in question. To be safe, you can give the full path name...

e.g. INFILE 'a:\course67\alcohol.dat' ;

Remember to enclose the path name in single quotes, and -- because INFILE is a SAS statement -- to end it with a semicolon.
Next (if, as you are likely to be doing in this course, you are creating the SAS dataset from raw data) ... the INPUT statement.

This is a directive to SAS on how to 'parse' the file containing the raw data. In this course this can be as simple as supplying the names you want to give each variable. Most of the raw data files we have put on the www site have spaces separating the variables (so-called "free format"), so no special instructions on starting and ending columns are needed. So for example, you might say

INPUT age gender height weight ; (remember the ';' !!!)

If you wanted to tell SAS that age was always in columns 1-2, gender in column 4 etc, you could put

INPUT age 1-2 gender 4 height 6-9 weight 11-14;

By default, SAS will assume that variables are numeric; if you have a variable containing alpha-numeric data (e.g. if in the raw datafile you had m for male and f for female), you would tell SAS so by putting a $ after the name of the variable:

INPUT age 1-2 gender $ 4 height 6-9 weight 11-14;

If gender is alphanumeric, it can be used in tabulations etc but not in any arithmetic. It can be used as a 'class' variable in regression/anova procedures. One can always make a new variable e.g. using the statements

gender_n = . ;
if gender = 'm' then gender_n = 0; * NOTE: 'm' and 'M' not same;
if gender = 'f' then gender_n = 1;
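
Put together, a minimal sketch of a complete DATA step using these statements might look as follows. To keep it self-contained, the data are listed in the program after a DATALINES statement (CARDS is an older name for the same statement) rather than read with INFILE. The three data lines are invented purely for illustration.

```sas
DATA a;
  INPUT age gender $ height weight;    * $ marks gender as character ;
  gender_n = . ;                       * start as missing ;
  IF gender = 'm' THEN gender_n = 0;
  IF gender = 'f' THEN gender_n = 1;
  DATALINES;
12 m 1.50 41.0
13 f 1.55 44.5
14 M 1.62 50.2
;
RUN;

PROC PRINT;                            * lists the dataset just created ;
RUN;
```

In the listing, the third observation should show gender_n as missing (.), since the upper-case 'M' matches neither of the two IF statements.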

Most users prefer to represent gender numerically from the start. It also saves on data entry if the enterer can use the numeric keypad rather than hunting for the m and f keys.

If you want to have the labels m and f (or male/female or whatever...) rather than the 0/1 appear on printouts, you can do so by defining the labels with PROC FORMAT and attaching them to the variable with the FORMAT statement in the DATA step.
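
A minimal sketch of such labelling, assuming gender is coded 0 = male and 1 = female as in the recoding above. The labels themselves are defined beforehand with PROC FORMAT and then attached to the variable with a FORMAT statement. The format name 'sexfmt' is made up for illustration, and 'alberta.dat' is simply the raw data file used in the example later in these notes.

```sas
PROC FORMAT;
  VALUE sexfmt 0 = 'male'              /* sexfmt is a hypothetical name */
               1 = 'female';
RUN;

DATA a;
  INFILE 'alberta.dat';
  INPUT id_no age gender height weight;
  FORMAT gender sexfmt.;               /* note the period after the format name */
RUN;

PROC FREQ;
  TABLES gender;                       /* table shows the labels, not 0/1 */
RUN;
```

The values stored in the dataset remain 0 and 1, so gender can still be used in arithmetic and in IF statements; only the way it is displayed changes.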
Next (if you want to create derived variables or exclude certain observations from the dataset being created) ...

programming statements such as...

bmi = weight / (height*height); /* creates a new variable
                                   and adds it to dataset */

if gender = 1; /* includes only those with gender = 1 */

a_g_term = age*gender; /* creates interaction term */

Notes:

You can put comments in your program in two ways:-
- by starting the statement with an asterisk and ending it (as usual) with a semicolon...
  
  e.g.
  
  * include females only ;
  
  * the next steps are to set up data for table 1;
- by surrounding the comment(s) by
  
  /* at the beginning
  
  and
  
  */ at the end .......see bmi example above
  
  SAS ignores everything in between.
  
  This trick is helpful when you want to run just a part of a program but don't want to delete any of your hard-thought-out steps or PROCs that you might want some other time.
Statements can run on from one line to the next, and you can use blank lines for readability. Indents also help show the structure of the program. I find it helpful to put names of variables in lower case and use upper case for reserved SAS words. The name of a variable can have a maximum of 8 characters (later releases of SAS allow up to 32) and must start with a letter; you can use underscores for readability, e.g. age_dx, age_tx.
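
The /* ... */ trick mentioned above, for running just part of a program, can be sketched as follows. The file and variable names are borrowed from the example later in these notes. One caution: these comments cannot be nested, so if the block you switch off already contains a /* ... */ comment, SAS will end the comment at the first */ it meets.

```sas
DATA a;
  INFILE 'alberta.dat';
  INPUT id_no age gender height weight;
RUN;

/* PROC MEANS temporarily switched off
PROC MEANS;
  VAR height weight;
RUN;
*/

PROC PLOT;
  PLOT weight*height;
RUN;
```

Here only the DATA step and PROC PLOT are run; deleting the two comment lines brings PROC MEANS back.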

EXAMPLE of DATA step followed by SORT and several procedures;

DATA alberta;
  INFILE 'alberta.dat';
  INPUT id_no age gender height weight;
  bmi = weight / height**2;        * **2 is same as 'to power of 2';
  if age >= 11 and age <= 15;      * careful with 'ands' and 'ors' ;

PROC SORT; BY gender;              * sorts the dataset 'alberta' by gender;
                                   * otherwise leaves dataset contents as is;
RUN;

PROC MEANS;
  var height weight;
  BY gender;                       * repeats procedure for each gender;
                                   * must have used SORT beforehand;

PROC PLOT FORMCHAR='-----------';  /* formchar supplies character */
                                   /* for borders of plot */
  PLOT weight*height = gender;     * uses values of gender as symbol;
RUN;

PROC PLOT;
  PLOT Y1*X = '1' Y2*X='2' / OVERLAY;
                                   * puts both plots on same graph ;
                                   * using the symbols 1 and 2 respectively;
                                   * Y1, Y2 and X stand for your own variable names;

PROC GLM;
  MODEL weight = height ;
  BY gender;
RUN;

PROC REG;
  MODEL weight = height;           * like GLM but uses continuous x's only ;
  BY gender;                       * does not allow 'class' variables ;
                                   * does not produce Type I and III SS ;

DATA males;                        * creating a new dataset;
  SET alberta;                     * reads observations from existing dataset ;
                                   * created earlier in session, or stored as ;
                                   * a permanent dataset, ... ;
  IF gender = 0;                   * allows only those with gender = 0 to be ;
                                   * taken into new dataset ;
RUN;

DATA females;                      * creating yet another... ;
                                   * alberta and males still exist and are ;
                                   * available to all PROCS ;
  SET alberta;
  IF gender = 1;
RUN;

A PROGRAM TO ILLUSTRATE SOME SELECTED PROCEDURES AND FEATURES OF SAS : MEANS, PLOT, GLM, REG, OUTPUT, MERGE, OVERLAY, BOX

OPTIONS LS=65 PS=65;

DATA a;
  INFILE 'a:kkm5_8.dat';
  INPUT salary gpa;
  ID = _N_;
RUN;

DATA f2; SET a;

PROC MEANS;

PROC PLOT;
  PLOT salary * gpa;

PROC GLM;
  MODEL salary = gpa;

PROC REG;
  MODEL salary = gpa / CLM CLI;    /* CI for mean, individuals */
  OUTPUT OUT = temp PREDICTED = p
         L95M=lm U95M=um L95=li U95=ui;
RUN;

DATA f3; SET temp; ID = _N_; RUN;

DATA f4; MERGE f2 f3; BY id; RUN;

DATA f5; SET f4;

PROC PLOT;
  PLOT salary*gpa='s' p*gpa='p'
       lm*gpa='*' um*gpa='*'
       li*gpa='+' ui*gpa='+' / OVERLAY BOX;
RUN;

General comments

If you minimize the PROGRAM window before you run the program, you will be able to see the LOG window and tell by the colours of the messages whether your program has been successful!!

OUTPUT and LOG windows

You can save the contents of these windows:- use the 'save' or 'save as' command in the file menu.

You can customize the width (number of characters across) and height (number of lines down) of the OUTPUT pages... using the OPTIONS statement at the beginning of the program...

PAGESIZE (or PS for short): number of lines on the page
LINESIZE (or LS for short): number of character spaces across the page

e.g.

OPTIONS LINESIZE = 75 PAGESIZE = 60; /* 60 lines of 75 characters */

If you save the OUTPUT or LOG file and then open it in a word processor, it is better to use a mono-spaced font such as Courier ... otherwise tables and plots will not line up.