/* modified Feb 1 to remove remnants of unrelated SAS code */

/* --- for SAS --- */

/* Given the size of the raw data file, it is better to keep the data
   separate from the program, so use INFILE rather than LINES.

   Download the pros.dat file, save it somewhere on the hard disk
   (remember the path!), then have the INFILE statement point to it,
   i.e. give the full path.
   E.g. if you store the .dat file in a sub-directory or folder called
   c:\681folder\ , the path would be "c:\681folder\pros.dat"

   The MISSOVER option in INFILE is an important safeguard against SAS
   going into the next line of raw data in order to find as many data
   items as there are variables in the INPUT statement (e.g. if for some
   reason you had blank fields). With MISSOVER, you can limit the damage
   to the offending record. */

OPTIONS LINESIZE=85 PAGESIZE=60 ;   /* change #chars/line #lines/page */
RUN;

DATA prostate;   /* make a 2-part name if you wish to create a perm. file */
                 /* rather than re-creating the dataset each time         */
                 /* then next time would use library instead of DATA step */

 *INFILE "c:\681folder\pros.dat" MISSOVER;
  INFILE "Macintosh HD:User:dad:courses:681:alr_1:pros.dat" MISSOVER;

  INPUT id capsule age race dpros dcaps psa vol gleason;

  /* create PSA categories */
  psa_cat = .;   /* program defensively */
  IF  0 <= psa <   4 THEN psa_cat = 1;
  IF  4 <= psa <  10 THEN psa_cat = 2;
  IF 10 <= psa <  20 THEN psa_cat = 3;   /* JH lazy, H&L want finer ones */
  IF 20 <= psa <  50 THEN psa_cat = 4;
  IF 50 <= psa < 900 THEN psa_cat = 5;

  IF id NE . ;   /* keep only records that have an id */

/* For explanation of var. names & codes, cf. Table 1.7, p 27, 2nd ed. of H&L.
   Good idea to put codes as comments (or even as labels) in the program.

   If you wish, change the variable names to something more meaningful
   (I don't find dpros particularly memorable, and suggest maybe dre_rslt)
   and use up to the max of 8 characters allowed by SAS version 6.
   (>8 with version 8..
   but you can't interchange programs with someone using version 6)
   Don't (as I notice many do) try to save wear and tear on your
   fingernails by using the fewest letters possible. Make it easy for
   others (e.g. your chief, yourself in a year's time, ..) to understand.

   Variable LABELS are helpful if you are going to be showing output to
   your chief.. (I put 2 in below to remind myself what dpros and dcaps
   are, and 1 to remind me what boundaries I chose for psa_cat.)

   You can also put in value labels via a PROC FORMAT.
   PROC FORMAT is run outside of the DATA step, and the same value labels
   can be used across unrelated datasets.. a common use is for showing
   the verbal description of numerical codes,
   e.g. if you use 1=yes, 2=no [or 1=no, 2=yes] or 1=yes, 0=no

   JH's suggestion for yes/no data .. or other binary (dichotomous) vars:

   USE an INDICATOR variable DIRECTLY.. where

      1 = the PRESENCE of the condition/state/trait
      0 = the ABSENCE  of the condition/state/trait

   e.g. INSTEAD OF naming a variable 'SEX' and then trying to decide
   (or later, remember) which code you used for male and which for
   female, WHY NOT call the variable I_male

      0 = no , ie female
      1 = yes, ie male ?

   This admittedly asymmetric way to choose which state/trait gets the 1
   does however mean that you won't ever have to look it up later..

   By the way, in better families we do not refer to these variables
   as "DUMMY" variables.
   [I did have a major mis-communication with someone else about my use
   of 'indicator' .. this person was using 'indicator' the way people do
   when they speak of (say) economic 'indicators' ] */

  LABEL dpros   = 'Results of the Digital Rectal Exam' ;
  LABEL dcaps   = 'Detection of Capsular Involvement in Rectal Exam' ;
  LABEL psa_cat = 'PSA category 0-4 4-10 10-20 20-50 >50' ;
RUN;

* -------------------------------------------------------;

TITLE Some data checking...;
TITLE2 Note: use MAXDEC to keep # decimal places sensible ;
TITLE3 Transcribing/reporting Default # of decimal places: naive!!
;

PROC MEANS DATA=prostate MAXDEC=1;
  /* use MAXDEC to keep # decimal places sensible;
     default # of SAS-reported decimal places: sign of a novice */
RUN;

* -------------------------------------------------------;

TITLE Some more data checking... (more detail for interval variables);

PROC UNIVARIATE DATA=prostate MAXDEC=1;
  VAR psa vol gleason;
RUN;

/* This last PROC may not have run. You would think SAS would have the
   MAXDEC option for all procedures.. It does not, or at least, not in
   my version 6. So you may have to go and remove this option from
   PROC UNIVARIATE. You don't have to remove it entirely.. just comment
   it out [maybe SAS will put it into the next version.. here's hoping ] */

PROC UNIVARIATE DATA=prostate /* MAXDEC=1 */ ;
  VAR psa vol gleason;
RUN;

* -------------------------------------------------------;

TITLE Some more data checking... (more detail for categorical variables);

PROC FREQ DATA=prostate;
  TABLES capsule race dpros dcaps gleason psa_cat ;
RUN;

* -------------------------------------------------------;

TITLE PLOT the Raw data ;

PROC PLOT DATA=prostate;
  PLOT capsule * psa / HPOS=75 VPOS=15 ;
RUN;

* -------------------------------------------------------;

TITLE PLOT the data with psa categorized;

PROC PLOT DATA=prostate;
  PLOT capsule * psa_cat / HPOS=75 VPOS=15 ;
RUN;

* -------------------------------------------------------;

TITLE Table of proportions positive ;

PROC FREQ DATA=prostate;
  TABLES capsule * psa_cat / NOROW NOPERCENT ;
RUN;

* -------------------------------------------------------;

TITLE another way, that shows proportions directly ;

PROC MEANS DATA=prostate MEAN;   /* just ask for MEAN statistic .
                                    mean of a set of 0's and 1's
                                    = the proportion of the set that = 1 */
  VAR capsule ;
  CLASS psa_cat ;   /* produces summary stats for each psa category */
  OUTPUT OUT = summary MEAN = p_pos_ve;
  /* makes a new dataset of the summary statistics and allows you to
     name the statistic.
     I decided to name the MEAN statistic p_pos_ve
     (within 8 letters, p_pos_ve = proportion_positive).
     I decided to call the new dataset summary */
RUN;

TITLE Now plot the proportions (called p_pos_ve) against psa_cat ;

PROC PLOT DATA=summary ;   /* NB: I now use my 'summary' dataset */
  PLOT p_pos_ve * psa_cat = '*' /   /* can use another symbol besides * */
       HPOS=75 VPOS=20 ;
RUN;

* -------------------------------------------------------;

TITLE1 Might be better if I plotted the proportions +ve not vs. ordered ;
TITLE2 category values 1,2,3,4,5 but vs. mean of psa values in category;

PROC MEANS DATA=prostate MEAN;   /* just ask for MEAN statistic .
                                    mean of a set of 0's and 1's is the
                                    proportion of the set where value = 1 */
  VAR capsule psa ;
  CLASS psa_cat ;
  OUTPUT OUT = summary2 MEAN = p_pos_ve mean_psa;
  /* a 'mean y' and a 'mean x' per category */
RUN;

PROC PLOT DATA=summary2 ;
  PLOT p_pos_ve * mean_psa = '*' / HPOS=75 VPOS=20 ;
RUN;

* -------------------------------------------------------;

TITLE1 Previous pattern still does not look great;
TITLE2 Might be better if plotted proportions +ve vs. MEDIANS;

PROC UNIVARIATE DATA=prostate ;
  VAR capsule psa ;
  CLASS psa_cat ;
  OUTPUT OUT = summary3 MEAN = p_pos_ve mean_psa
                        MEDIAN = anything med_psa;
  /* won't use the variable anything .. just there to placate SAS */
  /* had to put it in so that each statistic has a name for both vars.
  */
  /* ie if you ask for 3 stats for 4 VARs, supply 4 names for each stat */
RUN;

/* CLASS doesn't work in PROC UNIVARIATE, at least not in version 6,
   so have to resort to another way to get there: sort, then use BY */

PROC SORT DATA=prostate;
  BY psa_cat;
RUN;

PROC UNIVARIATE DATA=prostate ;
  BY psa_cat;
  VAR capsule psa ;
  OUTPUT OUT = summary3 MEAN = p_pos_ve mean_psa
                        MEDIAN = anything med_psa;   /* won't use anything */
RUN;

/* Good idea, whenever you create/manipulate a data set, to look at its
   contents via the viewer or, with more options, inside INSIGHT.

   You can access INSIGHT (not called such in the menu) via

      Tools Menu -> Solutions -> Analysis -> Interactive Data Analysis

   (location in menu may vary from version to version)

   It is great for checking whether your programming, recoding etc
   worked.. and you can (by clicking on the popup menu in the top left
   corner of the data window) sort the data by one or more variables,
   move variables (columns) to be close to each other, etc.
   You can use INSIGHT for some quite sophisticated analyses too..
   (some 'macho' SAS types call INSIGHT '[partial]sas for sissies' ) */

PROC PLOT DATA=summary3 ;
  PLOT p_pos_ve * med_psa = '*' / HPOS=75 VPOS=20 ;
RUN;

* ---continue on following the template for agechd example ----------;
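
* -------------------------------------------------------;

/* A minimal sketch of the value labels (PROC FORMAT) mentioned in the
   DATA step comments. The format names psacat and yesno are made up
   for illustration, and reading capsule as 1 = yes, 0 = no is an
   assumption here, so check the codes against Table 1.7 of H&L before
   relying on it. The psa_cat ranges simply repeat the boundaries used
   in the DATA step. */

PROC FORMAT;
  VALUE psacat 1 = '0 to <4'
               2 = '4 to <10'
               3 = '10 to <20'
               4 = '20 to <50'
               5 = '50 or more' ;
  VALUE yesno  0 = 'No'
               1 = 'Yes' ;
RUN;

TITLE Table of proportions positive, now with value labels ;

PROC FREQ DATA=prostate;
  TABLES capsule * psa_cat / NOROW NOPERCENT ;
  FORMAT psa_cat psacat. capsule yesno. ;   /* labels apply to this PROC only */
RUN;

/* Putting the FORMAT statement in the PROC leaves the stored values
   as the numbers 0/1 and 1-5, so the PROC MEANS trick (mean of 0's
   and 1's = proportion) still works on the same dataset. */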