/*  --- for Stata ---  */

/* Given the size of the raw data file, it is better to keep the
   data separate from the program, so use infile rather than input.

   Download the pros.dat file and save it somewhere on the hard disk
   (remember the path!)

   e.g. if you store the .dat file in a sub-directory or folder
        called c:\681folder\ , the path would be "c:\681folder\pros.dat"

*/

/* RANT by JH re "dummy" variables   (not particularly here , but in future)

  JH's suggestion for yes no data .. or other binary (dichotomous) vars.

  USE an INDICATOR variable DIRECTLY..
    where 1 = (<<indicates>>) the PRESENCE of the condition/state/trait
          0 = (<<indicates>>) the ABSENCE  of the condition/state/trait

  e.g. INSTEAD OF naming a variable 'SEX' and then trying to decide
       (or later, remember) which code you used for male and which female

       WHY NOT  call the variable  I_male  0 = no ,  ie female
                                           1 = yes , ie   male  ?

       This admittedly asymmetric way to choose which state/trait to name
       does however mean that you won't ever have to look it up later..

       By the way, in better families we do not refer to these <<indicator>>
       variables as "DUMMY" variables

       [I did have a major mis-communication with someone else about
       my use of 'indicator' .. this person was using 'indicator'
       the way people do when they speak of (say) economic 'indicators' ]
*/
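
/* A minimal sketch of the indicator idea above. The variable names
   sex and I_male, and the codes 1 = male , 2 = female, are hypothetical;
   nothing in pros.dat is coded this way. It is left commented out so
   that it does not run as part of this program.

   gen byte I_male = (sex == 1) if sex != .
   tabulate sex I_male, missing

   The "if sex != ." keeps a missing sex as a missing I_male,
   rather than silently turning it into a 0.
*/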

* switch to working directory, where pros.dat file is stored
* change this to your path [on Windows, use backslash \ rather than : ]
* e.g. cd "c:\681stuff\alr_1\"

cd ":Macintosh HD:User:dad:courses:681:alr_1:"

clear

infile id capsule age race dpros dcaps psa vol gleason using pros.dat
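
* A quick check that infile read the file as intended (a sketch; the
* commands are standard, but what counts as "right" depends on your
* copy of pros.dat): describe should list the 9 variables named
* above, and summarize should show no unexpected missing values.

describe
summarize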

* create PSA categories   

gen psa_mid = psa
 
recode psa_mid 0/2.4=1.2 2.5/4.4= 3.5 

********  USER to recode the rest of the categories ******
 
save pros, replace

* -------------------------------------------------------

* plot the Raw data 

* ******** rest of the commands below were cut/pasted from chdage example *****
* ******** USER must change them to match the psa variables               *****

graph chd age


* ------------------------------------------------------- 


* Table 1.2: CHD vs categorised ages

tabulate age_mid  chd,  row 

* ------------------------------------------------------- 

* make means (proportions) of chd by age_mid
* save into new file called prevalences (say) 

collapse (mean) chd , by(age_mid)

save prevalences, replace
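
* A quick look at what collapse produced (a sketch): one row per
* age_mid category, with chd now holding the proportion with CHD
* in that category rather than the individual 0/1 values.

list age_mid chd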

* ------------------------------------------------------- 

* plot the prevalences  with age categorized 

graph chd   age_mid 

* ------------------------------------------------------- 

* bring back the full data

clear
use chdage

* fit Logistic regression by Generalized linear model
* (supply a binomial denominator of 1 for each person)

glm chd  age , family(binomial 1) link(logit)

* add fitted prevalence 

predict fitted_p 
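
* A sketch of what predict computes here, assuming the model above
* has just been fitted: with the logit link the fitted prevalence is
* exp(xb)/(1 + exp(xb)), where xb = _b[_cons] + _b[age]*age.
* The names xb_check and p_check are made up for this check and are
* dropped afterwards; the two summaries should agree.

gen xb_check = _b[_cons] + _b[age]*age
gen p_check  = exp(xb_check) / (1 + exp(xb_check))
summarize fitted_p p_check
drop xb_check p_check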

* ------------------------------------------------------- 

* fit Logistic regression by special program for logistic
* and create fitted value

* logistic doesn't give betas, only odds ratios,
* so type logit after estimation to get the coefficients

logistic chd age
logit 

predict fitted
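
* A sketch linking the two displays: each odds ratio reported by
* logistic is exp() of the corresponding coefficient shown by logit.
* The fitted values should also agree with those from glm above,
* up to rounding; the name diff_check is made up for this check.

display "odds ratio for age = " exp(_b[age])
gen diff_check = abs(fitted - fitted_p)
summarize diff_check
drop diff_check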

* -------------------------------------------------------

* plot the prevalences fitted by logistic 

graph fitted  age


* save data and fitted values 

save chdage, replace

* -------------------------------------------------------

* combine fitted points for smooth curve  with observed prevalences

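* sort the full data by age_mid and re-save, so it can be merged
clear
use chdage
sort age_mid
save chdage, replace 

* sort the prevalences the same way, then merge in the full data;
* chd from prevalences (the observed proportion) is kept, and
* age and fitted come in from chdage
clear
use prevalences
sort age_mid
merge age_mid using chdage 
tabulate _merge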
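
* A sketch of a check on the merge: every row should have matched
* (_merge == 3), so this count should be 0. A non-zero count would
* mean some age_mid value appears in only one of the two files.

count if _merge != 3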

* ---------------------------- 

* Overlay the observed and fitted prevalences

graph chd fitted  age if age == age_mid