/*  for Stata */

/* given the size of the raw data file, it is better to keep
   the data separate from the program, so read it with INFILE
   rather than typing it inline with INPUT

   so download the chdage.dat file, save it somewhere
   on the hard disk (remember the path!), then have the
   INFILE statement point to it, i.e. give the full path

   e.g. if you store the .dat file in sub-directory or folder
        called c:\681folder\ , path would be "c:\681folder\chdage.dat"

*/

* switch to the working directory, where the data are stored
* change this to your path [on Windows, use backslash \ rather than : ]
* e.g. cd "c:\681stuff\alr_1\"

cd ":Macintosh HD:User:dad:courses:681:alr_1:"

clear

infile id age chd using chdage.dat
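
* quick sanity check on what was read in (an added sketch, not part of
* the original program): it assumes chd is coded 0/1, which the later
* use of mean(chd) as a proportion implies
describe
summarize id age chd
assert chd == 0 | chd == 1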

* create age categories 

gen age_mid = age
 
recode age_mid 20/29=25 30/34=32 35/39=37 40/44=42 45/49=47 50/54=52 55/59=57 60/69=65
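
* optional check of the recode (an added sketch): for each age_mid,
* show how many people fall in the category and the range of raw ages
tabstat age, by(age_mid) statistics(n min max)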

save chdage, replace

* -------------------------------------------------------

* plot the raw data

graph chd age
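
* note (added, assuming Stata 8 or later): the old  graph y x  syntax
* used in this file was replaced by twoway, so there the scatterplot
* of chd against age would be written
*
*   twoway scatter chd age
*
* and the other graph commands below would change in the same way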

* ------------------------------------------------------- 

* Table 1.2: CHD vs categorised ages

tabulate age_mid  chd,  row 

* ------------------------------------------------------- 

* make means (proportions) of chd by age_mid
* save into new file called prevalences (say) 

collapse (mean) chd , by(age_mid)

save prevalences, replace
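
* show the collapsed file (an added sketch): one row per age category,
* with chd now holding the proportion with CHD in that category
list age_mid chd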

* ------------------------------------------------------- 

* plot the prevalences  with age categorized 

graph chd   age_mid 

* ------------------------------------------------------- 

* bring back the full data

use chdage, clear

* fit logistic regression as a generalized linear model,
* supplying a binomial denominator of 1 for each person

glm chd  age , family(binomial 1) link(logit)

* add fitted prevalence 

predict fitted_p 
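
* what predict computes here (an added sketch): with the logit link the
* fitted prevalence is exp(xb)/(1 + exp(xb)), with xb = _b[_cons] + _b[age] x age
* xb_tmp and p_byhand are illustrative names, dropped again after the check
gen double xb_tmp = _b[_cons] + _b[age]*age
gen double p_byhand = exp(xb_tmp)/(1 + exp(xb_tmp))
assert abs(p_byhand - fitted_p) < 1e-5
drop xb_tmp p_byhand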

* ------------------------------------------------------- 

* fit logistic regression with the dedicated logistic command
* and create fitted values

* logistic does not display betas, only odds ratios,
* so type logit after estimation to get the coefficients

logistic chd age
logit 
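
* link between the two displays (an added sketch): the odds ratio
* reported by logistic is exp(beta), and for a 10-year age difference
* it is exp(10 x beta); the 10-year contrast is only an illustration
display "OR per 1 year of age   = " exp(_b[age])
display "OR per 10 years of age = " exp(10*_b[age])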

predict fitted
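
* the glm and logistic fits should agree up to convergence tolerance
* (an added check, with diff_fit as an illustrative name that is
* dropped again so the saved data are unchanged)
gen double diff_fit = abs(fitted - fitted_p)
summarize diff_fit
drop diff_fit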

* -------------------------------------------------------

* plot the prevalences fitted by logistic 

graph fitted  age


* save data and fitted values 

save chdage, replace

* -------------------------------------------------------

* combine fitted points for smooth curve  with observed prevalences

clear
use chdage
sort age_mid
save chdage, replace 
clear
use prevalences
sort age_mid
merge age_mid using chdage 
tabulate _merge
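
* every age category in prevalences comes from chdage, so all
* observations should be matched (an added check)
assert _merge == 3

* note (added, assuming Stata 11 or later): the merge above uses the
* old syntax, which needs both files sorted; the newer form would read
*
*   merge 1:m age_mid using chdage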

* ---------------------------- 

* overlay the observed and fitted prevalences

graph chd fitted  age if age == age_mid