Course 678: Analysis of Multivariable Data.   June 1999

Homework to be handed in by Friday June 11
*****************************************************************

Question 1
==========

KKMN Ch 11, Question 7 (p 202, 204, ... ). - parts a, b, c

Question 2
==========

For the alcohol and eye movement data ...

a  Fit the model

      Y          X
      DECREASE   Alcohol
                 Gender

b  Interpret the coefficients ... and draw a diagram
   -- for father-in-law!

   (You might find it easier if you code gender as 0 and 1,
   rather than 1 and 2... you can alter the values right in
   the data window if you wish, or you can use the method
   described in question 3 below for subtracting a constant
   from a variable)

c  Fit the model

      Y          X
      DECREASE   Alcohol
                 Gender
                 Gender*Alcohol   (select both & click "CROSS")

d  Interpret the coefficients ... and draw a diagram.

   You can get INSIGHT to plot the predicted (fitted)
   responses against alcohol.

e  In this 4-parameter model, if we were to judge by the
   4 t-ratios and their associated p-values, none of the
   4 parameters is statistically significant .. is this
   a correct interpretation? Explain.

f  For each of the 2 genders separately, regress the
   decrease on alcohol.
   (in INSIGHT, you can do this by first making gender a
   group variable .. click on the rectangle above the
   column label)

   Match up the coefficients of these two separate equations
   with the equations implied by the 1 "master" equation
   with 4 coefficients.

g  The sample size is small, and so only the male slope is
   statistically significant at the 0.05 level.

   compare:
                   females    males
      n               6         6
      slope          0.20      0.42
      SE(slope)      0.41      0.12
      RMSE          16.8      13.2     ("average" residual)

   Can you explain why the SE is so much smaller for males?
   [hint: it has something to do with gender differences in
   perceptions of when individuals felt too drunk to drive]

   If you wish, colour the observations from one of the two genders ..
   use EditMenu -> Windows -> Tools
   Click on a color and select say gender = 1

h  Fit the model

      Y          X
      DECREASE   Alcohol
                 Gender
                 Gender*Alcohol

   but with the "intercept" turned off.

   Draw the lines implied by the fitted equation.

   Rewrite the 4-parameter equation as

      DECREASE = (B0 + B2.Gender) + (B1 + B3.Gender)*Alcohol

Question 3
==========

For the following analyses, use the data on GIRLS in the
intervention and 4 BORDER MUNICIPALITIES in the article
"The Lidkoping Accident Prevention Programme -- a community
approach to preventing childhood injuries in Sweden" by
Svanstrom L et al (Injury Prevention 1995 1: 169-172).
(under datasets on class web page)

Download the dataset, say to your "a" drive

Paste the following SAS program into the Program Editor

   DATA sasuser.ldkoping;
   INFILE 'a:\ldkoping.dat' ;
   INPUT year rate pop number gender area;
   /* keep only the two areas used in this question */
   IF area = 0 or area = 1;
   RUN;

and click on the "run" icon to produce and save a sas file
called "ldkoping". The log should then report...

   "SASUSER.LDKOPING has 36 observations & 6 variables."

In INSIGHT, extract a dataset of observations on girls..

To do so...

   Edit menu -> Observations -> Find...
   Click on GENDER
   Click on =
   Click on 0
   Click on OK
   On the little triangle at top left corner of data window
      Extract

If you wish, you can now close the data window with the
36 observations, and restrict attention to the one with 18.

[You could also do the selection in the SAS program Editor,
just as we did with selecting fruitflies with 1 partner]

a  For each of the 2 areas separately, regress the rate on year.
   (in INSIGHT, you can do this by first making area a
   group variable .. click on the rectangle above the
   column label)

   Refer to Table 2 of the article.

   - Verify the "beta's" of -0.3 and 0.2 for the two areas.

   - Divide them by the average rates to get a "%change per year".
b  Interpret the "INTERCEPT" values in your two regressions,
   when using year as "Anno Domini"

c  Change year from "Anno Domini" to "year of program"

      Edit menu -> Variables ->
      Click YEAR into the Y box
      Click on the "a + b*Y" transformation
      Set a to -1983 and b to 1
      Click on OK

   This should produce a new year ("A_YEAR") that starts at 0
   in 1983 (If you like, you can double click on the name
   A_YEAR and change it to something more meaningful, like
   Pgm_Year, short for "Program_Year")

   [ You can SKIP this note for now if you wish                ]
   [ If you wish, you can create this variable directly in the ]
   [ SAS program you ran from the program Editor...            ]
   [                                                           ]
   [ "Recall text" (in LOCALS menu in SAS Editor)              ]
   [ Insert the line (including the semi-colon!!)              ]
   [                                                           ]
   [     Pgm_Year = year - 1983;                               ]
   [                                                           ]
   [ after the line with the INPUT statement                   ]
   [ and before the RUN statement                              ]
   [                                                           ]
   [ click on the "run" icon to recreate and save sas file     ]
   [ called "ldkoping". Provided you do not have the sasfile   ]
   [ open in INSIGHT, it will just "write over" the old one.   ]

   Re-run the 2 regressions using the new "year", and
   re-interpret the coefficients.

   Compare, and comment on, the SE's of the intercepts in the
   models using the "Y-almost-2K" and the "Y-starting in 1983"
   versions of year.

d  Why would one switch to the new variable,
   Pgm_Year = year - 1983 ?

e  Use a t-test to formally test the between-area difference
   in the annual change in incidence

   (to save you time: sqrt[0.29*0.29 + 0.32*0.32] = 0.43 )

f  Remove the "Group" designation from area

   Fit a single regression equation to all 18 observations...

      Y       X
      RATE    Pgm_Year
              Area
              Area*Pgm_Year

   (you put in this product by selecting both and
   clicking "CROSS")

g  Use the 4 parameter estimates to recreate the equations
   of the 2 fitted lines

   Draw them in on the scatterplot of rate vs Pgm_Year, or...

   Have INSIGHT make a scatterplot of the predicted rates
   versus the Pgm_Year

   If you wish, colour the observations from one of the areas ..
   use EditMenu -> Windows -> Tools
   Click on a color and select say area = 1

h  Interpret the coefficient associated with the product-term
   Area*Pgm_Year.

   Show that it agrees exactly with the results obtained by
   fitting two separate lines in part a.

   Use the SE associated with the Area*Pgm_Year coefficient
   to formally test the observed between-area difference in
   the annual change in incidence.

========================

For later...

By restricting our observations to girls, we have less than
half the "effective sample size" we could have.

So, go back to the full dataset ... and re-do the analysis
using all 36 observations.

How come the SE of the estimate of interest ("difference of
slopes") becomes bigger, rather than smaller, when using
more data?

How might we overcome this ?

THINK: "Eliminate (take OUT) noise" by putting IN variables
responsible for the noise
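
If you would rather do this first re-fit from the Program
Editor than from INSIGHT, the following is a minimal sketch
only. It assumes the sas file SASUSER.LDKOPING from Question 3
is still on your machine; the dataset name "lid36" is made up
for this illustration. It fits the same 4-parameter model as
in part f of Question 3, with nothing yet put IN to take the
extra noise OUT.

```sas
/* Sketch only: "lid36" is a hypothetical dataset name.        */
DATA lid36;
   SET sasuser.ldkoping;       /* all 36 observations          */
   Pgm_Year = year - 1983;     /* same recoding as in part c   */
RUN;

/* Same model as in part f, but boys and girls together.       */
/* No CLASS statement, so area is treated as a 0/1 number and  */
/* area*Pgm_Year is simply the product of the two columns.     */
PROC GLM DATA=lid36;
   MODEL rate = Pgm_Year area area*Pgm_Year / SOLUTION;
RUN;
QUIT;
```

The estimate and SE reported for the product term are the
ones to set beside the girls-only results from part h.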