Course 678 Web Page

McGill University, Department of Epidemiology and Biostatistics and Statistics
513-678L: ANALYSIS OF MULTIVARIABLE DATA (June 2001)

Frequently Asked Questions (FAQ)

date/time Question Response
June 9 What is meant by the prediction error of the equation that M&M refer to? The SEE doesn't sound right, as it just refers to variability around the line, much as a standard deviation does. The SEs of the slope and intercept represent error terms, but is this what is meant? Or is it referring to the SE of the predicted Y? I vote for this last one, but what do you think?
I'm not 100% sure, but I lean towards the simple RMSE. I looked up their textbook and I don't see any place where they explicitly introduce this terminology. The closest I get is in section 2.3 of version 3, where they motivate least squares fits, and define (on p 140) the difference between the observed and predicted y as 'the prediction error'. So my guess is that they are referring to residuals.

The issue then is how to numerically summarize the distribution of the residuals: one could use the LARGEST, the MEDIAN, the AVERAGE ABSOLUTE, or the SQUARE ROOT OF THE AVERAGE SQUARED, residual. Mostly, we use the root mean square error, which is effectively the standard deviation of the residuals. It is interesting that M&M use the term 'likely' size of the error. I think the word 'likely' is their 'code' for 'typical' or 'average' (a bit like the use of 'type' in the French 'écart type', which is a much more expressive term than the English 'standard' deviation).
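To make the four candidate summaries concrete, here is a small sketch in Python. The residuals are invented for illustration and are not from any course dataset, and the root mean square below divides by n for simplicity, whereas regression software would normally divide by n - 2 for a straight-line fit.

```python
import math
import statistics

# Hypothetical residuals (observed y minus predicted y), invented for illustration.
residuals = [-3.0, -1.0, 0.0, 1.0, 3.0]

# Sizes of the residuals, ignoring sign.
abs_res = [abs(r) for r in residuals]

largest = max(abs_res)                    # LARGEST absolute residual
median_abs = statistics.median(abs_res)   # MEDIAN absolute residual
mean_abs = statistics.fmean(abs_res)      # AVERAGE ABSOLUTE residual

# SQUARE ROOT OF THE AVERAGE SQUARED residual.
# Assumption: divides by n here; a fitted line would usually use n - 2.
squared = [r * r for r in residuals]
root_mean_square = math.sqrt(statistics.fmean(squared))

print(largest, median_abs, mean_abs, root_mean_square)
```

The four numbers answer the same question ("how big is a typical miss?") but weight the large residuals differently, which is why they need not agree.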

Interestingly, before it was christened the 'standard' deviation (by Pearson around 1900; prior to that it was just called the 'modulus' of the curve), Galton and others measured deviations in terms of Q, which was 1/2 the distance between the 25th and 75th percentiles, a quantity they referred to as the 'probable error'. One could see where they were coming from: in a symmetric distribution, half of the absolute deviations would be bigger than Q and half less than Q, i.e., Q would be the MEDIAN absolute deviation, so it was a 50:50 bet (or equally probable) that the deviation was bigger than, or less than, Q.
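A quick numerical check of the 'probable error' idea, as a sketch only. The deviations below are a made-up symmetric set, and the quartiles come from Python's `statistics.quantiles` with its default method; other quartile conventions can give slightly different values on small samples.

```python
import statistics

# Hypothetical, symmetric set of deviations, invented for illustration.
deviations = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]

# 25th, 50th and 75th percentiles (default 'exclusive' method).
q25, q50, q75 = statistics.quantiles(deviations, n=4)

# Q: half the distance between the 25th and 75th percentiles.
Q = (q75 - q25) / 2

# Median of the absolute deviations from the centre.
median_abs_dev = statistics.median(abs(d - q50) for d in deviations)

# For a Normal curve with SD = 1, Q is the 75th percentile (about 0.6745).
Q_normal = statistics.NormalDist().inv_cdf(0.75)

print(Q, median_abs_dev, round(Q_normal, 4))
```

In this symmetric example Q and the median absolute deviation coincide, which is the 50:50 bet described above.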

The other reason I think they mean residuals (SEE or RMSE) in this context is that they say 'State the regression equation and the likely size of prediction error associated with that equation'. It is natural to give the 3 estimated parameters of the regression: the 2 (b0 and b1) to specify the line, and the RMSE to specify the ('average') spread of points about it.

BUT, I do agree with you that the SE of the predicted Y would be logical too.. except that there would be a different one for each different value of X (i.e., wider at the edges, and narrower closer to X = xbar). So I don't think they mean this. Of course, there is a 1:1 relation between the two, since

SE(Ynew) = RMSE * Sqrt[ 1 + 1/n + function of
    (a) distance of X from xbar
    (b) spread of x's in sample used to fit line
]
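As a sketch of how this plays out numerically, the Python below fits a line to made-up data and evaluates the SE of a new Y at several X values. It assumes the usual textbook form for simple linear regression, in which the "function of (a) and (b)" is (X - xbar)^2 divided by the sum of squared deviations of the x's; the data and the name `se_new_y` are invented for illustration.

```python
import math

# Hypothetical (x, y) data, invented for illustration; not from the course files.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# (b) Spread of the x's in the sample used to fit the line.
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

# Least-squares slope and intercept.
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# RMSE: residual sum of squares over n - 2 degrees of freedom, then square root.
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
rmse = math.sqrt(sse / (n - 2))


def se_new_y(x_new):
    """SE of a single new Y at x_new (assumed textbook form)."""
    # (a) Distance of X from xbar enters through (x_new - xbar) ** 2.
    return rmse * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)


for x_new in (1.0, 3.0, 5.0):
    print(x_new, round(se_new_y(x_new), 3))
```

The SE is smallest at X = xbar, where it reduces to RMSE * Sqrt[1 + 1/n], and grows symmetrically towards the edges, so a single RMSE summarizes the fit while SE(Ynew) needs a separate value for each X.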
  for assignment 2...

1. How do I do transformations on the X variable?

2. For question #6 "vocabulary for 1 child" I am not able to find the documentation for the data, and I don't understand the meaning of the year, month, and age columns of the dataset.

As I explain in q, if you are using insight, you can go to edit -> variables and choose log or sqrt or whatever.. it will add a new column

don't be fooled by the fact that all the transformations have a (Y) as their argument.. this is generic and will work for whatever variable you select..

so to take log(vocsize) you click on the name vocsize then go to edit -> variables and choose the function
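For anyone working outside insight, the same idea (a transformation adds a new column next to the original) can be sketched in Python. The rows and the new column names are hypothetical, and the natural log is an assumption about which log is wanted.

```python
import math

# Hypothetical rows with a vocsize column, invented for illustration.
rows = [{"vocsize": 10.0}, {"vocsize": 100.0}, {"vocsize": 1000.0}]

for row in rows:
    # Each transformation is stored as a new column; vocsize itself is unchanged.
    row["log_vocsize"] = math.log(row["vocsize"])    # assumption: natural log
    row["sqrt_vocsize"] = math.sqrt(row["vocsize"])

for row in rows:
    print(row)
```

The same two lines work for any other variable by changing the column name, just as the (Y) in the menu stands for whichever variable is selected.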

2. 1 year 6 months (5th line) is vocab. when child is 18 mo. ie age is computed from the 2 other columns

you may have to run the program to compute it if it isn't already there; If we only had year and month it would be strange to put both year and month in as predictor variables since one could not link the coefficients to each other!
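The age calculation itself is simple enough to sketch in Python. The function name and the example rows are hypothetical, apart from the "1 year 6 months" case mentioned above.

```python
def age_in_months(year, month):
    """Combine the year and month columns into a single age in months."""
    return 12 * year + month


# (year, month) pairs; the first is the "1 year 6 months" row, the rest are made up.
records = [(1, 6), (2, 0), (2, 3)]
ages = [age_in_months(year, month) for year, month in records]

print(ages)
```

Using the single age column as the predictor gives one coefficient per month of age, instead of two separate coefficients for year and month that cannot be tied together.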
I am having a lot of problems with downloading the SAS files for problem 4.
2 options..

1. right-click on the sd2 file and save it into your sasuser directory.. then open from insight (interactive part of sas)

2. open sas; switch to internet browser; click on raw data file (ie data and sas pgm ); select all; copy; switch back to sas editor window; paste; run (the icon with the guy running..);
then go to interactive to run analyses...