McGill University, Department of Epidemiology
and Biostatistics and Statistics
513-678L: ANALYSIS OF MULTIVARIABLE DATA (June 2001)
Frequently Asked Questions
||What is meant by the prediction error of the equation that M&M refer to? The
SEE doesn't sound right, as it just refers to variability around the line, much as
a standard deviation does. The SEs of the slope and intercept represent error terms,
but is this what is meant? Or is it referring to the SE of the predicted Y? I vote
for this last one - but what do you think?
|I'm not 100% sure, but I lean towards the simple RMSE. I looked up their textbook
and I don't see any place where they explicitly introduce this terminology. The closest
I get is in section 2.3 of version 3, where they motivate least squares fits, and
define (on p 140) the difference between the observed and predicted y as 'the prediction
error'. So my guess is that they are referring to residuals.
The issue then is how to numerically summarize the distribution of the residuals:
one could use the LARGEST, the MEDIAN, the AVERAGE ABSOLUTE, or the SQUARE ROOT OF
THE AVERAGE SQUARED, residual. Mostly, we use the root mean square error, which is
effectively the standard deviation of the residuals. It is interesting that M&M
use the term 'likely' size of the error. I think the word 'likely' is their 'code'
for 'typical' or 'average' (a bit like the use of 'type' in the French 'écart
type', which is a much more expressive term than the English 'standard' deviation).
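To make the four candidate summaries concrete, here is a minimal Python sketch. The function name and the observed/predicted values are made up for illustration (they are not from M&M), and the last summary is the plain root mean square, which divides by n; the RMSE printed by regression software divides by the residual degrees of freedom (n - 2 for a straight line) instead.

```python
import math
import statistics

def residual_summaries(observed, predicted):
    """Return several one-number summaries of the prediction errors (residuals)."""
    residuals = [y - yhat for y, yhat in zip(observed, predicted)]
    abs_res = [abs(r) for r in residuals]
    return {
        "largest_abs": max(abs_res),
        "median_abs": statistics.median(abs_res),
        "mean_abs": sum(abs_res) / len(abs_res),
        # Plain root mean square: divides by n, not by n - 2
        "root_mean_square": math.sqrt(sum(r * r for r in residuals) / len(residuals)),
    }

# Hypothetical observed and fitted values, for illustration only
observed = [3.0, 5.0, 4.0, 8.0]
predicted = [2.0, 5.0, 6.0, 6.0]
summary = residual_summaries(observed, predicted)
```

The four numbers generally differ for the same set of residuals, which is why it matters which one is meant by the 'likely' size of the error.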
Interestingly, before it was christened the 'standard' deviation (by Pearson
around 1900; prior to that it was just called the 'modulus' of the curve), Galton
and others measured deviations in terms of Q, which was 1/2 the distance between
the 25th and 75th percentiles, a quantity they referred to as the 'probable error'.
One could see where they were coming from: in a symmetric distribution, half of
the absolute deviations would be bigger than Q, and 1/2 less than Q, i.e., Q would
be the MEDIAN absolute deviation, so it was a 50:50 bet (or equally probable) that
the deviation was bigger than, or less than, Q.
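A small Python sketch of this 50:50 property, using a made-up symmetric set of deviations. The function names are hypothetical, and the quartiles use the 'inclusive' interpolation rule of the standard `statistics` module; other quartile rules can give slightly different values of Q.

```python
import statistics

def probable_error(values):
    """Q: half the distance between the 25th and 75th percentiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / 2

def median_abs_deviation(values):
    """Median of the absolute deviations from the median."""
    center = statistics.median(values)
    return statistics.median(abs(v - center) for v in values)

# Hypothetical symmetric deviations: -4, -3, ..., 3, 4
deviations = list(range(-4, 5))
q = probable_error(deviations)
mad = median_abs_deviation(deviations)
```

For this symmetric example the two quantities coincide; in a skewed distribution they need not.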
The other reason I think they mean residuals (SEE or RMSE) in this context is that
they say 'State the regression equation and the likely size of prediction error associated
with that equation'. It is natural to give the 3 estimated parameters of the regression:
the 2 (b0 and b1) to specify the line, and the RMSE to specify the ('average')
spread of points about it.
BUT, I do agree with you that the SE of the predicted Y would be logical too, except
that there would be a different one for each different value of X (i.e., wider at
the edges, and narrower closer to X = xbar). So I don't think they mean this. Of
course, there is a 1:1 relation between the two, since
SE(predicted Y at X) = RMSE * Sqrt[ 1 + 1/n + (X - xbar)^2 / Sum (x_i - xbar)^2 ]
where the last term is a function of
(a) the distance of X from xbar
(b) the spread of the x's in the sample used to fit the line
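Here is a minimal Python sketch of the relation between the RMSE and the SE of a predicted individual Y in simple linear regression. The data and the function names are invented for illustration; the RMSE here uses n - 2 in the denominator, as regression software does.

```python
import math

def fit_line(x, y):
    """Least squares fit of y = b0 + b1*x; returns the pieces needed below."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)  # spread of the x's
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))  # residual degrees of freedom: n - 2
    return b0, b1, rmse, xbar, sxx

def se_prediction(x_new, n, rmse, xbar, sxx):
    """SE for predicting an individual Y at X = x_new."""
    return rmse * math.sqrt(1 + 1 / n + (x_new - xbar) ** 2 / sxx)

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1, rmse, xbar, sxx = fit_line(x, y)
se_at_mean = se_prediction(xbar, len(x), rmse, xbar, sxx)
se_at_edge = se_prediction(5, len(x), rmse, xbar, sxx)
```

The SE is smallest at X = xbar, where it is RMSE * Sqrt[1 + 1/n], and grows as X moves toward the edges, so a single RMSE is the simpler thing to quote with the equation.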
||for assignment 2...
1. How do I do transformations on the X variable?
2. For question #6 "vocabulary for 1 child" I am not able to
find the documentation for the data, and I don't understand the meaning of the year, month,
and age columns of the dataset.
|1. As I explain in q, if using insight, you can go to edit -> variables and choose log
or sqrt or whatever; it will add a new column.
Don't be fooled by the fact that all the transformations have
a (Y) as their argument: this is generic and will work for whatever variable you
select. So to take log(vocsize) you click on the name vocsize, then go to edit -> variables
and choose the function.
2. 1 year 6 months (5th line) is the vocabulary when the child is 18 months old, i.e., age
is computed from the 2 other columns.
You may have to run the program to compute it if it isn't already there. If we only
had year and month,
it would be strange to put both year and month in as predictor variables, since one
could not link the coefficients to each other!
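For those who want to see the two steps outside of insight, here is a minimal Python sketch: age in months computed from the year and month columns, and a new log(vocsize) column. The vocsize values are made up for illustration (they are not the assignment data), and the function name is hypothetical.

```python
import math

def age_in_months(year, month):
    """Combine the year and month columns into a single age in months."""
    return 12 * year + month

# Hypothetical rows (year, month, vocsize); the vocsize values are invented
rows = [(1, 6, 50), (2, 0, 300)]

table = [
    {
        "year": y,
        "month": m,
        "age": age_in_months(y, m),    # e.g. 1 year 6 months -> 18
        "log_vocsize": math.log(v),    # natural log, added as a new column
    }
    for y, m, v in rows
]
```

With a single age column, there is one coefficient to interpret rather than two that cannot be linked.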
||I am having a lot of problems with downloading the SAS files for problem 4.
|1. Right-click on the sd2 file and save it into your sasuser directory, then open it
from insight (the interactive part of SAS).
2. Open SAS; switch to your internet browser; click on the raw data file (i.e., data and SAS
pgm); select all; copy; switch back to the SAS editor window; paste; run (the icon with
the guy running);
then go to interactive to run the analyses.