Lawrence Joseph
Bayesian statistics

Please Note: Prof. Joseph has retired.
These pages are left up in case they prove
useful, but the pages and software will
no longer be updated. All material and
software is "as is" with no guarantees
of functionality or correctness.
Basic info Course outline
Frequently Asked Questions

1. What does this error message mean in R: The following object(s) are masked from ...
This means that you have defined a variable twice with the same name. You usually notice this when you gave a variable whose name is used often, like "age", already defined in R, and then you attach a data frame that has a variable with the same name inside of it.

One way to avoid this problem is to remove the extra variable, using

> rm(age)

and then attach your data frame.
2. What is the difference between SD and SE, and how do these quantities relate to Bayesian prior and posterior distributions?
I will explain this issue by copying here a message received via email and my answer.

>  I want to clarify my understanding of setting up a prior for an Inference for a single normal mean. When we specify the prior, we use the notation N(mean, sd or variance or SE), but I'm never sure if I should be stating the standard deviation or the variance or the standard error.

In frequentist analyses, SDs are used for population description purposes, i.e., to describe how one person varies from another on the measure of interest and within the population of interest. SEs are used to estimate how accurately a model parameter has been estimated. So, you might have SD = 10 to describe how blood pressure varies from one person to the next in a population, and then SE = 1, for example, once a sample has been collected, representing how accurately we now know the mean value.

Note that the SD is a population characteristic, and stays constant regardless of sample size, while the SE depends on the sample size. In general, the larger the sample size the smaller the SE.

When I used to teach 607, I handed out an article on this topic each year, and if you want to read more, you can find it on page 106 (or 109 of the pdf file) of my 607 course notes, which are here:

607 course notes

All of the above is for frequentist analyses. Bayesian analysis is quite different, since the term SE is never used, only SD, which is used for all variability related terms, and always represents the variability in whatever distribution is under discussion. So, if one is discussing a distribution representing person to person variability in a population, then the SD represents that variability. If one is discussing a prior distribution for a mean parameter, then the SD of that prior represents how accurately the researcher knows that mean value, before analyzing the data. If one is discussing the SD from the posterior distribution, then the SD represents how accurately one knows the mean value after analyzing the data.

So, to draw an analogy between frequentist and Bayesian analysis, the SD from the freq viewpoint is the same as the Bayesian SD when discussing person to person variability in a population. The SE from a freq viewpoint is similar to the Bayesian SD for the posterior distribution (but will be numerically similar only if little prior information is used). There is of course no freq equivalent to the Bayesian prior SD, as freqs ignore prior information.

>  Also, is this normal distribution we specify the distribution of where we think the mean value lies, or where we think and individual's value is likely to be? If it is a distribution for the mean, we would use the SE, but if it is a distribution for an individual's possible value, we would use either sd or variance, correct?

Yes, I think you have the right idea, but see above explanations, as Bayesians would not call it the SE. If setting up a prior distribution for the mean, the SD you need to represent your uncertainty about the mean value, and if discussing person to person variability, you would use the SD that represents population variability.

>  Also, I'm confused about generally the same issue for the posterior distribution... is the posterior distribution a probability distribution for where the mean lies, or the probability distribution of where an idividual's value may lie... and therefore what is the second parameter specified in the posterior notation N(71.69, 2.68) for example?

Your first guess is correct, the posterior distribution is a probability distribution for where the mean lies, after analyzing your data. So, in the above example, if 2.68 is the posterior variance, then sqrt(2.68) = 1.64 is the posterior SD, which represents how accurately the mean is known after data analysis.

>  My thoughts are that the prior is a distribution for the possible values for an individual, and therefore the second parameter should be the sd or variance,

No, this is not correct, as per above discussion. The prior is a distribution representing your knowledge about the mean parameter value.

>  and the posterior distribution is the probability distribution for the mean, and therefore the second parameter is the SE, but I would really like to hear your clarification on this.

Yes, that is correct, but, again, Bayesians would not call it the SE, and it would not be the same numerical value as the freq SE unless lile prior info is used.

One last point: Bayesians never need to use SE = SD/sqrt(n), since the SD from the posterior distribution already incorporates information from the data, so the sample size is "built-in" already.
3. Is there a way to get Odds Ratios from the output of the bic.glm program?
Yes. Suppose you have saved your results as something like:

output <- bic.glm(y ~ x, data =, family = "binomial")

Then typing

> exp(output$mle)

will give you the Odds Ratios from all coefficients from all models and typing

> exp(output$mle - 1.96*output$se)


> exp(output$mle + 1.96*output$se)

will give you the lower and upper 95% CI limits for the Odds Ratios, respectively.
4. Why does the AIC not depend on the scale of the Y variable used?
5. How do I create a data set that includes complete cases (i.e., cases with missing data are deleted) in R? I need this because some routines will not run with missing data.
It is extremely easy to create a data set of complete cases in R. Suppose you have a data frame called x that has some variables with missing items. Then run these commands:

> ok <- complete.cases(x)
> <- x[ok,] is now a data frame of complete cases only. The names "ok" and "" are of course arbitrary.
6. How can I create subsets of data sets when there are missing values?
Here are some examples (with thanks to Dominic Comtois):

> a <- c(1,0,3,NA,0,NA) # create a vector of some numbers and missing values

> # returns a logical value for every item in "a"


> as.numeric( # convert logical values to numerical if needed

[1] 0 0 0 1 0 1

> a[!] # subset of non-missing values of "a"

[1] 1 0 3 0

> which( # indexes for missing values of "a"

[1] 4 6

> which(! # indexes for non-missing values of "a"

[1] 1 2 3 5
7. We have learned about multiple imputations in WinBUGS, but can this be done in R?
Yes. The R package "mice" can be used to create multiple imputations in R. It is available in the usual way R packages are downloaded.
8. How do I recode data in R? For example, I have a variable that is categorical, but I want to convert it to a dichotomous variable.
Download the CAR package for R, and use the recode function. For example, suppose you have a categorical variable called dose with nine categories that you want to dichotomize. Suppose that it is now coded as 1, 2, 3, ..., 9, and you want to dichotomize at five, such that 1 through 5 are coded as 0, and the rest as 1. You can run this command:

dose.dichot <- recode(dose, "c(1,2,3,4,5)='0'; else='1'")

Your new variable dose.dichot will now be dichotomized.
9. This course comes out strongly against the use of frequentist hypothesis testing for making clinical decisions. Is this a minority opinion?
I have not taken any formal surveys. Nevertheless, the disadvantages of p-values have been discussed almost since they were proposed. There are literally hundreds of articles in the statistical and medical literature explaining the disadvantages of p-values, and expressing preferences for confidence intervals and/or Bayesian methods. As discussed in class (see page 13 of notes from January 6th lecture), major journals including the leading journal in our field, Epidemiology virtually ban the use of p-values or even discussions based on "significance" within their pages.

Check out these web pages for many references concerning p-values:

Here is a small sampling of other articles (and check their reference lists for many, many others):

Malakoff D. Bayes offers a ’new’ way to make sense of numbers. Science. 1999 Nov 19;286(5444):1460-4. (Review from Science about the rise of Bayesian analysis in Science.)

Dunson DB. Practical advantages of Bayesian analysis of epidemiologic data. American Journal of Epidemiology 2001 Jun 15;153(12):1222-6.

Goodman SN. Of P-values and Bayes: A modest proposal. Epidemiology. 2001 May;12(3):295-297.

Evans J, Mills P, Dawson.J. The end of the p-value? British Heart Journal 1988; 60:177-180.

Altman DG. Why we need confidence intervals. World J Surgery 2005;29(5):554-556.

Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J. 1986 Mar 15;292(6522):746-750.

Altman DG, Gardner MJ. Calculating confidence intervals for regression and correlation. Br Med J. 1988 Apr 30;296(6631):1238-1242.

Gardner MJ, Altman DG. Using confidence intervals. Lancet. 1987 Mar 28;1(8535):746.

Taken together, you can see that the major scientific journals such as Science, as well as the two most respected epidemiology journals, including American Journal of Epidemiology and Epidemiology and major medical journals such as BMJ and Lancet, have published editorials encouraging methods other than frequentist hypothesis testing. As one last example, I have published an article in JAMA with Dr. James Brophy, another professor in our Department, comparing Bayesian credible intervals to frequentist p-values for clinical decision making. It summarizes many of the arguments made in our 621 course about why p-values are not useful for making any important clinical decisions.

Brophy JM, Joseph L. Placing trials in context using Bayesian analysis: GUSTO revisited by Reverend Bayes. Journal of the American Medical Association 1995;273(11):871-875.

If you search the web, you will find many other examples, as well as lecture notes from courses at other universities that contain similar material.
10. How do I remove all objects I have created in R?
Simply type the folowing at the R prompt:

> rm(list=ls())
11. I am confused about confounding. We have seen various methods to look for confounding, for example checking the basic conditions for confounding, looking at the difference in estimates in going from a univariate to a multivariate model, and using the $mle and $se output from the bic programs. How do I put all of this information together?
If you do not have access to the nicely formatted output provided by the BIC programs we have covered in class, then you need to run models "manually", one by one. While nothing stops you from looking at models other than univariate or full multivariate, it gets unweildy rather quickly if you have more than a few variables to consider. So, one often may just look at changes in coefficients between univariate and full multivariate models, and make do with that. What would be missing is extra information from the "intermediate" models, between univariate and multivariate, that may help to decipher which variables act as confounders for other variables. Sometimes one may want this level of detail, other times maybe not.

Which variables to leave in a "final model" depends on the purpose of the regression. If you are trying to create a model optimized for making future predictions, then using the model average as output by the BIC is optimal under certain conditions (see paper by Raftery for details).

If the purpose is to investigate the effects of one or more main variables while adjusting for confounding from other variables, then conclusions are drawn from running a series of models, such as those output by the BIC programs, or, in the absence of that, run manually one by one. For this purpose, there is no need to discuss only a single final model, because one can draw conclusions from the ensemble of models run.

Having said that, sometimes for reporting purposes, you might want to present a "final model". For example, most journals will not give you unlimited space to publish results from a series of models just to discuss confounding. If you are working under such restrictions, then you can start by learning about confounding by running as many models as you need to run. Then, after drawing conclusions, present a model that includes all variables that are important to the "story" you want to tell. So, here you would include all variables of primary interest, and all variables that you have concluded are confounders of your main variables. Also, you might include any other variables that have important effects. All other variables can then be omitted.
12. I have an ordered categorical variable. When modeling, should I treat it as a continuous variable, or as categorical?
It is sometimes better to treat ordered categorical variables as continuous, sometimes not.

If you insert them into a model as a continuous variable, it implies that for each unit increase, the effect on the outcome is the same. In your example, it implies that a switch from category 1 to category 2 has the same effect as switching from category4 to category 5.

If that seems reasonable, then you can leave them as continous. If that assumption is likely to be false, then you should switch to using a categorical variable, which will then estimate an effect for each level, so that the effects can differ depending on where they are in the order.

Two more points:

1. It may sometimes not be clear whether the assumption is likely to hold or not. In this case, you can plot the outcomes versus each category and see if there is a roughly linear trend. If yes, then leave as a continous variable, if not, switch to categorical.

2. If you have a very large number of categories but treating as a continuous variable is not reasonable, then you may want to "collapse" similar categories so that the total number of categories is decreased. Again, you can decide which categories are similar enough to collapse together by looking at a graph as suggested in point 1 above.