Lawrence Joseph
Bayesian statistics
|
|
Please Note: Prof. Joseph has retired. These pages are left up in case they prove useful, but the pages and software will no longer be updated. All material and software is "as is" with no guarantees of functionality or correctness. |
|
mpcpn / mpcpp
A Bayesian approach to multiple-path change-point analysis using the gibbs sampler for Normal and Poisson data
Summary
The set of procedures to perform multiple-path change-point analyses
using a Bayesian approach were written in FORTRAN 77 by Lawrence Joseph,
Roxane du Berger and David Wolfson and developed in a SunOS 4.1.2
operating environment on a SPARC station ELC. The method
was designed for two distributions: Normal and Poisson, each
packaged separately for purposes of compilation and linking.
The multiple-path change-point programs are referred to as
mpcpn for the Normal distribution and mpcpp for the Poisson.
Both programs share the same spirit with differences found in the
prior input. Prior information necessary to run the programs is
detailed further in this document.
Users are invited to consult the following manuscripts which discuss in full detail Bayesian approaches to change-point analysis:
Queries, questions or comments regarding the above manuscripts or programs may be e-mailed to Lawrence Joseph at lawrence.joseph@mcgill.ca . In this paper, we focus our attention on mpcpn and where relevant, will note any differences with mpcpp. The successful programming of the algorithms is largely due to the following sources for which we are thankful: Abramowitz, M., and Stegun, I.A. (ed.) (1972) Handbook of Mathematical Functions New York: Dover Publications Inc. Files
Once you open the package mpcpn.doc (Normal version) or
mpcpp.doc (Poisson version) with a shar program, you will
note the presence of the following files:
Ex.data README.MPCP itp.f postbet.f util.f Ex.input gen.f mpcp.f setup.f Makefile header.f post.f upd.f Makefile
For convenience, we have provided a Makefile to compile
and link the FORTRAN code into one executable file. The
command make works on a Sparc station. If you encounter
unnecessary difficulties, compile and link as follows:
If the user chose the set of procedures for the Normal distribution, type the command f77 -o mpcpn *.fIf the user chose the set of procedures for the Poisson distribution, type the command f77 -o mpcpp *.f Examination of parameter file
As stated earlier, the multiple path change-point analysis
requires as input a variety of information. For a detailed
discussion of this approach, users are invited to consult
the manuscripts referenced above.
We have included in this package a parameter file called Ex.input and an associated data file Ex.data to illustrate a typical analysis. In this section, we highlight inherent differences between the Normal (mpcpn) and Poisson (mpcpp) distributions. MPCP - Normal distribution (mpcpn)
The file Ex.data is a data set in the form of a
37 row by 10 column matrix, representing 37 different
sequences of 10 repeated measurements. For this data,
we provide in Ex.input the necessary parameter
inputs, which we examine below:
%cat -n Ex.input 1 37 10 530 !m rows, n columns, seed 2 10000 7000 !# cycles, last cycles retained 3 1000 1 500 !screenview,filedumps,maxfsize(kbytes) 4 f -99999 !missing (logical) + (-BIG) indicator 5 f 99999 !sample size for each row + (BIG) indicator 6 13 3 2 3 32 23 10 20 22 25 !row indices = min(10,m) 7 10 4 3 1 5 6 7 8 9 9 !column indices = min(10,n) 8 mp.g1 !Gamma parameter updates - distribution 1 9 mp.g2 !Gamma parameter updates - distribution 2 10 mp.n1 !Normal parameter updates - distribution 1 11 mp.n2 !Normal parameter updates - distribution 2 12 mp.al !Dirichlet parameter updates 13 mp.ad !avg (iterations) dirichlet updates 14 mp.ga !avg (iterations) E(X),X~Gamma 15 mp.am !avg E(X2)-E(X1),E(X)=updated Normal mean 16 mp.ap !Pr(no change) 17 16 16 !m prior gamma shapes 18 16 16 : : 52 16 16 53 16 16 54 .0025 .0025 !m prior gamma scales 55 .0025 .0025 : : 89 .0025 .0025 90 .0025 .0025 91 115 115 115 115 115 115 115 115 115 115 92 115 115 115 115 115 115 115 115 115 115 93 115 115 115 115 115 115 115 115 115 115 94 115 115 115 115 115 115 115 !m exp mean dist1 95 115 115 115 115 115 115 115 115 115 115 96 115 115 115 115 115 115 115 115 115 115 97 115 115 115 115 115 115 115 115 115 115 98 115 115 115 115 115 115 115 !m exp mean dist2 99 .25 .25 !omega 1 and 2 100 .0 .0 .0 .1 .1 .1 .1 .1 .1 .1 !n dirichlet parameters
MPCP - Poisson distribution (mpcpp)
The parameter file for the Poisson is simpler to define than
the Normal. Much of the information required by the Poisson is
similar to that given above, with fewer output files to declare
and fewer prior parameters to choose from.
The file Ex.data is a data set in the form of a 285 row by 8 column matrix, representing 285 different sequences of 8 repeated measurements. For this data, we provide in Ex.input the necessary parameter inputs, which we examine below: %cat -n Ex.input 1 285 8 582 !m rows, n columns, seed 2 10000 3000 !# cycles, last cycles retained 3 5000 1 500 !screenview,filedumps,maxfsize(kbytes) 4 t -99 !missing(logical) + (-BIG) indicator 5 f 110 !sample size for each row + (BIG) indicator 6 246 45 46 76 98 32 54 18 222 271 !row indices = min(10,m) 7 8 4 7 6 5 3 1 1 10 8 !column indices = min(10,n) 8 mp.g1 !Gamma parameter updates - distribution 1 9 mp.g2 !Gamma parameter updates - distribution 2 10 mp.al !Dirichlet parameter updates 11 mp.ad !avg (iterations) dirichlet updates 12 mp.ga !avg (iterations) E(X),X~Gamma 13 mp.am !avg E(X2)-E(X1), X~Gamma 14 mp.ap !Pr(no change) 15 1.00000 1.00000 !m prior Gamma shapes 16 1.00000 1.00000 : : 300 15.0000 15.0000 !m prior Gamma scales 301 15.0000 15.0000 : : 583 15.0000 15.0000 584 15.0000 15.0000 585 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 !n dirichlet parameters
Example of execution
Both programs require input files mpc.dat and mpc.par to read
the data matrix and the prior parameters, respectively. Thus, to
execute, type
% cat Ex.input >! mpc.par #required parameter file name % cat Ex.data >! mpc.dat #required data file nameIf the input files are created for the Normal distribution, type % mpcpnIf the input files are created for the Poisson distribution, type % mpcpp The program reads the contents of mpc.par and performs a number of checks. It creates dump files dump.dat, dump.par and, if missing data are present dump.mis, which the user can scan in order to ensure that all data and parameters are properly read. Further, during execution, messages appear on the screen, which inform the user on the program status. We feel obliged to print a large number of messages because of the variety of problems which can occur. Of course, errors are still possible and the user is encouraged to begin execution with a small number of cycles, to inspect all messages and dump files, before proceeding with a more reasonable number of cycles. Output description
As noted earlier, the Normal version (mpcpn) produces a total of
nine files and the Poisson (mpcpp) seven. The first five
are parameter updates at selected cycles for
1a) Gamma before change-point, 2a) Gamma after change-point,
3a) Normal before change-point (not applicable to Poisson),
4a) Normal after change-point (not applicable to Poisson)
and 5a) Dirichlet. Files 1a) to 4a) are updates at selected rows and
5a) at selected columns. The following four files are
information averaged over selected cycles: 1b) Dirichlet parameter
updates at each column, 2b) Expected values of Gamma parameter
updates before and after a row change-point, 3b) Difference
between two expected values before and after a row change-point,
where the direction is E(X2)-E(X1), and 4b) The probability of
no change, a Multinomial probability assigned to the last
column generated from Dirichlet parameter updates. This last file
is most useful at comparing the probability of no change in one
group to the probability of no change in another group.
We warn users that the program imposes size restraints on files 1a) to 5a), because they potentially contain enormous amounts of data. The parameter file mpc.par requires a row/column index selection (ten row/column indices for data containing more row/columns, less than ten for data containing fewer row/columns) and a maximum output file size (in kbytes). If the Gibbs sampler generates a reasonable amount of data (i.e. less than the specified size limit), information is provided for each of the selected row/column indices. However, exceeding this limit will cause the program to output only one row/column or even nothing. Obviously, files 1b) to 4b) are unaffected because they contain row/column/cycle summaries. The graphs displayed in the referenced manuscripts are constructed with the information contained in the output files. In the following paragraphs, we discuss how to create graphics of interest, with the help of a graphical tool such as Splus (version 3.1). We assume that the program has successfully executed and produced the following files:
In Splus, we read some preliminary information, such as the number of rows <nrowd> and number of columns <ncold> in data matrix, the Dirichlet priors <diric>, the number of rows and columns that are effectively output in the iterative files <nrowi, ncoli>. This information will be useful later in this document. # Number of rows and columns in data matrix nrowd _ as.numeric(unix("head -1 mpc.par | awk '{print $1}'")) ncold _ as.numeric(unix("head -1 mpc.par | awk '{print $2}'")) # Dirichlet parameter priors unix("egrep '!n dirichlet' mpc.par | awk '{print}' > tmp") diric _ scan(file='tmp',n=ncold) ; unix("rm tmp") # Number of rows and columns output to iterative files # (assuming they exist) unix("dat1=`egrep 'Gamma parameter updates - distribution 1' mpc.par | \ awk '{print $1}'` ; tail -1 $dat1 > tmp") nrowi _ length(scan(file='tmp'))/3 ; unix("rm tmp") unix("dat1=`egrep 'Dirichlet parameter' mpc.par | \ awk '{print $1}'`; tail -1 $dat1 > tmp") ncoli _ length(scan(file='tmp')) ; unix("rm tmp") GRAPH: Means of the marginal Dirichlet posterior distribution for the change point
Shown in the form of a bar graph, the heights of the bars
represent point estimates of the probability that a change
occurs in a given period. In this representation, the
vector file mp.ad is much more useful than the matrix
file mp.al. The latter, if space allows, contains cycled
Dirichlet parameter updates, however the column order depends
on the specified column index selection. On the other hand,
the elements of mp.ad are always based on the natural
order of the data matrix columns. To illustrate, we issue
the following Splus commands:
labdir _ c('mp.al') colsel _ scan(file=labdir,skip=7,n=ncoli) # Column index marpost _ matrix(scan(file=labdir,skip=18),byrow=T, ncol=ncoli) rsum _ nrowd+sum(diric) # Each row sum is identical marpost _ marpost/rsum # Divide each element by row sum marpost _ apply(marpost,2,mean) # Take the mean of each column marpost _ marpost[order(colsel)] # ``Natural'' column orderWith file mp.ad, we issue the simpler Splus commands: labdir _ c('mp.ad') marpost _ scan(labdir,skip=6)In either file, a plotting scheme is inspired from the following: openlook() # Or any valid graphics device labx _ c('measurement period') laby _ c('change-point \n marginal posterior mean') plot(marpost,ylim=c(0,max(marpost)+0.05),xaxt='n', type = 'n', lty = 1, xlab = labx, ylab = laby) axis(1,at=seq(1,length(marpost),by=1)) for(j in seq(along = marpost)) { segments(j-0.2, 0, j-0.2, marpost[j]) segments(j-0.2,marpost[j],j+0.2,marpost[j]) segments(j+0.2,marpost[j],j+0.2,0) } As a last comment, it is understood that a bar constructed over the last period constitutes the mean probability of no change. GRAPH: Marginal Posterior Beta Densities For The Probability of a Change at a Requested Period
To produce this graph, we require the Dirichlet parameter updates
file mp.al created over the selected column indices. Suppose
we are interested in the mean beta density of column four, and say
that it is positioned at the second column index of mp.al
(we assume that the specified file size limit is enough to produce
a matrix of Dirichlet updates of up to a maximum of ten columns).
We suggest the following Splus commands:
labdir _ c('mp.al') xduf _ matrix(scan(file=labdir,skip=18), byrow=T,ncol=ncoli) ba _ xduf[,2] # Second column index - 4th data column We define a valid domain for the Beta density (0<x<1) first over a limited number of points. More domain points can be added to obtain finer plots. Consider a random variable X whose distribution is Beta with parameters ba and bb and E(X) = ba/(ba+bb). We define ba as the vector of Dirichlet parameter updates over a given column (here column 4); let rsum be the number of rows in data matrix plus the sum of Dirichlet priors; then bb=rsum-ba. To compute the Mean Posterior Beta Density, we provide the Splus function post.beta: dec _ 100 # Number of domain points xq _ seq(0.00001,0.99999,length=dec) rsum _ nrowd+sum(diric) bb _ rsum - ba fxbet _ post.beta(xq,ba,bb) # Mean Beta Density post.beta _ function(x,a,b){ # Compute beta distribution at each # domain point x averaged over beta # parameter vectors a,b # Assume that Fortran procedure is in dyn.load res _ .Fortran('postbet', m = as.integer(length(a)), n = as.integer(length(x)), fk = as.double(x), as.double(x), as.double(a), as.double(b))$fk return(res) } The user will note the Splus FORTRAN call to postbet. A similar procedure written in Splus would take too long to execute. Assuming that the object code postbet.o is in the current working directory, make a single call with dyn.load('postbet.o') during your Splus session. The source code is filed in the multiple-path change-point program package. The following plotting scheme is suggested for the mean Beta density fxbet: openlook() # Or any valid graphics device labx _ c('') laby _ paste(c('marginal posterior density \n'), c('given Dirichlet column updates ')) plot(xq[-c(1,length(xq))],fxbet[-c(1,length(fxbet))], type = 'l',lty = 1, xlab = labx, ylab = laby) Often, the above procedure is useful to compute the Beta densities of no change, i.e. the last column of the Dirichlet parameter updates. Of particular interest is when we have at our disposal two groups (say, a control and a treatment group). Besides the Beta densities of no change, we can compute the probablity that subjects in one group are more likely to change than subjects in another group. This can be estimated using a Mann-Whitney-type statistic (see manuscripts for details). Assume we have files mp.ap1 and mp.ap2 (not shown here) which represent the probability of no change in data sets 1 and 2, respectively. The Mann-Whitney statistic is computed as follows: labnoch1 _ c('mp.ap1') labnoch2 _ c('mp.ap2') noch1 _ scan(labnoch1,skip=9) noch2 _ scan(labnoch2,skip=9) prnoch _ prob.nochange(noch1,noch2)where the Splus function prob.nochange is displayed below: prob.nochange _ function(pn1,pn2) { # Mann-Whitney statistic wm # wm = wilc - term, wilc = Wilcoxon, # term = 0.5*k*(k+1), k=vector length of both pn1,pn2 # For large k, wm = wilc*sqrt(Var(wilc)) + e(wilc) - term, # wilc=uncorrected Z, # Pr(Pr(pn1)<Pr(pn2)) = 1 - wm/k^2 k <- length(pn1) if (k!=length(pn2)) stop('sample sizes different') term <- 0.5 * k * (k + 1) wilc <- wilcox.test(pn1, pn2, correct = F)$statistic if(names(wilc) == "rank-sum statistic W") { wm <- wilc - term } else { ewilc <- 0.5 * k * (2 * k + 1) svars <- sqrt((1/12) * k^2 * (2 * k + 1)) wm <- wilc * svars + ewilc - term } pwm _ 1 - wm/k^2 return(ifelse(pwm>=0&pwm<=1,pwm,NA)) } GRAPH: Difference in rates before and after a change-point for a given subject (row)
Here, we focus on the Gamma parameter updates over a
selected row. Files mp.g1 and mp.g2 represent Gamma
parameter updates for Distributions 1 and 2, respectively.
The first two columns represent the Gamma shapes and scales,
where E(X)=shape*scale. The last column indicates whether
or not a change-point has occurred (yes=1, no=0). Given the
parameter update values, we wish to assign to each positive
domain point, a Gamma density averaged over the number of
Gibbs cycles. Note that in Distribution 2, only updates for
which a change has occurred are used for calculations.
labgam1 _ c('mp.g1') labgam2 _ c('mp.g2') rowsel _ scan(file=labgam1,skip=9,n=nrowi) # Row index k _ 8 # Choose row index element that exists if (k>length(rowsel)|k<1) stop('\n \n ****** bad index element') ind _ (k-1)*3 + 1 # Matrix column element #Gamma updates for Distributions 1 and 2 gam1 _ matrix(scan(file=labgam1,skip=20),byrow=T, ncol=3*length(rowsel)) gam2 _ matrix(scan(file=labgam2,skip=20),byrow=T, ncol=3*length(rowsel)) #Gamma shape and scale - Distribution 1 a1 _ gam1[,ind] b1 _ gam1[,ind+1] #Gamma shape and scale - Distribution 2 (change-points only) a2 _ gam2[gam2[,ind+2]>0,ind] b2 _ gam2[gam2[,ind+2]>0,ind+1] #Domain determination (*****to be refined to taste*****) #Care should be taken in choosing domain points. In the #beginning, use only a limited number of domain points. #Once the location of f>0 is found, add more domain points to #obtain a smoother plot. This is necessary, because Splus #can run into memory allocation problems. dec _ 20 xq1 _ seq(0,1,length=dec) xq2 _ seq(1,8,length=dec) xq _ unique(c(xq1,xq2)) #Compute Gamma densities fx1 _ dens.gam(xq1,a1,b1) fx2 _ dens.gam(xq2,a2,b2) For convenience, procedures dens.fgam and f.gam (called by dens.gam) are given below: dens.gam _ function(x,a,b) { # compute Gamma density whose domain is x and # vector elements parameters a,b # Output is mean density at each domain point. mrows _ length(x) ncols _ length(a) fmat _ t(matrix(x,mrows,ncols)) fmat1 _ apply(fmat,2,f.gam,a,b) return(apply(fmat1,2,mean)) } f.gam _ function(x,a,b) { #Divide the domain points by scale b to make #density function scaleless (i.e. dgamma(a,1)) #Divide result by scale b to make density function #scaled (i.e. density gamma(a,b)) #return(dgamma(x/b,a)) #scaleless result return(dgamma(x/b,a)/b) #scaled result } A plotting scheme is given as follows: openlook() # Or any valid graphics device labx _ paste('observation number',rowsel[k]) laby _ paste('probability density') plot(c(xq[1],xq[length(xq)]),c(0,max(fx1,fx2)+0.01), type ='n', lty = 1, xlab = labx, ylab = laby) lines(xq1,fx1,lty=6) # Rate before change lines(xq2,fx2,lty=1) # Rate after change leg.names _ c('Rate before change', 'Rate after change') legend(max(xq)*0.5,max(fx1,fx2)*0.5,leg.names, bty="n", lty=c(6,1)) The above were demonstrations of possible graphical analysis and estimates. Further insight can be gained by generating, for instance, summary figures for the differences between two distributions (see the referenced manuscripts). It is also possible to output other distributions of interest simply by modifying the multiple-path change-point programs. Most of the information of interest is found in the FORTRAN file itp.f . Note of general interest
It is a good idea to run mpcp with different seeds and compare
results. Substantial differences which occur from run to run
may be caused by lack of convergence. This could require a
greater number of Gibbs cycles or a change in prior parameters
or both. It is not surprising to note more variation in the
Normal distribution than in the Poisson, because of the greater
number of priors to choose from. In all cases however, it is wise
to scrutinize the parameter file and the results. Choice
of seed in this program ranges from -2147483648 to 2147483647.
Copyright
© Copyright Lawrence Joseph, David Wolfson and Roxane du Berger 1994
mpcpn and mpcpp are programs written by Lawrence Joseph, Roxane du Berger and David Wolfson at the Division of Clinical Epidemiology, Department of Medicine, Montreal General Hospital. These programs are an implementation of the manuscripts mentioned at the beginning of this document. You are free to use these programs, for non-commercial purposes only, under two conditions:
Please e-mail all comments, questions or responses to lawrence.joseph@mcgill.ca . |