Tuesday, July 26, 2011

Digging into JAGS

I've heard people characterize Martyn Plummer's JAGS as "partially interpreted" or "interpreted".  This typically comes up as an explanation for why it's "slow".  I've always been puzzled by this claim: it's not!  The model specification in the BUGS language is compiled into a DAG using yacc/flex (I'm pretty sure).  The DAG is stored in memory with one node for each scalar quantity, including each entry of a vector or matrix.  There are two important performance implications:

First, large models blow up in size fairly quickly, since each vector/matrix entry has to be stored as its own node.  It's a less efficient representation than, for example, a plain matrix, but that's because each node carries additional information (e.g., a name, which you are very thankful for when debugging).

Second, when updating parameters, the sampler must walk the entire graph once per iteration.  That's a lot of steps, and it does impact speed.  One could write more efficient samplers that update batches of related parameters, but that requires an algorithm to figure out which batches are possible based on an analysis of the DAG.  I assume that's what the GLM module does in order to discover which parameters can be block updated, but I haven't looked at that part of the JAGS C++ code yet.

Finally, one question which comes up frequently is whether JAGS can use multiple cores/processors.  The current answer is generally no, except when the underlying linear algebra library is multi-threaded.  Even then, with code like A %*% b, you might take a performance penalty simply because the matrix multiplication is redone every time an entry of 'b' is updated.  The best possibility I can see for a multi-core JAGS (take it with a grain of salt, I'm just sticking my nose into the C++ code) is to use the conditional independence in a complex model to hand out pieces of the DAG to separate cores for updating.

The way this currently works is that JAGS does a topological sort of the DAG so that if A depends on B (A|B), B is updated before A.  This is only a partial ordering: if A|B and A|C, the updating order could be B->C->A or C->B->A.  I don't know whether JAGS always comes up with a single order, or whether the order can switch from run to run.  This is where there's some opportunity for parallelism, because B and C could be updated on separate cores, followed by updating A.  There would be some overhead involved in figuring out when to use multiple cores, and in parceling out pieces of the graph to different processes, so it's not clear what sort of benefit we would get (although B and C could be sizable sub-graphs).  I know there are heuristic algorithms out there for solving this problem, and an illustration in R for a random DAG goes something like this:


library(igraph)

# Make a random DAG (preferential attachment graphs are acyclic)
g <- barabasi.game(10)

# Show it
plot(g, layout = layout.kamada.kawai)

# For each node, find its ancestors, i.e. the parameters it is
# (directly or indirectly) conditional on
parentage <- list()
for ( i in V(g) ) {
    parentage[[as.character(i)]] <- setdiff(subcomponent(g, i, mode = 'in'), i)
}

# Count the ancestors of each node; this acts as its "level" for scheduling,
# since a node always has more ancestors than any of its ancestors do
group <- vector(mode = 'numeric', length = length(parentage))
names(group) <- names(parentage)
for ( i in names(parentage) ) {
    group[i] <- length(parentage[[i]])
}

# Sort nodes from fewest to most ancestors:
sampleOrder <- names(sort(group))

# Split into groups that could be handed to separate cores:
sampleGroups <- split( x = sampleOrder, f = factor(sort(group)) )


The process goes something like this:
  1. Spawn a process to calculate the density for each node at level i.
  2. Check for results at level i; when results arrive, check whether
       any level i+1 node now has all of its dependencies satisfied.
    1. Yes: start a process for that node, treating i+1 as the new level in step 1.
    2. No: return to step 2.
It's a little more complicated than that, because you don't necessarily have enough cores for all the nodes at one level, you need to group pieces of the graph to minimize interprocess communication, etc... but it's been done.
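
A rough sketch of the level-by-level idea, continuing from the sampleGroups object above (this is my own illustration, not JAGS code; the updateNode() function is a hypothetical stand-in for the real per-node work, and mclapply requires a Unix-alike):


library(parallel)

# Hypothetical stand-in for the real per-node update; an actual sampler
# would evaluate the node's full conditional and draw a new value here.
updateNode <- function(node) {
    Sys.sleep(0.01)  # pretend to do some work
    paste('updated node', node)
}

# Walk the levels in order.  Nodes within a level cannot depend on each
# other (they have the same ancestor count), so they can be farmed out
# to separate cores with mclapply.
for ( level in names(sampleGroups) ) {
    results <- mclapply(sampleGroups[[level]], updateNode, mc.cores = 2)
    print(unlist(results))
}
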
  
Now if only I could win the lottery and pay somebody to do this...

Metropolis algorithm as an R closure

I've been wanting to understand closures in R a little better for a while, and MCMC algorithms provide a nice use case.  In coding an MCMC algorithm, it's easy to write a function that takes you one step forward.  In most naive implementations, what you see around it is a pile of code that manages the data and calls that function repeatedly.  The point of a closure is that you can wrap the state-tracking code around the function which takes the MCMC steps and get a nice package which is easy to call repeatedly.  A quick implementation looks like so:


mcmcChain <- function(logTarget, proposal, cntrl) {
    if ( !all(c('nIter','inits') %in% names(cntrl)) ) {
        stop('\'cntrl\' argument must be a list with \'nIter\' and \'inits\' elements')
    }
    nIter <- cntrl[['nIter']]
    # Chain states, one column per iteration:
    out <- matrix( data = NA, nrow = length(cntrl[['inits']]), ncol = nIter )
    out[,1] <- cntrl[['inits']]
    # Log density of the current state:
    logD <- vector( mode = 'numeric', length = nIter )
    logD[1] <- logTarget(out[,1])
    # Log acceptance ratio and proposed values:
    logA <- vector( mode = 'numeric', length = nIter )
    prop <- matrix( data = NA, nrow = length(cntrl[['inits']]), ncol = nIter )
    chain <- function() {
        for ( i in 2:nIter ) {
            prop[,i] <<- proposal(out[,i-1])
            logA[i] <<- logTarget(prop[,i]) - logD[i-1]
            # Metropolis acceptance (assumes a symmetric proposal):
            if ( log(runif(1)) < logA[i] ) {
                out[,i] <<- prop[,i]
                logD[i] <<- logD[i-1] + logA[i]
            } else {
                out[,i] <<- out[,i-1]
                logD[i] <<- logD[i-1]
            }
        }
        o <- list(
            sample = out,
            proposal = prop,
            logD = logD,
            logA = logA
        )
        return(o)
    }
    return(chain)
}

The assumptions for this code are that the proposal function takes the parameter vector and returns a new vector with the same structure, and that the log target density takes the parameter vector and returns a scalar.  The output is a list with the sample, all the proposed values, the log density of the chain, and the log acceptance ratio.  It can be run to fit a normal mean to three concocted values (0.1, 0, -0.1) like so; the sample is actually generated by the out <- tp() call.


# Log target: normal log likelihood for three concocted values
logTarget <- function(mu) { sum(dnorm(x = c(0.1, 0, -0.1), mean = mu, sd = 1, log = TRUE)) }
# Symmetric random-walk proposal
proposal <- function(mu) { mu + runif(1, -0.1, 0.1) }
cntrl <- list( nIter = 1000, inits = 3 )
tp <- mcmcChain(
    logTarget = logTarget,
    proposal = proposal,
    cntrl = cntrl
)
par(mfrow = c(2,3))
out <- tp()
plot( x = 1:cntrl[['nIter']], y = out$sample, pch = '*', xlab = 'iteration', ylab = expression(mu) )
plot( x = 1:cntrl[['nIter']], y = out$logD, pch = '*', xlab = 'iteration', ylab = 'log density' )
plot( x = 1:cntrl[['nIter']], y = exp(out$logA), pch = '*', xlab = 'iteration', ylab = 'a' )
hist( x = out$sample, xlab = expression(mu), breaks = cntrl[['nIter']]^(1/2) )
hist( x = out$logD, xlab = 'log density', breaks = cntrl[['nIter']]^(1/2) )
hist( x = exp(out$logA), xlab = 'a', breaks = cntrl[['nIter']]^(1/2) )


This is pretty flexible, since you can substitute any target density and proposal density; in my own work, the first time I wanted to do something more complicated was in dealing with structured data (e.g., mark-recapture data).  Speed can be an issue with MCMC samplers written in R, but in any complex problem it is generally the calculation of the target density which matters most.  That could be implemented in C++ (using Rcpp) and used with the above code.  Please let me know (to satisfy my curiosity) if you use this code.
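
A minimal sketch of what that might look like, continuing from the example above (just an illustration; logTargetCpp is a hypothetical name of mine, and the C++ body simply reimplements the toy normal log likelihood):


library(Rcpp)

# Compile a C++ version of the toy log target on the fly
cppFunction('
double logTargetCpp(NumericVector mu) {
    NumericVector x = NumericVector::create(0.1, 0.0, -0.1);
    double ll = 0.0;
    for (int i = 0; i < x.size(); ++i) {
        ll += R::dnorm(x[i], mu[0], 1.0, 1);  // last argument requests the log density
    }
    return ll;
}
')

# Drop it into the same closure-based sampler
tpCpp <- mcmcChain( logTarget = logTargetCpp, proposal = proposal, cntrl = cntrl )
outCpp <- tpCpp()
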

Monday, July 18, 2011

Minimalist MCMC, in R

Christian Robert recently posted a minimalist piece of R code implementing a Metropolis sampler for an arbitrary target function. Here's the code with a few edits:

it <- 10^5
# Truncated standard normal on (-2, 2); zero density outside the interval
target <- function(x) { (x > -2) * (x < 2) * dnorm(x = x, mean = 0, sd = 1) }
mcmc <- rep(0, it)
for ( t in 2:it ) {
    prop <- mcmc[t-1] + runif(n = 1, min = -0.4, max = 0.4)
    # Metropolis acceptance with a symmetric uniform proposal
    if ( runif(1) < target(prop) / target(mcmc[t-1]) ) {
        mcmc[t] <- prop
    } else {
        mcmc[t] <- mcmc[t-1]
    }
}
hist(mcmc)
plot( x = 1:it, y = mcmc, pch = '*' )




This was just throw-away code used to answer a simple question about MCMC, but it shows that the code can be pretty minimal.  People writing their first MCMC samplers, even if they are familiar with programming, often make a predictable series of mistakes, and writing some example samplers in this format that deal with those problems might be a useful way of guiding them through it.

Some of the easily avoided problems which trip people up:
  1. Trying to work with densities, rather than log-densities.  It fails fairly quickly.
  2. Failing to deal with boundaries of parameter space (the sampler above _does_ deal with them in the target function).
  3. Ignoring obvious caching opportunities.
A sketch addressing the first and third points follows; those are probably enough goals for minimal MCMC code for now.
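
Here is what the same sampler looks like once it works on the log scale and caches the current log density (my own variation on the code above, not Robert's; it evaluates the target once per iteration instead of twice):


it <- 10^5
logTarget <- function(x) {
    if ( x <= -2 || x >= 2 ) return(-Inf)    # handle the support boundary explicitly
    dnorm(x = x, mean = 0, sd = 1, log = TRUE)
}
mcmc <- rep(0, it)
logDCur <- logTarget(mcmc[1])                # cached log density of the current state
for ( t in 2:it ) {
    prop <- mcmc[t-1] + runif(n = 1, min = -0.4, max = 0.4)
    logDProp <- logTarget(prop)
    if ( log(runif(1)) < logDProp - logDCur ) {
        mcmc[t] <- prop
        logDCur <- logDProp                  # refresh the cache only on acceptance
    } else {
        mcmc[t] <- mcmc[t-1]
    }
}
hist(mcmc)
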

Saturday, July 16, 2011

Histogram in Terminal

I'm sure there's an R package with this kind of thing out there, but I didn't find it.  Often, when working with MCMC on a server over ssh, you need to get a rough idea about parameter distributions and convergence.  This code outputs a reasonable histogram on the command line without needing graphics.  Something similar for traces would be nice.



    histo <- function(x, ...) {
        # Compute the histogram without plotting it
        o <- hist(x, plot = FALSE, ...)
        # Available terminal width, as set by options(width = ...)
        w <- getOption('width')
        # Bin midpoints become the row labels
        labels <- signif(o[['mids']], 2)
        pad <- max(nchar(labels))
        # Scale bar lengths to the width left over after the labels
        densities <- o[['density']] / max(o[['density']]) * (w - pad)
        for ( i in 1:length(densities) ) {
            padI <- pad - nchar(labels[i]) + 1
            if ( labels[i] >= 0 ) {
                cat(' ', labels[i], rep(' ', padI - 1), '|', sep = '')
            } else {
                cat(labels[i], rep(' ', padI), '|', sep = '')
            }
            cat(rep('*', densities[i]), sep = '')
            cat('\n', sep = '')
        }
        return(o)
    }

    # 'tp' here is assumed to be a numeric vector of MCMC draws
    h <- histo(tp, breaks = 30)

The length of the (horizontal) histogram bars comes from the terminal width option, so if it's printing too wide for you, set "options(width = ???)" to something that works.
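
For example (just a usage sketch; any numeric vector of draws will do in place of the rnorm() call):

    # Narrow the output to fit an 80-column terminal, then draw the histogram
    options(width = 80)
    h <- histo(rnorm(10^4), breaks = 30)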