Tuesday, July 26, 2011

Digging into JAGS

I've heard people characterize Martyn Plummer's JAGS as "partially interpreted" or "interpreted".  This typically comes up as an explanation for why it's "slow". I've always been puzzled by this claim: it's not!  The original specification in the BUGS language is compiled into a DAG using yacc/flex (I'm pretty sure).  The DAG is stored in memory with one node for each scalar quantity or entry in vector/matrix.  There are two important performance implications:

First, large models blow up in size fairly quickly since each vector/matrix entry has to be stored as a node.  It's a less efficient format than, for example, a matrix, but that's because it carries additional information (e.g.-a name, which you are very thankful for when debugging).

Second, when updating parameters, the algorithm must walk the entire graph to update one iteration.  That's a lot of steps and this does impact speed.  One could make more efficient samplers which updated batches of related parameters, but that requires an algorithm to figure out how to do that based on an analysis of the DAG.  I assume that's what the GLM module must do in order to discover which parameters can be block updated, but I haven't looked at that part of the JAGS C++ code yet.

Finally, one question which comes up frequently is whether JAGS can use multiple cores/processors.  The current answer is generally no, except when the underlying linear algebra library is multi-threaded.  Even when that is possible, and you have code like A %*% b, you might get a performance penalty just because the matrix multiplication is re-done every time that an entry in 'b' is updated.  The best possibility I can see for multi-core JAGS (take it with a grain of salt, I'm just sticking my node into the C++ code) is to use the conditional independence in a complex model to hand out piece of the DAG to separate cores for updating.

The way this currently works is that JAGS does a topological sort of the DAG such that if A|B, B is updated before A.  This is a partial ordering because if A|B and A|C, the updating order could be B->C->A or C->B->A.  I don't know if JAGS always comes up with a single order, or if it gets switched depending on the run.  This is where there's some opportunity for parallelism because B and C could be updated on separate cores, followed by updating A.  There would be some overhead involved in figuring out when to use multiple cores, and in parsing out pieces of the graph to different processes so it's not clear what sort of benefits we would get (e.g.-B and C could be sizable sub-graphs).  I know there are heuristic algorithms out there for solving this problem, and an illustration in R for a random DAG goes something like this:

library(igraph) # Make a DAG g <- barabasi.game(10) # Show it plot(g,layout=layout.kamada.kawai) # Calculate which parameters each parameter is conditional on parentage <- list() for(i in 0:9) {     parentage[[as.character(i)]] <- setdiff(subcomponent(g,i,'in'),i) } # Calculate group for calculation: group <- vector(mode='numeric', length=length(parentage)) names(group) <- names(parentage) for ( i in names(parentage) ) {     group[i] <- length(parentage[[i]]) } # Sort group from fewest to most parents: sampleOrder <- names(sort(group)) # Split into groups for multi-core sampling: sampleGroups <- split( x = sampleOrder, f = factor(sort(group)) )

The process goes something like this:
1. spawn a process to calculate density for each level i node.
2. check for results at level i, when results arrive, check if
any level i+1 node has all its dependencies satisfied.
1. yes: restart a process there with i+1 at #1