The Singing Dodo: September 2011

Friday, September 30, 2011

JAGS, dinterval

Edit: I posed this question to Martyn Plummer and got a very useful response. I'll post it once I figure out the details.

One of the hazards of using something like JAGS, with complicated innards written by someone else, is that you don't have much of a chance to the problems coming ahead. I happen to have a continuous process which is observed only in discrete units and it happens to mimic the dinterval distribution in JAGS. Unfortunately I recently had to change the number of breaks used by dinterval by an order of magnitude and it turns out that it bogs down fairly quickly. Here's an example:

cat('
model{
    for ( i in 1:N) {
        x[i] ~ dunif(0,10^K)
        o[i] ~ dinterval(x[i],breaks)
    }
}
', file = './dintervalProblem.bugs'
)

library(rjags)
mod <- list()
samp <- list()
timings <- list()
k <- c(3,4,5)
for ( i in 1:length(k) ) {
    N <- 100
    K <- k[i]
    x <- runif(N,0,10^K)
    o <- ceiling(x)
    breaks <- seq(0,10^K,1)

    timings[[i]] <- system.time(
        {
            mod[[i]] <- jags.model(
                file = './dintervalProblem.bugs',
                data = list(
                    N = N,
                    K = K,
                    o = o,
                    breaks = breaks
                ),
                inits = list(
                    o = o,
                    x = x
                ),
                n.chains = 1,
                n.adapt = 1000,
                quiet = FALSE
            )

            samp[[i]] <- jags.samples(
                model = mod1,
                variable.names = c('x','o'),
                n.iter = 1000,
                thin = 1
            )
        }
    )

}

My digging in the JAGS internals hasn't turned up anything yet which could be responsible for this behavior. The value of the dinterval node is calculated by:

static unsigned int value(vector const &par, unsigned int ncut)
{
    double t = T(par);
    for (unsigned int i = 0; i < ncut; ++i) {
  if (t <= CUTPOINTS(par)[i])
      return i;
    }
    return ncut;
}

It's not like there's a separate node for every breakpoint... though the number of nodes generated in the graphs does go up with K.

Suggestions? (Especially all those folks coming here from Andrew Gelman's blog...)

Thursday, September 29, 2011

Columbia Workshop, Katia Koelle would've been a good participant

I've been wracking my brain for other non-obvious applications of the Euler characteristic to biological data and I was reminded of some work[1] done by Katia Koelle. My understanding of the work is that she sought to characterize whether a virus evolved in the face of herd immunity by accumulating a wide diversity of anitgenic variants or whether it would tend to avoid herd immunity via replacement of older antigenic variants by newer ones. On a quick re-reading of her paper (I've seen this work presented at ESA twice), I don't think there is much of a connection with Robert Adler's pursuit of stochastic algebraic topology, but even if there was I'm probably not the person to see it.

Figure 1: Katia Koelle's opening slide at the 2009 ESA in ABQ, NM looked more or less like this. I found it very disturbing every time somebody sneezed during her presentation.

On the other hand I've found this work to be an impressive combination of practical insight combined with the sort of mathematical and statistical detail that the Columbia workshop was targeting. It's a shame nobody thought to invite her.

-----------------------------------------------
[1]Koelle, K., Ratmann, O., Rasmussen, D.A., Pasour, V., Mattingly, J. (2011) A dimensionless number for understanding the evolutionary dynamics of antigenically variable RNA viruses. Proceedings of the Royal Society, Series B.

Wednesday, September 28, 2011

Columbia Workshop, Robert Adler

Warning: my understanding of these talks was at times pretty thin and any error in discussing the topics are almost certainly mine.

Robert Adler's talk focused on uses of the Euler characteristic to summarize complex shapes such as the brain in 3-D. The fact that a number derived by algebraic topologists would nicely do that sort of thing for you is not a big surprise---I gather that's one of goals of the field[1]. It was interesting to hear the Euler characteristic defined in terms of the pattern of adding and subtracting alternating features of a shape because it reminded me of a children's book I recently read to my son which uses a number like that as part of a puzzle.

Figure 1: Algebraic topology is everywhere! My son now knows how to calculate the Euler characteristic for solid straight-sided shapes.

I went back and checked and it turns out the definitions are identical and indeed the Euler characteristic is defined in that book for polyhedra.

More interesting from a data visualization point of view was the presentation of a (I gather common) method for representing the components (e.g.- verteces, edges, and faces in the case of polyhedra) which make up a complex shape and how they are "born" and "die" when a descending threshold is applied to the object in one dimension.

The thresholding is best explained using a landscape (mountains, valleys, that sort of thing) in three dimensions. The threshold is a plane at a set height in 3-D space, starting well above the landscape. As the threshold descends it first contacts the highest points on the landscape and those features are "born" in the visualization (barcode), as it descends through the landscape there is a set of rules for when new components are "born" or "die". Robert Adler was kind enough to mention some of these topological rules in passing to give us a sense of their flavor but he spared us the details[1].


Figure 2: Thresholding on a 4-D space with the resulting barcode (3 categories of bars). Each bar represents the lifespan from birth (on left) to death (on right) of a feature. This could be applied to summarizing the brain. [2]

My understanding of the main point of the talk was that these barcodes were a convenient way of summarizing multi-dimensional shapes and they were currently being applied to problems such as the comparison of brain scans from alzheimer's patients and matched controls. According to Adler, the main difficulty with making these comparisons (not that it's stopping anybody) is that there is no good understanding of the statistical properties of these barcodes, therefore no method for defining a null hypothesis or statistical model for the data (what would non-informative priors on barcodes look like anyway?)

Obviously these techniques are applicable to biology since brain scans were one of the examples Adler discussed, but I wonder if there are other less obvious places where these approaches are relevant. After all, the barcodes are generated by thresholding an n-1 dimensional object living in an n-dimensional space. It doesn't have to be a physical object. I assume that you need some local autocorrelation to get any coherent barcode (think of the autocorrelation of height in the 2-D landscape in 3-D space), but other than that there seem to be few requirements for crunching the numbers. This is, of course, a discussion I should have tried with Dr. Adler while he was in front of me, but I hadn't digested the talk yet so I missed my chance.

I guess the question I need to spend some time with is: what other types of spaces can be summarized in this way? The local autocorrelation in the thresholded values gives some clues but it's not very clear. For example, can pedigrees be visualized this way (based on the continuity in genetic similarity)? Phylogenies more generally? Is there a point to that summary?

Figure 3: The Alma Mater at Columbia says "wha'dup!"

-----------------------------
[1] Hurrah for fields and sub-fields!
[2] http://webee.technion.ac.il/people/adler/research.html

The Flunge

In foil and epée fencing, the flêche is a beautiful maneuver used to quickly close distance when the attacker is fairly certain of getting a touch. If it looks like the attacker is running (i.e.-upright) they're doing it wrong.

Figure 1: A nice successful flêche, in epée.*

In sabre fencing crossing one's feet while moving forwards is against the rules (I hear the sport devolved into too much charging and slashing so they eliminated the charging, kept the slashing. I prefer it that way). However, trying to get sabreurs to calm down is generally counter-productive so the flunge quickly arrived on the scene. Now the goal is to score the touch before your nose hits the floor. It's still a rather artistic move.

Figure 2: A nice flunge, in sabre. There really is no going back from that.

Figured that picture could use a few words.

----------------------------------
* Thank you: http://www.flickr.com/photos/adonovan/418225607/lightbox/

Tuesday, September 27, 2011

Columbia workshop, pre-pre-post

Note: a prelude to writing about two and a half days of math.

Much of ecology and evolutionary biology lives firmly in the world of small data[1] and the field has done quite well (thank you very much) with regression and [AN,MAN,AMO,ANC]OVA[2]. There's a pretty strong tradition of teaching good frequentist statistics and my own introduction to that, by Bill Bradshaw, was fairly complete (if a little brutal). Bill's approach was to go step by step, and thoroughly cover the mechanics of each technique he presented.

Bill's course had to be a little brutal because the mathematics requirements for biology are generally pretty low. We biologists act as though algebra is frightening to students, linear algebra should only be brought up in passing and quickly swept under the rug, and calculus is beyond the pale. Anyone who would like to avoid that general attitude is faced with the problem that most biology students are unprepared for moderately mathematical subjects. Bill's solution was to go at it like his was the only course students had signed up for and expect everyone to keep up. Much of the class looked something like this:


Figure 1: Bill Bradshaw leads the charge against the axis of evil (ANCOVA, ANOVA, and regression).

The general attitude Bill maintained was that you needed a solid statistical background to do ecology and there was no reason to doubt that a biologist could "get it". Besides, as mathematically inclined[3] types are likely to tell you, the math you need to get into for the purposes of understanding basic statistics (in the mathematical sense), hardly counts as math.


Figure 2: Peter Pappas says, "you see this?", "This is _NOT_ math."

The real barrier to introducing more advanced math to biology courses (you know, beyond spreadsheets) is that math builds on itself. As a biologist I might encounter it first in a population ecology book as matrix multiplication and my eyes would just conveniently glaze over until the little "eq. 1" on the right margin passes. The fact is that a linear algebra class devotes significant time to simple mechanics and its not surprising that an untrained person will not just "get it" on the spot. That means either raising the bar for course requirements or building the time for math into biology courses. Unless your whole department is excited about raising standards, the second route is probably the only one that's open the an instructor. You just have to go it alone.

Figure 3: The author, about to pwn calculus in front of an ecology class (without departmental support).

All this is to say that there's a gulf between biology and mathematics/statistics which is damaging to the quality of work in biology. I think it also shows up in published work, but that's for another day (XKCD #303 only holds for so long). I'm interested in advances in mathematics and statistics, and how they are relevant broadly to biology research (as well as my own work). This led me to spend two and a half lovely days in NYC this past weekend attending a workshop at Columbia University which I intend to write about, once I've had the opportunity to digest the lectures a little more.

-----------------------------
[1] Yes, I know, NEON will change everything, as will Data Dryad, etc...
[2] You know, it's robust to stuff.
[3] The word "inclined" should have been linked to Peter Pappas, Vassar College professor of mathematics who enjoyed referring to calculus/linear algebra/multivariate calculus as "trivial" and "not math", much to the dismay of his ~~captives~~ students. Can't find him right now though.

Public Service Announcement

You know those vials in the freezer? The ones that would really love to have a date with some propanol/hexane before getting shot into the HPLC? The fish scales sitting in envelopes begging to have their rings counted? The bird feathers you really think you should spec just because they're there? If you don't know what question you want to use them to ask, they're not data, they're a waste of time.

Thank you for listening to this announcement.

Had this been a real emergency, your PI would have run screaming into the room waiving those vials in the air.

Wednesday, September 14, 2011

HPV vaccine

At the recent debate of hopefuls for the Republican presidential nomination, Michelle Bachman apparently made a series of ugly false claims about the safety record of the Human Papilloma Virus (HPV) vaccine. It was the usual dribble of fear-spasms calculated to appeal to a rabid anti-feminist sub-group of the electorate.

It was ugly because HPV kills: along with the initial, rather unpleasant* disease, HPV causes cervical, vulvar, vaginal, penile, anal, and oropharyngeal cancer. In fact it causes most cervical cancers which mean that the HPV vaccine might be the first pharmaceutical which lets us cure a cancer. HPV vaccine also has a shiny safety record, better than the safety record of Tylenol/aspirin/ibuprofen which I'm sure Bachman regularly throws down her gullet. So, here's the American Academy of Pediatrics (AAP) to set the record straight:

“The American Academy of Pediatrics would like to correct false statements made in the Republican presidential campaign that HPV vaccine is dangerous and can cause mental retardation. There is absolutely no scientific validity to this statement. Since the vaccine has been introduced, more than 35 million doses have
been administered, and it has an excellent safety record."

Furthermore, this is only an effective way to cure cancer population-wide if it happens in the right time-frame, i.e.-before sexual activity.

“The American Academy of Pediatrics, the Centers for Disease Control and Prevention, and the American Academy of Family Physicians all recommend that girls receive HPV vaccine around age 11 or 12. That’s because this is the age at which the vaccine produces the best immune response in the body, and because it’s
important to protect girls well before the onset of sexual activity. In the U.S., about 6 million people, including teens, become infected with HPV each year, and 4,000 women die from cervical cancer. This is a life-saving vaccine that can protect girls from cervical cancer.”

Dear parents: do it. Dear everybody else: if you're considering voting for this woman, or any other Republican candidate, please convince them to engage in political grandstanding less likely to cause the death of women and the destruction of families.
---------------------
* Sorry if you clicked that.

Sunday, September 11, 2011

Three parameter t distribution (in JAGS, but not numpy.random)

I can't seem to find a three-parameter t-distribution in numpy/scipy and I know JAGS has one. It turns out it comes from the "jrmath" library which is a part of R.

/*
 *  Mathlib : A C Library of Special Functions
 *  Copyright (C) 1998 Ross Ihaka
 *  Copyright (C) 2000-2008 The R Development Core Team
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, a copy is available at
 *  http://www.r-project.org/Licenses/
 *
 *  DESCRIPTION
 *
 *    Pseudo-random variates from a t distribution.
 *
 *  NOTES
 *
 *    This function calls rchisq and rnorm to do the real work.
 */

#include "nmath.h"

double rt(double df, RNG *rng)
{
    if (ISNAN(df) || df <= 0.0) ML_ERR_return_NAN;

    if(!R_FINITE(df))
  return norm_rand(rng);
    else {
/* Some compilers (including MW6) evaluated this from right to left
  return norm_rand(rng) / sqrt(rchisq(df) / df); */
  double num = norm_rand(rng);
  return num / sqrt(rchisq(df, rng) / df);
    }
}

And from modules/bugs/distributions/DT.cc:

double DT::r(vector const &par, RNG *rng) const
{
    return rt(DF(par), rng) / sqrt(TAU(par)) + MU(par);
}

Which gets you the three-parameter version, and suggests that the following NumPy code should work nicely:


def student_t_3(loc, df, scale, size):
    tau = 1/(scale**2)
    t_draw = np.random.standard_t(df = df, size = size)
    scaled_t_draw = t_draw / np.sqrt(tau)
    return(scaled_t_draw + loc)

...and in fact generating a large sample from this, as well as from R, followed by a quantile-quantile plot gives some confirmation that this works.

This could be considered the historical-materialist approach to coding statistical functions. I'm sure there is a reference text for this sort of thing somewhere but it's not the sort of thing which jumps to mind for a biologist.

It helps to know that numpy.random is coded expecting that everything is parameterized by scale parameters, whereas JAGS is coded expecting that (almost?) everything is parameterized with rate parameters.

Laplace mixture, bad code

Hm. Is this a reasonable implementation of a laplace mixture? The try/except seems to be an ugly way of dealing with the fact that int/float don't return a length, but it gets around testing for what they are...

import numpy as np


def laplace_mix(loc, scale, size, pvals = None):
    try:
        dim = len(loc)
        if dim != len(scale):
            raise NameError("Different number of location and scale parameters.")
    except:
        dim = 1
        loc = [loc]
        scale = [scale]
    if pvals is None:
        pvals = [ 1/dim for i in range(dim) ]
    args = zip(loc, scale)
    indicate = np.random.multinomial(n=1, pvals=pvals, size=size)
    where = np.where(indicate == 1)[1]
    out = np.zeros(size)
    for i in range(size):
        out[i] = np.random.laplace( loc = loc[where[i]], scale = scale[where[i]], size = 1 )
    return(out)

I'm still curious if there is a way of avoiding the for loop via numpy magic, but this is not a performance bottleneck so...

Feminism

A couple walks by and the man "play" punches the woman in the arm. "Don't hit me in public." she says, clearly frustrated.

The Singing Dodo