Tuesday, December 20, 2011

The Believer's Bollocks: Douthat can't keep his hands off Hitchens' corpse

The NY Times columnist Ross Douthat recently wrote an op-ed on the untimely death of Christopher Hitchens.  In the op-ed, Douthat mostly discusses how Hitchens was able to maintain a warm and cordial relationship with people, such as William Lane Craig, who hold remarkably nasty opinions and parade them as moral teachings.  I fully agree with Douthat that this was one of the things which made Hitchens so powerful and it's an approach that's worth cultivating.  If you can't calmly discuss the ethical and logical failures of your opponent's position with them, you're not doing yourself any favors.

Friday, December 16, 2011

Everybody's gotta go sometime

Hitchens had a good run and I enjoyed watching it.

Tuesday, December 13, 2011

Bayesian Balls

I've been working on broadening my understanding of Bayesian thinking using Christian Robert's book The Bayesian Choice, and one of the early examples in the book had me confused.  The example is 1.2.2, which goes like so:
A billiard ball \(W\) is rolled on a line of length one, with a uniform probability of stopping anywhere.  It stops at \(p\).  A second ball \(O\) is then rolled \(n\) times under the same assumptions and \(X\) denotes the number of times the ball \(O\) stopped on the left of \(W\).  Given \(X\), what inference can we make on \(p\)?
Robert states that the prior distribution on \(p\) is uniform on \([0,1]\).  The first point of confusion for me was that the prior distribution in this example comes not from prior belief, but from the prior experiment where the ball \(W\) is rolled.
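Robert's claim is easy to check by brute force.  Here is a quick sketch (in Python; the simulation is mine, not the book's): with a uniform prior and \(X \sim \text{Bin}(n,p)\), conditioning on \(X = x\) leaves \(p\) distributed as Beta\((x+1, n-x+1)\), with mean \((x+1)/(n+2)\).

```python
# Brute-force check of Example 1.2.2: roll W to get p, roll O n times,
# and look at p only in the simulations where X hit a chosen value.
import random

random.seed(1)
n, x_target = 10, 3
kept = []
for _ in range(200_000):
    p = random.random()                             # ball W: p ~ U(0, 1)
    x = sum(random.random() < p for _ in range(n))  # ball O, n rolls
    if x == x_target:
        kept.append(p)                              # condition on X = x

empirical_mean = sum(kept) / len(kept)
posterior_mean = (x_target + 1) / (n + 2)           # mean of Beta(x+1, n-x+1)
print(round(empirical_mean, 3), round(posterior_mean, 3))
```

The empirical mean of the retained \(p\) values lands right on the Beta posterior mean, which is the point of the example: the physical rolling of \(W\) plays the role of the prior.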

When I was first exposed to Bayesian statistics, I was excited about the flexibility for fitting models.  While I was aware that accepting these things called prior distributions had some epistemological implications, it did not bother me.  I viewed (and still view) the specification of prior belief as a distribution over parameters as a valuable way of making the researcher's thoughts on the problem explicit.  After all, if your analysis says what you want it to, but you can't justify your prior distribution to your peers, you won't have much luck convincing them of your result. 

What I missed was that a prior distribution can be a way of conditioning on belief, experience, or observed events.  This makes the prior distribution even more valuable because it can encode belief about a process rather than just a belief about the distribution.  For example, if one assumes that the interval \([0,1]\) is very long and the billiard ball won't make it to the other end, some sort of decaying distribution on \(p\) would make more sense than the uniform.  Robert briefly describes the rationale for using the prior distribution as follows:
... the use of the prior distribution is the best way to summarize the available information (or even lack of information) about this parameter as well as the residual uncertainty.
I remember this point from my mathematical statistics course but it's not surprising that it didn't strike me as especially important when I was dazzled by the fact that I just had to come up with a (log) posterior and a (log) prior and learn a couple of (simple!) algorithms to get answers out of my data. 

Ce n'est pas un "phi"

\[\phi\]

Columbia Workshop: Samuel Kou

This post is very much about transitions.

Samuel Kou's talk was about a very important applied biology problem: how does one predict the three-dimensional structure of a protein from its chemical structure?  This is an interesting problem because it is quite important, and initially you would think that science, in its current state, could provide some pretty good general answers. 

If we could get the three-dimensional structures of proteins in a quick way from their chemical structure it would be a great practical advance.  For example, there are a number of protein mis-folding diseases, the most famous of which is Creutzfeldt-Jakob disease and its bovine cousin, "mad cow" (Bovine Spongiform Encephalopathy, BSE)[1].  Understanding the structure of a wide variety of prions (the mis-folded proteins) would definitely inform our understanding of how these diseases progress and perhaps how to interfere with that progression.  We could study the transitions triggered by prions which cause normal proteins to take on the pathological shape.  Carrying out the studies in silico would give us the opportunity to look for molecules which might interfere with the mis-folding process.


Conceptually, the road from knowing the chemical structure of a molecule made up of branching chains to knowing how it folds in on itself to get a 3-D structure is pretty easy.  Let's say you mix a little oil with a little more water, shake it up a little, and watch what happens.  First the oil will be in small drops, then the drops will start to rise and their density nearer to the surface will increase (oil floats in water).  As the oil drops come in contact, they will sometimes merge into a bigger oil drop (once there's a bit of contact between two oil drops in water, what you have is a single very oddly shaped oil drop; drops tend toward a spherical shape, so the single oddly shaped drop gets reshaped to minimize the amount of contact with the water).  It's like going downhill, to a lower energy state: a transition. 

With atoms there are also some configurations that are more stable than others because, depending on which atoms are next to each other, you get a lower or higher energy.  Things tend to go to a lower energy (e.g.-people fall from ladders).  Physical chemists have gotten pretty good at calculating the energy of a certain configuration of atoms[2] so once we have two alternate structures for a protein we can figure out which one will be favored.  Bam, problem solved... sort of. 

The trouble with big proteins is that we can't just compare two alternate structures.  We would have to compare a lot of them.  I think the key result of combinatorics for the practicing scientist is that the number of possible combinations of \(k\) objects, call it \(N\), gets very big very fast.  If you could enumerate all the possible combinations, map them to the x/y plane, and plot the energy of each configuration on the z axis, you would get something like figure 1.
Figure 1: big ol' molecule, location on x/y plane indicates the configuration, z axis indicates energy.  The lowest energy configuration is at the bottom of the pit.  Graphic from [3].

That well at the very bottom is the lowest energy configuration: it's the bottom of the hill.  One way to avoid the enumeration problem is to take some starting configuration and, instead of comparing it to all other possible ones, just compare it to some similar ones and choose the best configuration.  Then repeat.  Ad nauseam.  With some luck (e.g.-you don't get stuck in a local minimum), and after enough time, the answer will be the lowest energy configuration.  So how do you guess at similar shapes to define those possible transitions?
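To make the "compare to neighbors, keep the best, repeat" recipe concrete, here's a toy sketch in Python (my own illustration, nothing to do with FRESS; the energy function and step size are invented):

```python
# Toy version of "compare to similar configurations, keep the best, repeat":
# a plain greedy descent on a made-up 1-D energy landscape (nothing like
# FRESS; the function and step size are invented for illustration).
import math

def energy(x):
    # bumpy landscape: global minimum near x = -0.5, local minima elsewhere
    return x * x + 3.0 * math.sin(3.0 * x)

def greedy_descent(x, step=0.01, max_iter=100_000):
    for _ in range(max_iter):
        # the "neighborhood": just the two configurations one step away
        best = min((x - step, x, x + step), key=energy)
        if best == x:            # no neighbor is better: stuck in a minimum
            return x
        x = best
    return x

# Different starting points settle into different minima -- the trap a
# sampler has to avoid.
print(round(greedy_descent(-3.0), 2), round(greedy_descent(2.0), 2))
```

Both starting points get stuck in local minima and neither finds the global minimum near \(-0.5\), which is exactly why the neighborhood-proposal step matters so much.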

Kou called his solution the Fragment Regrowth Energy-guided Sequential Sampler (FRESS)[4], which uses information about known protein structures from the Protein Data Bank (hurrah for open data!) as well as a few other tricks to choose a fragment of the current structure, delete it, and replace the atoms into the chain in the same order but in slightly different locations.  Based on the talk I gathered that it works well and I can see why.  It combines multiple sources of information about what combinations are likely, it modifies small parts of the molecule at a time, and it efficiently deals with the oddities of the sample space.  It's an amazing piece of work, but this post isn't about that.  It's about transitions.

What I wanted to write about is this: Kou stated that the goal of the sampler is to find the bottom of the well, the lowest energy state, because that's the conformation that the protein takes on in nature.  Other statisticians in the room wondered at this property of protein folding in nature, one even (jokingly?) brought up God[5].  This issue came up because one of the key difficulties with a sequential sampler like Kou's is that if it's only considering a small neighborhood in protein-conformation space around the current conformation it might easily get stuck in a local minimum.  It would never find the true global minimum.  The same could be true in nature: either some or all of the proteins folded in an organism might end up at either one or several local minima. 

In nature I can think of two things which influence how, starting from the unfolded state, a protein tries out new configurations.  The first is random motion, and the neighborhood around the current configuration explored by this motion is controlled primarily by the temperature.  We typically know the temperature range a protein needs to fold in, and we can probably calculate what size of moves it makes.  The second factor is a class of proteins called chaperone proteins (chaperonins).  These proteins bind temporarily to an unfolded protein and guide the folding process.  The interaction results in a faster exploration of a particular part of protein-conformation space than would otherwise be possible.  It might overcome barriers to folding which would otherwise be insurmountable.  It's not hard to imagine that a chaperonin might lead another protein to fold into a local-minimum configuration.  As long as the local minimum was deep enough that the protein stayed trapped there at its usual temperatures, it could function without ever folding to its global minimum.  I don't really know enough about this area to say whether the majority of proteins go to a local or global minimum, but if it hasn't been addressed before it would be an interesting question from an evolutionary perspective[6]. 

Thinking of prions and transitions again, the healthy and pathological shape of the protein are not thought to be chemically different.  It's possible that the only difference is the conformation.  In that case, it would be very interesting to know which one is the local versus global minimum conformation, and what folding steps are involved in the transition.  This is actually the sort of question which might be addressed by FRESS (or any other sequential sampler for protein folding) which made me very excited to hear about the details of the algorithm and see how much of an improvement it was on the state of the art.



[1] Lovely, accessible post on both @LSF
[2] Yes, I am being charitable.
[3] http://media.caltech.edu/press_releases/13230
[4] http://jcp.aip.org/resource/1/jcpsa6/v126/i22/p225101_s1?isAuthorized=no
[5] To be fair, Kou did state that, as far as he knew, the structures confirmed by x-ray crystallography have all been global minima (though that's hard to verify).
[6] and it might even help in figuring out good proposal distributions for a sampler like FRESS.

Wednesday, November 23, 2011

Christopher Hitchens, Rick Santorum

"...the gods that we have made are exactly the gods you'd expect to be made by a species that's about a half a chromosome away from being a chimpanzee."

One of the things I enjoy very much about hearing Christopher Hitchens speak is his capacity to produce pithy phrases.  The above is paraphrased from a "Hitchslap" youtube.com video and it's both a nice soundbite and a good argument. 

Think of religion as a crime scene, humanity as the potential criminal, and ask "well, did they do it?"  The motive is obvious: religion allows certain people to consolidate authority, serious financial benefits accrue to certain people at the center of the religion, and it allows the elite of the religion to manipulate standards of common decency and morality which might otherwise limit the willingness of the masses to follow their "guidance".  Sure, plenty of despots over the years have managed to accomplish these things without relying on established religions, but for the rest it's shown itself to be quite handy.

Beyond the motive, human fingerprints are all over the products of religion.  For one thing, religious rules are the sort of generally nasty, controlling, misogynistic mess you would expect from works of convenience written over decades by gangs of fearful male politicians clutching onto power.  It's the sort of thing Rick Santorum or Michele Bachmann might come up with.  Of course Rick Santorum is not one to want to reinvent the wheel (not sure if he could have done it in the first place) and he just wants to go back to the good old days:

Figure 1: Rick Santorum showing himself unfit for the presidency of the United States of America by being incapable of taking the oath of office.

Really, Rick?  Why don't you just go run for president of Iran instead.


Tuesday, November 22, 2011

Egypt

If you pay attention to the world, you could probably see the recent scenes from Egypt coming.  It's hard to get rid of a police state.  The barrier is not just the leadership, the barrier is made of all the underlings frightened of the day when light will shine on the police files.  The barrier is made of people who go to work everyday to participate in the machinery of repression.  The barrier is the gang of proto-dictators waiting in the wings.  The barrier is the economic interest of the remaining political elite, in this case the military.  In addition, it was pretty clear from the beginning that Egypt's military was quite happy with its position of political and economic power.

May the second revolution go as well as the first; that's about the best you can expect.

Sunday, November 20, 2011

Re: Womanspace

Oh Nature, just admit you made a mistake publishing that worthless piece of fiction and move on with being a GlamourMag.

In other news, scientists discover ImmigrantSpace, a parallel dimension accessible only by immigrants trying to learn English and escape from violent gangs in the post-apocalyptic landscape of America's big city public schools.

Saturday, November 12, 2011

Penn State football hides child rapist

Note: this is not a gently written post, it may be upsetting to some.  I also link to the grand jury document which will be upsetting to all.


I care about academic institutions because I am a graduate student and I plan on competing for a faculty job one day.  I care about education in general because it can be a great social good, and I care about education as the parent of two small children.  Before starting my current degree program I also worked in a number of labs which made important societal contributions.  In many ways it's a good "business" to be in and very satisfying, but I also have no illusions about the institutional politics of the Academy.  People in academia do unethical things and one of the downsides is that they're all quite good at rationalizing it.  It's that kind of environment, we select for it.

After a grand jury investigation, three men have been charged with either committing child rape or hiding it.  An assistant football coach, Mike McQueary, witnessed the rape of one of the boys and only reported it internally.  Then the otherwise highly respected football coach Joe Paterno also only reported the incident internally.  After these men realized that the internal process didn't reach its appropriate conclusion, neither chose to go to the police independently.

There is a lot in this story to be angry about.  First of all, in 2002, McQueary had the option of stopping the rape himself, or of calling 911 and having the police intervene.  It was a basic failure of ethics on his part that he could not muster himself to any meaningful action.  His second chance came later, after he (must have) realized that there were no serious consequences for the rapist.  He had at least seven years to consider this possibility, but never made the right decision.  McQueary (after consulting his father!) reported the rape to Joe Paterno the next day (!) and was involved in a further meeting with perps #2 and #3 from the grand jury investigation (Curley and Schultz) where he again related his account of the rape.  The consequences to the rapist were minimal (no locker room key for you, Mr. Rapist Sandusky!) and the rape was never reported to the police.

In the end it was the original victim who, years later, relayed the story to his mother.  She notified the school district in 2009, the district notified the police, which led to the grand jury investigation.  It turns out that the mother and the victims who were willing to speak to the grand jury were quite brave.  The rest?  Cowards.  Men in power, three of them at the core of a sport which glorifies testosterone and aggression, completely lacking cojones and drifting through life without a functioning moral compass.

It also makes me angry that I can imagine the meeting McQueary had with Curley and Schultz.  I don't know if he was inclined to report the rape to the police before the meeting, but I'm sure soothing words were deployed about taking care of the "situation" internally and quietly.  Like in other big institutions, it's quite easy to get blacklisted in an academic setting.  People like McQueary work for years to get into the sort of career they want, all the while assiduously avoiding making enemies.  Once you've made that sort of investment it's hard to decide to risk it on your morals.  The investment itself is corrosive.

You might have months or years where the worst you witness is somebody stealing pipette tips from your lab, or trashing your paper in review only to publish first on the same topic.  Sooner or later something real will go down with you in the middle of it, and when it does, please remember that keeping your humanity really is worth risking your PhD/grant/reputation.  Consider it as an extra incentive to network broadly.

After stewing on this for a few days, I discovered that Jon Stewart put it quite succinctly, and is quite effective at conveying his anger as well.



Sunday, October 16, 2011

The unholy alliance between \(\phi_t\) and AIC

In the wild world of mark-recapture analysis, there is a well-worn path to publication.  Two of the key parameters are the survival and recapture probabilities in the marked population[1].  One of the mile-markers is comparing two models for each parameter: either the parameter is constant, or it is different for each interval (survival) or occasion (recapture).  Sometimes an enlightened paper will skip this comparison, but there's really no guarantee.
Clearly, it's crucial to know whether the survival rate varies over time.

If you were looking at this question using hypothesis testing, you would take the following as your null hypothesis.

\[H_0 : \phi_1 = \phi_2 = \cdots = \phi_K\]
The alternative would be that each \(\phi\) is independent at each time step, or something in between.  If you're using hypothesis testing, this brings up, front and center, the problem that \(H_0\) is crazy.  No ecologist would honestly accept that as a null hypothesis.  Survival varies over time, end of story.  I'm pretty sure that the reason this test comes up is that MARK has a button for it.

Beyond the fact that the null hypothesis is crazy[2], there is a related problem: small mark-recapture studies, and most are small, will lack the power to find important temporal variation in survival[3].  If biologists were willing to give their theory a little bit more weight in deciding the issue, it would be sensible to take \(\phi\) to have some distribution over time (a "random" effect) and just deal with the estimates.  Even if the estimates were nothing special, at least it would save the world from the umpteenth discussion of whether \(\phi\) varies over time.
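For a sense of scale, here is a rough power simulation (a Python sketch of my own; the survival rates and sample size are invented, and independent binomial survival trials stand in for a real mark-recapture likelihood).  Survival genuinely varies across four intervals, yet the likelihood ratio test of \(H_0\) rarely notices:

```python
# Rough power check: survival truly varies across four intervals, but with
# 20 marked animals per interval the likelihood ratio test of
# H0: phi_1 = ... = phi_4 rarely rejects.  (Invented numbers; independent
# binomial survival trials stand in for a real mark-recapture likelihood.)
import math
import random

def loglik(successes, trials, p):
    if p <= 0.0 or p >= 1.0:                 # boundary MLEs contribute 0
        return 0.0 if successes in (0, trials) else float("-inf")
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)

random.seed(7)
phi_true = [0.55, 0.70, 0.60, 0.65]          # real temporal variation
n_marked = 20                                # animals at risk per interval
crit = 7.815                                 # chi-square(3 df) 0.95 quantile
n_sim = 2000
rejections = 0
for _ in range(n_sim):
    y = [sum(random.random() < phi for _ in range(n_marked))
         for phi in phi_true]
    pooled = sum(y) / (4 * n_marked)                            # H0 fit
    ll0 = loglik(sum(y), 4 * n_marked, pooled)
    ll1 = sum(loglik(yi, n_marked, yi / n_marked) for yi in y)  # free phis
    if 2.0 * (ll1 - ll0) > crit:
        rejections += 1
print("rejection rate:", rejections / n_sim)
```

With these numbers the test rejects well under half the time, so a "comparable" constant-survival model in an AIC table says more about the sample size than about the biology.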

To further muddy the waters, these days the comparison is carried out using AIC rather than hypothesis testing.  I like ranking a set of well-fitting models as well as anybody else, but in this case it just masks the fact that the study was too underpowered to even contemplate the comparison to begin with.  At least when a hypothesis-driven comparison is not significant, most biologists know that you shouldn't make a big deal of it without a power analysis[4].  With a table of AIC values, an inappropriate comparison is harder to spot.  One indicator is a slew of "comparable" models which ought to remind the reader that the models are only "comparable" in light of the data on hand. 

Now back to that paper...


[1] Survival in the marked and unmarked populations can be quite different; I'm sure hawks love mice with neon marker on their back.
[2] Come on, there's survival variation when you look within a single manufacturing lot of CPUs, forget frogs in the wild.
[3] I just read one in a reasonable journal where the 95% confidence intervals basically stretched from 0.5-0.95, and the _title_ of the paper relied on the statistical lack of difference.
[4] Through the hard work of generations of biostatisticians, most biologists will recognize that making a big deal from a lack of significance is questionable.

Thursday, October 13, 2011

The Shotgun Approach to Hypothesis Formation


It's like going into a dark basement in a zombie movie.  You know there are zombies down there, so you might as well start shooting.  Each shotgun blast illuminates more zombies for you, and the process continues.  This is why academics start on a topic and continue until it sucks all the life out of them.  To continue the analogy, if you die at your desk it's because you ran out of bullets.

Sunday, October 9, 2011

If ... then ...

If P then Q is only false if P and not Q; if not P, then whatever.  Who knew?

It does help to get these things straight from a blog post if you don't breathe math in your career.
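In code, the whole truth table is a few lines (a Python sketch):

```python
# Material implication: "if P then Q" is false only when P holds and Q fails;
# it is vacuously true whenever P is false.
def implies(p, q):
    return (not p) or q

for p in (True, False):
    for q in (True, False):
        print(f"P={p!s:5} Q={q!s:5}  P -> Q = {implies(p, q)}")
```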

Saturday, October 8, 2011

Discuss


 system.time(while(sum(runif(10^6,0,10^(-316)) == 0) == 0 ) {})

My Apologies

My apologies to the audience.  As a graduate student in the middle of writing up paper #1, this blog is taking a back seat to the giant simulation and analysis-fest that is my life.  Further posting will resume when the beast is sent off to co-authors or XKCD #303 applies.

Friday, September 30, 2011

JAGS, dinterval


Edit: I posed this question to Martyn Plummer and got a very useful response. I'll post it once I figure out the details.


One of the hazards of using something like JAGS, with complicated innards written by someone else, is that you don't have much of a chance to see problems coming ahead of time.  I happen to have a continuous process which is observed only in discrete units, and it happens to mimic the dinterval distribution in JAGS.  Unfortunately I recently had to change the number of breaks used by dinterval by an order of magnitude, and it turns out that sampling bogs down fairly quickly.  Here's an example:


cat('
model{
    for ( i in 1:N) {
        x[i] ~ dunif(0,10^K)
        o[i] ~ dinterval(x[i],breaks)
    }
}
', file = './dintervalProblem.bugs'
)

library(rjags)
mod <- list()
samp <- list()
timings <- list()
k <- c(3,4,5)
for ( i in 1:length(k) ) {
    N <- 100
    K <- k[i]
    x <- runif(N,0,10^K)
    o <- ceiling(x)
    breaks <- seq(0,10^K,1)

    timings[[i]] <- system.time(
        {
            mod[[i]] <- jags.model(
                file = './dintervalProblem.bugs',
                data = list(
                    N = N,
                    K = K,
                    o = o,
                    breaks = breaks
                ),
                inits = list(
                    o = o,
                    x = x
                ),
                n.chains = 1,
                n.adapt = 1000,
                quiet = FALSE
            )

            samp[[i]] <- jags.samples(
                model = mod[[i]],
                variable.names = c('x','o'),
                n.iter = 1000,
                thin = 1
            )
        }
    )

}

My digging in the JAGS internals hasn't turned up anything yet which could be responsible for this behavior.  The value of the dinterval node is calculated by:


static unsigned int value(vector<double const *> const &par, unsigned int ncut)
{
    double t = T(par);
    for (unsigned int i = 0; i < ncut; ++i) {
        if (t <= CUTPOINTS(par)[i])
            return i;
    }
    return ncut;
}


It's not like there's a separate node for every breakpoint... though the number of nodes generated in the graphs does go up with K.
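Whether or not it is the culprit, one thing does stand out in that loop: it scans the cutpoints linearly, so each evaluation of the node costs O(ncut), and with breaks = seq(0,10^K,1) that is 10^K comparisons per observation per iteration.  Since the breaks are sorted, the same lookup could be done by binary search.  A sketch of the idea in Python (mine, not JAGS code):

```python
# The value() loop above scans all cutpoints until t <= cutpoint[i].
# For sorted cutpoints, bisect_left returns the same index in O(log ncut).
import bisect

def interval_linear(t, cutpoints):
    for i, c in enumerate(cutpoints):
        if t <= c:
            return i
    return len(cutpoints)

def interval_bisect(t, cutpoints):
    # first index i with cutpoints[i] >= t, i.e. t <= cutpoints[i]
    return bisect.bisect_left(cutpoints, t)

breaks = list(range(0, 10**5 + 1))        # like breaks = seq(0,10^K,1)
for t in (0.5, 1.0, 42.7, 10**5 - 0.5):
    assert interval_linear(t, breaks) == interval_bisect(t, breaks)
print("linear and binary lookups agree")
```

If the linear scan is what's hurting, this would turn each dinterval evaluation from 10^K comparisons into roughly 3.3K of them.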

Suggestions?  (Especially all those folks coming here from Andrew Gelman's blog...)

Thursday, September 29, 2011

Columbia Workshop, Katia Koelle would've been a good participant

I've been wracking my brain for other non-obvious applications of the Euler characteristic to biological data and I was reminded of some work[1] done by Katia Koelle.  My understanding of the work is that she sought to characterize whether a virus evolved in the face of herd immunity by accumulating a wide diversity of antigenic variants or whether it would tend to avoid herd immunity via replacement of older antigenic variants by newer ones.  On a quick re-reading of her paper (I've seen this work presented at ESA twice), I don't think there is much of a connection with Robert Adler's pursuit of stochastic algebraic topology, but even if there was I'm probably not the person to see it.
Figure 1: Katia Koelle's opening slide at the 2009 ESA in ABQ, NM looked more or less like this.  I found it very disturbing every time somebody sneezed during her presentation.
On the other hand, I've found this work to be an impressive combination of practical insight and the sort of mathematical and statistical detail that the Columbia workshop was targeting.  It's a shame nobody thought to invite her.

-----------------------------------------------
[1]Koelle, K., Ratmann, O., Rasmussen, D.A., Pasour, V., Mattingly, J. (2011) A dimensionless number for understanding the evolutionary dynamics of antigenically variable RNA viruses. Proceedings of the Royal Society, Series B.

Wednesday, September 28, 2011

Columbia Workshop, Robert Adler

Warning: my understanding of these talks was at times pretty thin and any errors in discussing the topics are almost certainly mine.

Robert Adler's talk focused on uses of the Euler characteristic to summarize complex shapes such as the brain in 3-D.  The fact that a number derived by algebraic topologists would nicely do that sort of thing for you is not a big surprise---I gather that's one of the goals of the field[1].  It was interesting to hear the Euler characteristic defined in terms of the pattern of adding and subtracting alternating features of a shape because it reminded me of a children's book I recently read to my son which uses a number like that as part of a puzzle.
Figure 1: Algebraic topology is everywhere!  My son now knows how to calculate the Euler characteristic for solid straight-sided shapes.
I went back and checked and it turns out the definitions are identical and indeed the Euler characteristic is defined in that book for polyhedra.

More interesting from a data visualization point of view was the presentation of a (I gather common) method for representing the components (e.g.- vertices, edges, and faces in the case of polyhedra) which make up a complex shape and how they are "born" and "die" when a descending threshold is applied to the object in one dimension.

The thresholding is best explained using a landscape (mountains, valleys, that sort of thing) in three dimensions.  The threshold is a plane at a set height in 3-D space, starting well above the landscape.  As the threshold descends it first contacts the highest points on the landscape and those features are "born" in the visualization (barcode), as it descends through the landscape there is a set of rules for when new components are "born" or "die".  Robert Adler was kind enough to mention some of these topological rules in passing to give us a sense of their flavor but he spared us the details[1]. 
Figure 2: Thresholding on a 4-D space with the resulting barcode (3 categories of bars).  Each bar represents the lifespan from birth (on left) to death (on right) of a feature. This could be applied to summarizing the brain. [2]
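In one dimension the descending-threshold bookkeeping is simple enough to write out directly.  Here is a toy sketch of my own (in Python, using the "elder rule" that the younger component dies at a merge; the real tools handle brain-scan dimensionality):

```python
# Toy 0-dimensional barcode for a landscape sampled along a line: sweep the
# threshold downward; a component is born when the threshold reaches a peak,
# and when two components meet, the younger one dies ("elder rule").
def barcode(heights):
    order = sorted(range(len(heights)), key=lambda i: heights[i], reverse=True)
    parent = {}                  # union-find over activated sample indices
    birth = {}                   # root index -> height at which it was born

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]    # path halving
            i = parent[i]
        return i

    bars = []
    for i in order:              # lower the threshold one sample at a time
        parent[i], birth[i] = i, heights[i]
        for j in (i - 1, i + 1):             # neighbors already activated?
            if j in parent:
                ri, rj = find(i), find(j)
                if ri != rj:
                    old, young = (ri, rj) if birth[ri] >= birth[rj] else (rj, ri)
                    if birth[young] > heights[i]:
                        bars.append((birth[young], heights[i]))  # young dies
                    parent[young] = old
    bars.append((max(heights), min(heights)))  # the oldest bar never merges
    return sorted(bars, reverse=True)

# two peaks (heights 5 and 3) separated by a valley at height 1
print(barcode([0, 5, 1, 3, 2]))
```

Each bar runs from the height of a peak down to the saddle where its component is absorbed by an older one; the tallest peak's bar runs all the way to the bottom.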
My understanding of the main point of the talk was that these barcodes were a convenient way of summarizing multi-dimensional shapes and they were currently being applied to problems such as the comparison of brain scans from Alzheimer's patients and matched controls.  According to Adler, the main difficulty with making these comparisons (not that it's stopping anybody) is that there is no good understanding of the statistical properties of these barcodes, and therefore no method for defining a null hypothesis or statistical model for the data (what would non-informative priors on barcodes look like, anyway?).

Obviously these techniques are applicable to biology since brain scans were one of the examples Adler discussed, but I wonder if there are other less obvious places where these approaches are relevant.  After all, the barcodes are generated by thresholding an n-1 dimensional object living in an n-dimensional space.  It doesn't have to be a physical object.  I assume that you need some local autocorrelation to get any coherent barcode (think of the autocorrelation of height in the 2-D landscape in 3-D space), but other than that there seem to be few requirements for crunching the numbers.  This is, of course, a discussion I should have tried with Dr. Adler while he was in front of me, but I hadn't digested the talk yet so I missed my chance.

I guess the question I need to spend some time with is: what other types of spaces can be summarized in this way?  The local autocorrelation in the thresholded values gives some clues but it's not very clear.  For example, can pedigrees be visualized this way (based on the continuity in genetic similarity)?  Phylogenies more generally?  Is there a point to that summary?


Figure 3: The Alma Mater at Columbia says "wha'dup!"


-----------------------------
[1] Hurrah for fields and sub-fields!
[2] http://webee.technion.ac.il/people/adler/research.html

The Flunge

In foil and epée fencing, the flèche is a beautiful maneuver used to quickly close distance when the attacker is fairly certain of getting a touch.  If it looks like the attacker is running (i.e.-upright), they're doing it wrong.
Figure 1: A nice successful flèche, in epée.*
In sabre fencing, crossing one's feet while moving forwards is against the rules (I hear the sport devolved into too much charging and slashing, so they eliminated the charging, kept the slashing.  I prefer it that way).  However, trying to get sabreurs to calm down is generally counter-productive, so the flunge quickly arrived on the scene.  Now the goal is to score the touch before your nose hits the floor.  It's still a rather artistic move.
Figure 2: A nice flunge, in sabre.  There really is no going back from that.
Figured that picture could use a few words.


----------------------------------
* Thank you: http://www.flickr.com/photos/adonovan/418225607/lightbox/

Tuesday, September 27, 2011

Columbia workshop, pre-pre-post

Note: a prelude to writing about two and a half days of math.

Much of ecology and evolutionary biology lives firmly in the world of small data[1] and the field has done quite well (thank you very much) with regression and [AN,MAN,AMO,ANC]OVA[2].  There's a pretty strong tradition of teaching good frequentist statistics and my own introduction to that, by Bill Bradshaw, was fairly complete (if a little brutal).  Bill's approach was to go step by step, and thoroughly cover the mechanics of each technique he presented. 

Bill's course had to be a little brutal because the mathematics requirements for biology are generally pretty low.  We biologists act as though algebra is frightening to students, linear algebra should only be brought up in passing and quickly swept under the rug, and calculus is beyond the pale.  Anyone who would like to avoid that general attitude is faced with the problem that most biology students are unprepared for moderately mathematical subjects.  Bill's solution was to go at it like his was the only course students had signed up for and expect everyone to keep up.  Much of the class looked something like this:
                               Figure 1: Bill Bradshaw leads the charge against the axis of evil (ANCOVA, ANOVA, and regression).
The general attitude Bill maintained was that you needed a solid statistical background to do ecology and there was no reason to doubt that a biologist could "get it".  Besides, as mathematically inclined[3] types are likely to tell you, the math you need to get into for the purposes of understanding basic statistics (in the mathematical sense), hardly counts as math. 
Figure 2: Peter Pappas says, "You see this? This is _NOT_ math."
The real barrier to introducing more advanced math to biology courses (you know, beyond spreadsheets) is that math builds on itself.  As a biologist I might first encounter it in a population ecology book as matrix multiplication, and my eyes would just conveniently glaze over until the little "eq. 1" on the right margin passed.  The fact is that a linear algebra class devotes significant time to simple mechanics, and it's not surprising that an untrained person will not just "get it" on the spot.  That means either raising the bar for course requirements or building the time for math into biology courses.  Unless your whole department is excited about raising standards, the second route is probably the only one that's open to an instructor.  You just have to go it alone.
Figure 3: The author, about to pwn calculus in front of an ecology class (without departmental support).

All this is to say that there's a gulf between biology and mathematics/statistics which is damaging to the quality of work in biology.  I think it also shows up in published work, but that's for another day (XKCD #303 only holds for so long).  I'm interested in advances in mathematics and statistics, and how they are relevant broadly to biology research (as well as my own work).  This led me to spend two and a half lovely days in NYC this past weekend attending a workshop at Columbia University which I intend to write about, once I've had the opportunity to digest the lectures a little more.

-----------------------------
[1] Yes, I know, NEON will change everything, as will Data Dryad, etc...
[2] You know, it's robust to stuff.
[3] The word "inclined" should have been linked to Peter Pappas, Vassar College professor of mathematics who enjoyed referring to calculus/linear algebra/multivariate calculus as "trivial" and "not math", much to the dismay of his captive students.  Can't find him right now though.

Public Service Announcement

You know those vials in the freezer? The ones that would really love to have a date with some propanol/hexane before getting shot into the HPLC? The fish scales sitting in envelopes begging to have their rings counted?  The bird feathers you really think you should spec just because they're there?  If you don't know what question you want to use them to ask, they're not data, they're a waste of time.

Thank you for listening to this announcement.

Had this been a real emergency, your PI would have run screaming into the room waving those vials in the air.

Wednesday, September 14, 2011

HPV vaccine

At the recent debate among hopefuls for the Republican presidential nomination, Michele Bachmann apparently made a series of ugly false claims about the safety record of the Human Papillomavirus (HPV) vaccine.  It was the usual drivel of fear-spasms calculated to appeal to a rabid anti-feminist sub-group of the electorate.

It was ugly because HPV kills: along with the initial, rather unpleasant* disease, HPV causes cervical, vulvar, vaginal, penile, anal, and oropharyngeal cancers.  In fact it causes most cervical cancers, which means that the HPV vaccine might be the first pharmaceutical which lets us effectively eliminate a cancer.  The HPV vaccine also has a shiny safety record, better than that of the Tylenol/aspirin/ibuprofen which I'm sure Bachmann regularly throws down her gullet.  So, here's the American Academy of Pediatrics (AAP) to set the record straight:
“The American Academy of Pediatrics would like to correct false statements made in the Republican presidential campaign that HPV vaccine is dangerous and can cause mental retardation. There is absolutely no scientific validity to this statement. Since the vaccine has been introduced, more than 35 million doses have been administered, and it has an excellent safety record.”
Furthermore, this only works to eliminate cancer population-wide if vaccination happens in the right time-frame, i.e.-before the onset of sexual activity.
“The American Academy of Pediatrics, the Centers for Disease Control and Prevention, and the American Academy of Family Physicians all recommend that girls receive HPV vaccine around age 11 or 12. That’s because this is the age at which the vaccine produces the best immune response in the body, and because it’s important to protect girls well before the onset of sexual activity. In the U.S., about 6 million people, including teens, become infected with HPV each year, and 4,000 women die from cervical cancer. This is a life-saving vaccine that can protect girls from cervical cancer.”
Dear parents: do it.  Dear everybody else: if you're considering voting for this woman, or any other Republican candidate, please convince them to engage in political grandstanding less likely to cause the death of women and the destruction of families.
---------------------
* Sorry if you clicked that.

Sunday, September 11, 2011

Three parameter t distribution (in JAGS, but not numpy.random)

I can't seem to find a three-parameter t-distribution in numpy/scipy, but I know JAGS has one.  It turns out it comes from the "jrmath" library, JAGS's copy of the standalone math library that ships with R.


/*
 *  Mathlib : A C Library of Special Functions
 *  Copyright (C) 1998 Ross Ihaka
 *  Copyright (C) 2000-2008 The R Development Core Team
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License as published by
 *  the Free Software Foundation; either version 2 of the License, or
 *  (at your option) any later version.
 *
 *  This program is distributed in the hope that it will be useful,
 *  but WITHOUT ANY WARRANTY; without even the implied warranty of
 *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *  GNU General Public License for more details.
 *
 *  You should have received a copy of the GNU General Public License
 *  along with this program; if not, a copy is available at
 *  http://www.r-project.org/Licenses/
 *
 *  DESCRIPTION
 *
 *    Pseudo-random variates from a t distribution.
 *
 *  NOTES
 *
 *    This function calls rchisq and rnorm to do the real work.
 */

#include "nmath.h"

double rt(double df, RNG *rng)
{
    if (ISNAN(df) || df <= 0.0) ML_ERR_return_NAN;

    if(!R_FINITE(df))
  return norm_rand(rng);
    else {
/* Some compilers (including MW6) evaluated this from right to left
  return norm_rand(rng) / sqrt(rchisq(df) / df); */
  double num = norm_rand(rng);
  return num / sqrt(rchisq(df, rng) / df);
    }
}


And from modules/bugs/distributions/DT.cc:


double DT::r(vector<double const *> const &par, RNG *rng) const
{
    return rt(DF(par), rng) / sqrt(TAU(par)) + MU(par);
}


Which gets you the three-parameter version, and suggests that the following NumPy code should work nicely:


import numpy as np

def student_t_3(loc, df, scale, size):
    tau = 1.0/(scale**2)          # JAGS-style precision
    t_draw = np.random.standard_t(df = df, size = size)
    scaled_t_draw = t_draw / np.sqrt(tau)
    return(scaled_t_draw + loc)


...and in fact generating a large sample from this, as well as from R, followed by a quantile-quantile plot gives some confirmation that this works.

This could be considered the historical-materialist approach to coding statistical functions.  I'm sure there is a reference text for this sort of thing somewhere but it's not the sort of thing which jumps to mind for a biologist.

It helps to know that numpy.random is coded expecting that everything is parameterized by scale parameters, whereas JAGS is coded expecting that (almost?) everything is parameterized with precision/rate parameters.
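Beyond the quantile-quantile plot, a moment check gives another quick confirmation.  This is just a sketch (the seed and sample size are arbitrary, and the function restates the one above so the snippet stands alone); the formula sd = scale * sqrt(df / (df - 2)) is the standard result for the t distribution and requires df > 2:

```python
import numpy as np

# Restating the post's function so this snippet stands alone:
def student_t_3(loc, df, scale, size):
    tau = 1.0 / scale**2                      # JAGS-style precision
    t_draw = np.random.standard_t(df=df, size=size)
    return t_draw / np.sqrt(tau) + loc

np.random.seed(1)
draws = student_t_3(loc=2.0, df=10, scale=0.5, size=200000)
# For df > 2 the three-parameter t has sd = scale * sqrt(df / (df - 2)).
expected_sd = 0.5 * np.sqrt(10.0 / 8.0)
```

The sample mean should sit near loc and the sample standard deviation near expected_sd, which they do for me.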

Laplace mixture, bad code

Hm.  Is this a reasonable implementation of a Laplace mixture? The try/except seems like an ugly way of dealing with the fact that int/float don't have a length, but it gets around testing for what they are...




import numpy as np

def laplace_mix(loc, scale, size, pvals = None):
    # Accept either scalars or sequences for loc/scale.  Catch only
    # TypeError here; a bare except would also swallow the length
    # mismatch raised below.
    try:
        dim = len(loc)
    except TypeError:
        dim = 1
        loc = [loc]
        scale = [scale]
    if dim != len(scale):
        raise ValueError("Different number of location and scale parameters.")
    if pvals is None:
        pvals = [ 1.0/dim for i in range(dim) ]
    indicate = np.random.multinomial(n=1, pvals=pvals, size=size)
    where = np.where(indicate == 1)[1]
    out = np.zeros(size)
    for i in range(size):
        out[i] = np.random.laplace( loc = loc[where[i]], scale = scale[where[i]], size = 1 )
    return(out)


I'm still curious if there is a way of avoiding the for loop via numpy magic, but this is not a performance bottleneck so...
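For the record, one way to drop the loop: draw the component labels in one call and then use fancy indexing on the parameter arrays.  A sketch (the vectorized function name is my own):

```python
import numpy as np

# A loop-free sketch: draw all mixture-component labels with
# np.random.choice, then index the parameter arrays so a single
# np.random.laplace call produces every variate at once.
def laplace_mix_vec(loc, scale, size, pvals=None):
    loc = np.atleast_1d(np.asarray(loc, dtype=float))
    scale = np.atleast_1d(np.asarray(scale, dtype=float))
    if len(loc) != len(scale):
        raise ValueError("Different number of location and scale parameters.")
    if pvals is None:
        pvals = np.full(len(loc), 1.0 / len(loc))
    comp = np.random.choice(len(loc), size=size, p=pvals)  # component labels
    return np.random.laplace(loc=loc[comp], scale=scale[comp])
```

Not that it matters for performance here, but it reads more like the model it implements.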

Feminism

A couple walks by and the man "play" punches the woman in the arm.  "Don't hit me in public." she says, clearly frustrated.

Tuesday, September 6, 2011

What the new country is coming to....

This was such an awesome idea the first time around, let's do it again!

What the old country is coming to...


Friday, August 19, 2011

It warms my heart

It really warms my heart when things go right in the world.  Twenty-eight years and $1 million is about right for a judge preying on vulnerable children.  This part gets me:
The sentence was four times the 87 months sentence that Ciavarella and federal prosecutors had agreed to when he pleaded guilty to charges in 2009.
The prosecutors had a good case against this guy and they wanted to agree to _what_!?!?!? Many thanks to the judge who threw out that plea-deal.

Saturday, August 13, 2011

Improving statistical understanding in biology: teh ANOVA, it's robust to stuff

Recently there was a blog post at Expansed about how R might improve the practice of statistics.  The basic idea is that when you have to face the code which implements your statistical analysis, you might think it through a little more than when clicking.  I think that logic fails pretty quickly, faster than you can say cut-and-paste.  The practice of cut-and-pasting R code for a whole analysis (or worse, BUGS code!) is common even among practitioners who have worked with those tools for years.  Sooner or later it catches up with you and you might learn something, but it's not inevitable.

For kicks, I coded up the following to express my dismay at the state of art:



files <- dir(pattern = "\\.csv$")
output <- list()
for ( f in files ) {
   dat <- read.csv(f)
   output[[f]] <- lm( y ~ ., data = dat )
   test <- summary(output[[f]])
   ps <- test$coefficients[,4]
   if (any(ps < 0.05)) {
      print(test)
      cat("Eureka! There's gold in ", f, ".\n", sep = "")
   }
}





In a few weeks I'll try to code up something that uses stepwise significance testing to do variable selection.  Now how long before this makes it into somebody’s analysis...? Drop me a line if you use it!


Tuesday, July 26, 2011

Digging into JAGS

I've heard people characterize Martyn Plummer's JAGS as "partially interpreted" or "interpreted".  This typically comes up as an explanation for why it's "slow".  I've always been puzzled by this claim: it's not interpreted!  The original specification in the BUGS language is compiled into a DAG using yacc/flex (I'm pretty sure).  The DAG is stored in memory with one node for each scalar quantity or each entry in a vector/matrix.  There are two important performance implications:

First, large models blow up in size fairly quickly since each vector/matrix entry has to be stored as a node.  It's a less efficient format than, for example, a plain matrix, but that's because each node carries additional information (e.g.-a name, which you are very thankful for when debugging).

Second, when updating parameters, the algorithm must walk the entire graph to complete one iteration.  That's a lot of steps and it does impact speed.  One could make more efficient samplers which update batches of related parameters, but that requires an algorithm to figure out how, based on an analysis of the DAG.  I assume that's what the GLM module must do in order to discover which parameters can be block-updated, but I haven't looked at that part of the JAGS C++ code yet.

Finally, one question which comes up frequently is whether JAGS can use multiple cores/processors.  The current answer is generally no, except when the underlying linear algebra library is multi-threaded.  Even when that is possible, and you have code like A %*% b, you might get a performance penalty just because the matrix multiplication is re-done every time an entry in 'b' is updated.  The best possibility I can see for multi-core JAGS (take it with a grain of salt, I'm just sticking my node into the C++ code) is to use the conditional independence in a complex model to hand out pieces of the DAG to separate cores for updating.

The way this currently works is that JAGS does a topological sort of the DAG such that if A|B, B is updated before A.  This is a partial ordering because if A|B and A|C, the updating order could be B->C->A or C->B->A.  I don't know if JAGS always comes up with a single order, or if it gets switched depending on the run.  This is where there's some opportunity for parallelism because B and C could be updated on separate cores, followed by updating A.  There would be some overhead involved in figuring out when to use multiple cores, and in parsing out pieces of the graph to different processes so it's not clear what sort of benefits we would get (e.g.-B and C could be sizable sub-graphs).  I know there are heuristic algorithms out there for solving this problem, and an illustration in R for a random DAG goes something like this:


library(igraph)

# Make a DAG
g <- barabasi.game(10)

# Show it
plot(g,layout=layout.kamada.kawai)

# Calculate which parameters each parameter is conditional on
parentage <- list()
for(i in 0:9) {
    parentage[[as.character(i)]] <- setdiff(subcomponent(g,i,'in'),i)
}

# Calculate group for calculation:
group <- vector(mode='numeric', length=length(parentage))
names(group) <- names(parentage)
for ( i in names(parentage) ) {
    group[i] <- length(parentage[[i]])
}

# Sort group from fewest to most parents:
sampleOrder <- names(sort(group))

# Split into groups for multi-core sampling:
sampleGroups <- split( x = sampleOrder, f = factor(sort(group)) )


The process goes something like this:
  1. spawn a process to calculate the density for each level-i node.
  2. check for results at level i; when results arrive, check whether
       any level-(i+1) node has all its dependencies satisfied.
    1. yes: start a process for that node, returning to #1 with i+1
    2. no: return to #2
It's a little more complicated because you don't necessarily have enough cores for all the nodes at one level, you need to group pieces of the graph to minimize interprocess communication, etc... but it's been done.
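To make the grouping idea concrete, here's a small Python sketch (names are mine, and the DAG is just a dict mapping each node to its parents) that computes longest-path levels.  If A depends on B then level(A) > level(B), so no two nodes in a level share an edge, and each level could in principle be handed to separate cores once the earlier levels finish:

```python
# Hypothetical sketch: a DAG as a dict mapping each node to the list of
# nodes it depends on (its parents).
def parallel_levels(parents):
    """Group nodes into levels by longest path from a root node; nodes
    within one level are mutually non-adjacent and can update in parallel."""
    level = {}
    def depth(n):
        # Level = 1 + deepest parent; root nodes (no parents) get level 1.
        if n not in level:
            level[n] = 1 + max((depth(p) for p in parents[n]), default=0)
        return level[n]
    for n in parents:
        depth(n)
    groups = {}
    for n, l in sorted(level.items()):
        groups.setdefault(l, []).append(n)
    return [groups[l] for l in sorted(groups)]
```

For the A|B, A|C example above, parents = {'A': ['B', 'C'], 'B': [], 'C': []} gives [['B', 'C'], ['A']]: B and C can go to separate cores, then A.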
  
Now if only I could win the lottery and pay somebody to do this...

Metropolis algorithm as an R closure

I've been wanting to understand closures in R a little better for a while and MCMC algorithms provide a nice use case.  In coding an MCMC algorithm, it's easy to write a function to take you one time step forward.  In most naive implementations what you see is a pile of code that manages the data and calls that function repeatedly.  The point of a closure is that you can wrap the code for tracking the data around the function which takes the MCMC steps and have a nice package which is easy to call repeatedly.  A quick implementation looks like so:


mcmcChain <- function(logTarget,proposal,cntrl) {
    if ( !all(c('nIter','inits') %in% names(cntrl)) ) {
        stop('\'cntrl\' argument must be a list with \'nIter\' and \'inits\' elements')
    }
    out <- matrix( data = NA, nrow = length(cntrl[['inits']]), ncol = cntrl[['nIter']] )
    out[,1] <- cntrl[['inits']]
    logD <- vector( mode = 'numeric', length = cntrl[['nIter']] )
    logD[1] <- logTarget(out[,1])
    logCurrent <- logD[1]   # log density at the current state of the chain
    logA <- vector( mode = 'numeric', length = cntrl[['nIter']] )
    prop <- vector( mode = 'numeric', length = cntrl[['nIter']] )
    chain <- function() {
        for ( i in 2:cntrl[['nIter']] ) {
            prop[i] <<- proposal(out[,i-1])
            logD[i] <<- logTarget(prop[i])
            # Compare the proposal to the current state (not to the previous
            # proposal) and accept with probability min(1, exp(logA)).
            logA[i] <<- logD[i] - logCurrent
            if (log(runif(1)) < logA[i]) {
                out[,i] <<- prop[i]
                logCurrent <<- logD[i]
            } else {
                out[,i] <<- out[,i-1]
            }
        }
        o <- list(
            sample = out,
            proposal = prop,
            logD = logD,
            logA = logA
        )
        return(o)
    }
    return(chain)
}

The assumptions for this code are that the log target density and the proposal function take their parameters in a vector and return a vector with the same structure, and that the proposal is symmetric (plain Metropolis, so no Hastings correction).  The output is a list with the sample, all the proposed values, the log density, and the log acceptance ratio.  It can be run to fit a normal distribution to three concocted values (0.1, 0, -0.1) like so; the sample is actually generated by the out <- tp() call.


logTarget <- function(mu) {sum(dnorm(x = c(0.1,0,-0.1), mean = mu, sd = 1,log=TRUE))}
proposal <- function(mu) { mu + runif(1,-0.1,0.1) }
cntrl = list( nIter = 1000, inits = 3)
tp <- mcmcChain(
    logTarget = logTarget,
    proposal = proposal,
    cntrl = cntrl
)
par(mfrow=c(2,3))
out <- tp()
plot( x = 1:cntrl[['nIter']], y = out$sample, pch = '*', xlab = 'iteration', ylab = expression(mu) )
plot( x = 1:cntrl[['nIter']], y = out$logD, pch = '*', xlab = 'iteration', ylab = 'log density')
plot( x = 1:cntrl[['nIter']], y = exp(out$logA), pch = '*', xlab = 'iteration', ylab = 'a')
hist( x = out$sample, xlab = expression(mu), breaks = cntrl[['nIter']]^(1/2) )
hist( x = out$logD, xlab = 'log density', breaks = cntrl[['nIter']]^(1/2) )
hist( x = exp(out$logA), xlab = 'a', breaks = cntrl[['nIter']]^(1/2) )


This is pretty flexible since you can substitute any target density and proposal density; in my own work, the first time I wanted to do something more complicated was in dealing with structured data (e.g.- mark-recapture data). Speed can be an issue with MCMC samplers written in R, but in any complex problem it is generally the calculation of the target density which matters most.  That could be implemented in C++ (using Rcpp) and used with the above code. Please let me know (to satisfy my curiosity) if you use this code.

Saturday, July 23, 2011

Monday, July 18, 2011

Minimalist MCMC, in R

Christian Robert recently posted a minimalist piece of R code implementing a Metropolis sampler for an arbitrary target function. Here's the code with a few edits:

it <- 10^5
target = function(x) (x>-2)*(x<2)*dnorm(x=x, mean=0, sd=1)
mcmc=rep(0,it)
for ( t in 2:it ) {

   prop = mcmc[t-1] + runif(n = 1, min = -0.4, max = 0.4)
   if ( runif(1) < target(prop) / target(mcmc[t-1]) ) {
      mcmc[t] = prop
   } else {
      mcmc[t] = mcmc[t-1]
   }
}
hist(mcmc)
plot( x = 1:it, y = mcmc, pch = '*' )




This was just throw-away code used to answer a simple question about MCMC, but it shows that the code can be pretty minimal.  People writing their first MCMC samplers, even if they are familiar with programming, often make a predictable series of mistakes, and writing some example samplers which deal with those problems in this minimal format might be a useful way of guiding them through it.

Some of the easily avoided problems which trip people up:
  1. Trying to work with densities, rather than log-densities.  The product of many densities underflows fairly quickly.
  2. Failing to deal with boundaries of parameter space (the sampler above _does_ deal with them in the target function).
  3. Ignoring obvious caching opportunities.
Those are probably enough goals for minimal MCMC code for now.
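Point 1 is easy to demonstrate.  A small Python illustration of why raw densities fail while log-densities do not: with enough data points the product of densities underflows to exactly 0.0 (which kills the acceptance ratio), while the equivalent sum of log-densities is a perfectly ordinary number.

```python
import math

# 1000 standard-normal density evaluations, all at x = 0.1.
xs = [0.1] * 1000

# Multiplying raw densities: the running product sinks below the smallest
# representable double and underflows to exactly 0.0.
dens = 1.0
for x in xs:
    dens *= math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

# Summing log-densities: the same quantity stays well within range.
logdens = sum(-x * x / 2 - 0.5 * math.log(2 * math.pi) for x in xs)
# dens is now 0.0, while logdens is about -924.
```

A thousand data points is not even a large problem, which is why this failure mode shows up so early.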

Saturday, July 16, 2011

Histogram in Terminal

I'm sure there's an R package with this kind of thing out there, but I didn't find it.  Often when working with MCMC on a server over ssh, you need to get a rough idea about parameter distributions and convergence.  This code outputs a reasonable histogram on the command line without needing graphics.  Something similar for traces would be nice.



    histo <- function(x,...) {
        o <- hist(x, plot = FALSE, ...)
        w <- options()[['width']]
        labels <- signif(o[['mids']],2)
        pad <- max(nchar(labels))
        densities <- o[['density']]/max(o[['density']]) * (w-pad)
        for ( i in 1:length(densities)) {
            padI <- pad - nchar(labels[i]) + 1
            if ( labels[i] >= 0 ) {
                cat(' ', labels[i], rep(' ',padI-1), '|', sep='')
            } else {
                cat(labels[i], rep(' ',padI  ), '|', sep='')
            }
            cat(rep('*', round(densities[i])), sep='')
            cat('\n', sep='')
        }
        return(o)
    }

    h <- histo(rnorm(1000), breaks = 30)

The length of the histogram bars (the plot is rotated, so the bars run horizontally) comes from the terminal width in the options, so if it's printing too wide for you, set "options(width = ???)" to something that works.
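As for the trace plot I wished for above, here's a rough sketch of the idea in Python (the function name and layout are my own invention): downsample the chain to the terminal width and put one '*' per column at the scaled parameter value.

```python
# Hypothetical sketch of a terminal trace plot: one column per (downsampled)
# iteration, one '*' per column at the value's scaled height.
def ascii_trace(xs, width=60, height=15):
    lo, hi = min(xs), max(xs)
    span = (hi - lo) or 1.0            # avoid dividing by zero for flat chains
    step = max(1, len(xs) // width)
    cols = [xs[i] for i in range(0, len(xs), step)][:width]
    grid = [[' '] * len(cols) for _ in range(height)]
    for j, v in enumerate(cols):
        r = int((v - lo) / span * (height - 1))
        grid[height - 1 - r][j] = '*'  # row 0 is the top of the plot
    return [''.join(row) for row in grid]

for line in ascii_trace([float(i % 17) for i in range(1000)]):
    print(line)
```

A real version would want axis labels like the histogram above, but even this much is enough to spot a chain that hasn't converged.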

Tuesday, June 14, 2011

Immigration Debate

Could we stop ruining peoples' lives for the sake of the Republican presidential primaries and pass some immigration reform already?

p.s.-Sorry America, foreigners write some of the best lyrics:

Saturday, June 4, 2011

Dear Pandora

Dear Pandora,

Why do we need remixes of Kesha songs?  They start out staccato spastic and adding a techno beat just doesn't do much to save them.

Sincerely,

Dodo

Why do ecologists still use Excel for statistics?

Thanks to John Cook for linking to an article on the statistical shortcomings of Excel (rampant and unrepentant):
The random number generator has always been inadequate. With Excel 2003, Microsoft attempted to implement the Wichmann–Hill generator and failed to implement it correctly. The “fixed” version appears in Excel 2007 but this “fix” was done incorrectly. Microsoft has twice failed to implement correctly the dozen lines of code that constitute the Wichmann–Hill generator; this is something that any undergraduate computer science major should be able to do. The Excel random number generator does not fulfill the basic requirements for a random number generator to be used for scientific purposes: (1) it is not known to pass standard randomness tests, e.g., L’Ecuyer and Simard’s (2007) CRUSH tests (these supersede Marsaglia’s (1996) DIEHARD tests—see Altman et al. (2004) for a comparison); (2) it is not known to produce numbers that are approximately independent in a moderate number of dimensions; (3) it has an unknown period length; and (4) it is not reproducible. For further discussion of these points, see the accompanying article by McCullough (2008); the performance of Excel 2007 in this area is inadequate.
This makes me want to design an undergraduate course in ecological statistics where each homework triggers one of these errors (some assignments might have to trigger more than one to get through the whole list in a single course)!

Wednesday, June 1, 2011

Chainsaw on a rope swing


John Cook quotes Merlin Mann:
If a project doesn’t have an owner, it’s like a chainsaw on a rope swing. Why would anyone even go near that?
By itself, that is a fantastic visual.  Personally, I try to run as fast as possible. Then he comments further:
Perhaps worse than a project with no owner is a project with a powerful owner who doesn’t care about the project. The project is important in the sense that the worker bees will be held responsible for seeing it happen, but not so important that it’s worth the owner’s time to help. “This is important for you to develop, but not important enough for me to take the time to tell you in any detail what it should be.”
Sure this is a problem in industry, but it's practically the definition of academia.

Sunday, May 29, 2011

Graduate Students are Employees

When PLS asked his readers "are graduate students employees?", I was surprised by the range of answers.  What struck me is that for a lot of people, this is an either-or kind of distinction.  It's not!  In the sciences, graduate students are primarily supported by one of two mechanisms---teaching undergraduates, or doing research for which their advisor acquired funding.  The relationship (often a contract) comes with responsibilities and benefits:

Responsibility: use more or less the specified amount of time to, you know, carry out the work.  It's more like a professional position than McDonald's, so you get to choose how some of it is done, but there are either products to be delivered, papers to be written, or science objectives to achieve.  If the work doesn't fit in the specified amount of time, find out how your supervisor would like you to prioritize.  If they won't prioritize, you get to do it for them!  If you do a shitty job, they don't have to hire you again.

Benefits: money! health insurance! Pay rent! pay doctor's bills!  Feed your children! Pay the babysitter!  Buy antibiotics when your child gets pneumonia!  These are _awesome_ adult things to do with money! You can't do it without the money!  If the money is a bonus to you, rather than a benefit, congratulations---you must be a trust-fund baby! 

When I say that graduate students are employees, I'm claiming these responsibilities and benefits for the time I commit to my RA/TA.  I feel strongly that this is actually good for science because being an employee, and acting professionally, is a good framework for being productive despite having different goals from your advisor/supervisor.

The funny thing about an RA/TA is that unless you are in that special situation where your research goals match up with your RA, you do have different goals from your advisor/supervisor.  You are still also a student and you have to meet your requirements, get through your candidacy exams, plan and execute your own research (project or dissertation).  While these things benefit your advisor, s/he is probably too busy to worry about them and they will not happen unless you push them.

If you don't develop mechanisms for negotiating the conflicting requirements, everyone loses.  Just look at all the horror stories in blog land.  Some people object to this view because the student benefits in terms of experience, name recognition, and publications.  The fact is that many professional jobs bring you those things and it's unsurprising that an academic RA/TA might.

Saturday, May 14, 2011

BUGS

I've recently been working with multiple models on different parts of a large data set, described within one BUGS file.  They have to be described in one file because they describe inter-dependent processes, for example growth and survival.  Both are described in BUGS as:

${A[i] \sim \text{dbern}(\ldots + q\times B[i])}$
${B[i] \sim \text{dbern}(\ldots)}$
The interesting part: in the parameter calculation for process B, we accidentally used a parameter intended only for process A.  Unlike most coding bugs in BUGS (as opposed to conceptual errors), this causes no complaints.

In our code this was an especially nasty bug because the shared parameter is one of a half-dozen which capture how a spline affects A.  I'm certain our results would have been biased, though the bias might only have surfaced later.  Tracking it back to the bug (if we even noticed) would have been a nightmare.

The moral is: read your code, carefully.  Even better, beg somebody else to read your code (carefully).




Wednesday, May 11, 2011

Twist it a little more, Cohen!

Andrew Cohen writes this about John Yoo:
Former government lawyer John Yoo taking credit on behalf of the Bush administration for Sunday's strike against Osama bin Laden is like Edward John Smith, the captain of the Titanic, taking credit for the results of the 1998 Academy Awards.
John Yoo is of course trying to do just that.