Monday, November 30, 2009

R farts: lists

I've been thinking about doing an R idioms post on indexing, but I need to talk about data structures to do that first... and I just can't get it together yet, so here's an R fart:

### Start with a list
aList <- list()

### Lists in R can hold vectors, of _unrelated_ types, with names 
### attached.  The list we made above is empty, and R will tell you 
### so if you ask:
print(aList)
str(aList)

### Once a list is created, we can add named objects to it like so:
aList[['bob']] <- 3
aList$frank <- 'five'
aList['3'] <- 'ten'
aList['4'] <- list(5)

### Each of these is neatly slotted in after the other and, as 
### expected, the list then has a length of four:
length(aList)

### The names of the objects can be extracted
names(aList)

### Either names or numbers can be used for indexing; either way,
### each element of the list is printed in turn:
for ( i in names(aList) ) {
  print( aList[[i]] )
}

for ( i in seq_along(aList) ) {
  print( aList[[i]] )
}

### Weirdness!
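### If you index with single brackets, you get the 
### elements of the list, each as a list of length 1.  
### There's a reason this happens: single brackets take a subset 
### of the list (which is still a list), while double brackets 
### extract the object stored in one slot.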
for ( i in names(aList) ) {
  print( aList[i] )
}
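
### A minimal sketch of that difference.  The list 'demo' and its 
### contents are made up for this illustration only:

```r
# Hypothetical list, invented for this example.
demo <- list(bob = 3, frank = 'five')

single <- demo['bob']     # a list of length 1 that keeps the name 'bob'
double <- demo[['bob']]   # the stored object itself, a numeric vector

str(single)
str(double)

# Single brackets can pull out several elements at once, so the
# result has to stay a list:
both <- demo[c('bob', 'frank')]
length(both)
```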


### You can add unnamed objects like so:
aList[[10]] <- 8

### Something funny happened:
print(aList)

### When an element is added to a list by numerical index N, and 
### the positions between the current end of the list and N do not 
### exist yet, they are filled with unnamed NULLs.
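
### A small sketch of the same effect on a fresh list ('gappy' is a 
### made-up name), with one way to find the NULL slots afterwards:

```r
# Hypothetical one-element list, invented for this example.
gappy <- list(a = 1)

# Assigning at position 4 creates positions 2 and 3 as NULL.
gappy[[4]] <- 'd'
length(gappy)

# The new slots get empty names.
names(gappy)

# Flag the slots that hold NULL.
isEmpty <- vapply(gappy, is.null, logical(1))
print(unname(isEmpty))
```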

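### Now something else funny happens.  The character index '5' is a 
### name, not a position, so this adds an eleventh element named '5', 
### and double-bracket assignment stores the whole list(5) in it: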
aList[['5']] <- list(5)

### To actually get at that unnamed five as a length 1 numeric 
### vector you have to say:
print(aList[['5']][[1]])
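
### A sketch contrasting the two kinds of assignment on a fresh 
### list ('demo' and its element names are made up for illustration):

```r
demo <- list()

demo['a']   <- list(5)   # single brackets: the 5 itself lands in slot 'a'
demo[['b']] <- list(5)   # double brackets: the whole list(5) lands in slot 'b'
str(demo)

is.numeric(demo[['a']])  # slot 'a' holds a plain numeric vector
is.list(demo[['b']])     # slot 'b' holds a list wrapping the 5
inner <- demo[['b']][[1]]

# A character index is always treated as a name, never a position:
demo[['5']] <- 'named five'
length(demo)
names(demo)
```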

### What's the use?  We'll do that next.


Wednesday, November 25, 2009

Race

Recently the discussion about whether it is appropriate to avoid the word 'race' when discussing human genetic differences flared up again in blog-land.  One argument is that the word has an exact biological definition and that, even if some uneducated people use it as an expression of racism, biologists ought not change the scientific usage of the word.  I agree that the scientific usage does not need to change, and that hand-wringing about being accused of racism for using the word 'race' is inappropriate.

The problem is that it's not just some uneducated people who misuse the word 'race'.  Across a good swath of the western world---granted I only have personal experience with the U.S., Poland, and France---people want to classify others into 'races' in order to make arguments about their ethical standards, personal motivations, intelligence, patriotism, and numerous other measures of human worth.  Many of the people who want to make these assignments are well educated and wealthy; some of them are not.  Their assignments are based on: 1) skin color; and 2) accent.  They are not interested in what the biologists think about the word 'race'.

This use of the word 'race' is much more common than the current biological definition, and it predates the current correct scientific usage.  Whether we like it or not, the historical usage of the word 'race' in science aligned much better with the current colloquial usage than with the current scientific usage.

In that context, as biologists who would like to communicate effectively, we will have to explain what we mean by the word 'race' every time we use it in public.  Do we have much to gain in communication with the public by using the word 'race'?  Maybe if we want to educate them about the concept of biological races, but I don't see that as one of the big challenges.  In most other circumstances we are better off talking about human genetic differences in more useful terms---more on this soon.

I do want the public to understand and adopt the biologist's notion of human races but, as a biologist, I have better things to do.

My recent inspiration on the topic:
  1. Human races, from Larry Moran
  2. The problem of race... again, from Larry Moran
  3. Confusion about human races, from Richard Lewontin
  4. Global Distribution of Genomic diversity..., from Adam Auton et al.
The last is a journal article (free pdf), and it contains a less historically encumbered view of human genetic diversity.

Tuesday, November 24, 2009

Judge Fail

"Kids for Cash" is an apt description of the (hopefully) last episode in the unfortunate legal careers of Michael T. Conahan and Mark A. Ciavarella Jr. Those two judges have the distinction of taking $2.6 million from Pennsylvania Child Care---a juvenile detention facility---in exchange for sending kids to it without the benefit of appropriate hearings. Not just without the benefit of hearings, but with the benefit of "pretend" hearings conducted by Ciavarella where the juvenile offenders were denied counsel.

In U.S. law, judges are partially protected from liability for decisions made on the bench, so the two perps in this case may only get part of the punishment they deserve for their actions. The best outcome, beyond what litigation brings, may be for the names Conahan and Ciavarella to become synonymous with the failure of judicial ethics.

Background:

- Juvenile Law Center page
- law.com 1, law.com 2

Monday, November 23, 2009

R Idioms: Counting observations

In the context of a data frame holding records of observations, with columns indicating grouping variables and/or measurements, one often runs into the need to count the number of observations.  There is no easy function in R that I know of for this, but 'aggregate' serves well enough.

Say we have a data frame ("data") with observations of plant heights measured over time.  The column names are: "individualId", "species", "variety", "height".  We could count the number of observations in each variety as:

aggregate( x = data[['height']], 
  by = data[c('species','variety')], 
  FUN = length )

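The use of the 'length' function was non-obvious to me: it counts the "height" values that fall in each group.  Note the use of single brackets in specifying the columns of "data" in the "by" argument, and the lack of a comma in that expression.  This is done because 'by' must be a list, and single brackets without a comma always return a data frame (which is a list).  Specifying data[,c('species','variety')] works for two or more columns, but with a single column the comma form drops the result to a plain vector, which 'by' rejects.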
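
Here is a minimal runnable sketch of the count above.  The data frame "plants" and every value in it are invented for illustration; only the 'aggregate' call mirrors the one in the post.

```r
# Hypothetical toy data, invented for this example only.
plants <- data.frame(
  individualId = 1:6,
  species = c('A', 'A', 'A', 'B', 'B', 'B'),
  variety = c('x', 'x', 'y', 'x', 'y', 'y'),
  height  = c(10.2, 11.5, 9.8, 20.1, 19.4, 21.0)
)

# One row per species/variety combination; the count ends up in column 'x'.
counts <- aggregate( x = plants[['height']],
  by = plants[c('species','variety')],
  FUN = length )
print(counts)

# Single brackets keep a one-column selection as a data frame (a list) ...
byOk  <- is.list(plants['species'])
# ... whereas the comma form drops it to a plain vector.
byBad <- is.list(plants[, 'species'])
```

The result column is called "x" because the measurements were passed in as a bare vector; it can be renamed afterwards with 'names'.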

We can of course calculate group means or variances in the same fashion, all of which becomes relevant when running statistical models.
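
As a sketch of that, swapping 'length' for 'mean' or 'var' is all it takes.  The data frame "plants" below is made up for illustration and is not the data discussed above.

```r
# Hypothetical toy data, invented for this example only.
plants <- data.frame(
  species = c('A', 'A', 'A', 'B', 'B', 'B'),
  height  = c(10, 12, 14, 20, 20, 26)
)

# Group means: same call as for counting, with FUN = mean.
means <- aggregate( x = plants[['height']],
  by = plants['species'],
  FUN = mean )

# Group variances: FUN = var.
vars <- aggregate( x = plants[['height']],
  by = plants['species'],
  FUN = var )

print(means)
print(vars)
```

Note that 'var' returns NA for any group with a single observation, so it is worth checking the counts first.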