On Tue, 1 Dec 2009 14:10:17 +1300 Rolf Turner <r.tur...@auckland.ac.nz> wrote:

> Consider the following:
>
> > set.seed(42)
> > ff <- factor(sample(c(1,3,5),42,TRUE),levels=1:5)
> > x <- runif(42)
> > tapply(x,ff,sum)
>        1        2        3        4        5
> 3.675436       NA 7.519675       NA 9.094210
>
> I got bitten by those NAs in the result of tapply(). Effectively
> one is summing over the empty set, and consequently (according to what
> I learned as a child) I thought that the result would be 0.

Note that this *is* actually documented on the help page for 'tapply',
in its description:

  Apply a function to each cell of a ragged array, that is to each
  (non-empty) group of values given by a unique combination of the
  levels of certain factors.

Basically (ignoring some details) you might expect 'tapply' to do:

  sapply(split(x, ff), sum)

which actually *does* give you 0 for levels 2 and 4. The reason 'tapply'
gives NA instead is that (again ignoring some details) it does:

  sapply(split(x, as.numeric(ff)), sum)

which only looks at the actual values of 'ff', not its levels.

Note that the value zero is not a special case. For instance,

  sapply(split(x, ff), prod)

gives the 'empty product', i.e., 1.

Exercise for the reader: Note that

  sapply(split(x, ff, drop=TRUE), sum)

gives you the values of (just) the non-empty levels. Now, why does

  sapply(split(x, ff), sum, drop=TRUE)

give the wrong value (1) for the empty levels, while

  sapply(split(x, ff), sum, drop=FALSE)

gives the correct value? (The answer should be fairly obvious, but it's
an easy mistake to make.)

-- 
Karl Ove Hufthammer
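
For anyone who wants to run the comparison side by side, here is a minimal sketch. The variable names (by_tapply, by_split, and so on) are only illustrative, and the last call assumes R 3.4.0 or later, where 'tapply' has a 'default' argument. The exact sums depend on the R version's random number generator, so no numeric output is shown.

```r
set.seed(42)
ff <- factor(sample(c(1, 3, 5), 42, TRUE), levels = 1:5)
x  <- runif(42)

# tapply() calls FUN only on the non-empty groups; empty cells are left as NA
by_tapply <- tapply(x, ff, sum)

# split() keeps one (possibly empty) element per level, so sum() is
# called on numeric(0) for levels 2 and 4 and returns 0
by_split <- sapply(split(x, ff), sum)

# Same idea with prod(): the empty product is 1
by_prod <- sapply(split(x, ff), prod)

# drop = TRUE as an argument to split() discards the unused levels
by_drop <- sapply(split(x, ff, drop = TRUE), sum)

# Assumption: R >= 3.4.0, where 'default' sets the value of empty cells
by_default <- tapply(x, ff, sum, default = 0)

print(by_tapply)
print(by_split)
print(by_prod)
print(by_drop)
print(by_default)
```

If the 'default' argument is not available in your version of R, the split()/sapply() form gives the same zeros for the empty levels.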