On Wed, Nov 07, 2007 at 08:15:17AM +0100, Peter Dalgaard wrote:
> Andrew Robinson wrote:
> >These are important concerns.  It seems to me that adding an argument
> >as suggested by Bill will allow the user to side-step the problem
> >identified by Brian.
> >
> >Bill, under what kinds of circumstances would you anticipate a
> >significant time penalty?  I would be happy to check those out with
> >some simulations.
> >
> >If the timing seems acceptable, I can write a patch for tapply.R and
> >tapply.Rd if anyone in the core is willing to consider them.  Please
> >contact me on or off list if so.
> >
>
> There's another concern: tapply (et al.) has the ... args passed on to
> FUN which means that you have to be really careful with argument names.
>
> Could I just interject that we already have
>
> > airquality$Month <- factor(airquality$Month,levels=4:9) # April not there
> > unlist(lapply(
> +   split(airquality$Ozone, airquality$Month, drop=F),sum, na.rm=T))
>    4    5    6    7    8    9
>    0  614  265 1537 1559  912
>
> (splitting on multiple factors gets a bit involved, though)
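
For anyone who wants to see the two behaviours side by side, here is a minimal sketch using the built-in airquality data. The names aq, by_tapply and by_split are illustrative only and do not appear in the thread; the sketch works on a copy so the original data frame is left alone.

```r
# Work on a copy of the built-in dataset; 'aq' is an illustrative name.
aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 4:9)  # level "4" (April) has no rows

# tapply() leaves NA in the cell that belongs to the empty level.
by_tapply <- tapply(aq$Ozone, aq$Month, sum, na.rm = TRUE)

# split() with drop = FALSE keeps the empty level, so sum() is called on a
# zero-length vector for April and returns 0.
by_split <- sapply(split(aq$Ozone, aq$Month, drop = FALSE), sum, na.rm = TRUE)

print(by_tapply)
print(by_split)
```

The difference is only in the cell for the empty level: tapply() never calls FUN there, whereas the split() route does.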
For that matter, we have

airquality$Month <- factor(airquality$Month,levels=4:9)
air.sum <- tapply(airquality$Ozone, airquality$Month, sum, na.rm=T)
air.sum[is.na(air.sum)] <- 0

which is equivalent to what I ended up using whilst fiddling with
tapply.

Andrew

> >Best wishes to all,
> >
> >Andrew
> >
> >
> >On Tue, Nov 06, 2007 at 07:23:56AM +0000, Prof Brian Ripley wrote:
> >
> >>On Tue, 6 Nov 2007, [EMAIL PROTECTED] wrote:
> >>
> >>>Unfortunately I think it would break too much existing code.  tapply()
> >>>is an old function and many people have gotten used to the way it works
> >>>now.
> >>>
> >>It is also not necessarily desirable: FUN(numeric(0)) might be an error.
> >>For example:
> >>
> >>>Z <- data.frame(x=rnorm(10), f=rep(c("a", "b"), each=5))[1:5, ]
> >>>tapply(Z$x, Z$f, sd)
> >>>
> >>but sd(numeric(0)) is an error.  (Similar things involving var are 'in
> >>the wild' and so would be broken.)
> >>
> >>>This is not to suggest there could not be another argument added at the
> >>>end to indicate that you want the new behaviour, though.  e.g.
> >>>
> >>>tapply <- function (X, INDEX, FUN=NULL, ..., simplify=TRUE,
> >>>handle.empty.levels = FALSE)
> >>>
> >>>but this raises the question of what sort of time penalty the
> >>>modification might entail.  Probably not much for most situations, I
> >>>suppose.  (I know this argument name looks long, but you do need a
> >>>fairly specific argument name, or it will start to impinge on the ...
> >>>argument.)
> >>>
> >>>Just some thoughts.
> >>>
> >>>Bill Venables.
> >>>
> >>>Bill Venables
> >>>CSIRO Laboratories
> >>>PO Box 120, Cleveland, 4163
> >>>AUSTRALIA
> >>>Office Phone (email preferred): +61 7 3826 7251
> >>>Fax (if absolutely necessary):  +61 7 3826 7304
> >>>Mobile:                         +61 4 8819 4402
> >>>Home Phone:                     +61 7 3286 7700
> >>>mailto:[EMAIL PROTECTED]
> >>>http://www.cmis.csiro.au/bill.venables/
> >>>
> >>>-----Original Message-----
> >>>From: [EMAIL PROTECTED]
> >>>[mailto:[EMAIL PROTECTED] On Behalf Of Andrew Robinson
> >>>Sent: Tuesday, 6 November 2007 3:10 PM
> >>>To: R-Devel
> >>>Subject: [Rd] A suggestion for an amendment to tapply
> >>>
> >>>Dear R-developers,
> >>>
> >>>when tapply() is invoked on factors that have empty levels, it returns
> >>>NA.  This behaviour is in accord with the tapply documentation, and is
> >>>reasonable in many cases.  However, when FUN is sum, it would also
> >>>seem reasonable to return 0 instead of NA, because "the sum of an
> >>>empty set is zero, by definition."
> >>>
> >>>I'd like to raise a discussion of the possibility of an amendment to
> >>>tapply.
> >>>
> >>>The attached patch changes the function so that it checks if there are
> >>>any empty levels, and if there are, replaces the corresponding NA
> >>>values with the result of applying FUN to the empty set.  Eg in the
> >>>case of sum, it replaces the NA with 0, whereas with mean, it replaces
> >>>the NA with NA, and issues a warning.
> >>>
> >>>This change has the following advantage: tapply and sum work better
> >>>together.  Arguably, tapply and any other function that has a non-NA
> >>>response to the empty set will also work better together.
> >>>Furthermore, tapply shows a warning if FUN would normally show a
> >>>warning upon being evaluated on an empty set.  That deviates from
> >>>current behaviour, which might be bad, but also provides information
> >>>that might be useful to the user, so that would be good.
> >>>
> >>>The attached script provides the new function in full, and
> >>>demonstrates its application in some simple test cases.
> >>>
> >>>Best wishes,
> >>>
> >>>Andrew
> >>>
> >>--
> >>Brian D. Ripley,                  [EMAIL PROTECTED]
> >>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> >>University of Oxford,             Tel:  +44 1865 272861 (self)
> >>1 South Parks Road,                     +44 1865 272866 (PA)
> >>Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> >>
> >
>
> --
>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - ([EMAIL PROTECTED])              FAX: (+45) 35327907

--
Andrew Robinson
Department of Mathematics and Statistics            Tel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia         Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
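
The opt-in behaviour discussed in this thread could look roughly like the wrapper below. This is a sketch only and is not the patch attached to the original post, which is not reproduced here. The function name tapply_empty is hypothetical; the argument name handle.empty.levels is the one suggested in the quoted message. The sketch assumes FUN returns a single value per cell.

```r
# Hypothetical wrapper (not the attached patch): optionally fill the cells
# for empty levels with FUN applied to a zero-length vector.
tapply_empty <- function(X, INDEX, FUN, ..., handle.empty.levels = FALSE) {
  FUN <- match.fun(FUN)
  out <- tapply(X, INDEX, FUN, ...)
  if (handle.empty.levels) {
    # table() counts observations per cell in the same layout that
    # tapply() uses for its result.
    empty <- as.vector(table(INDEX) == 0)
    if (any(empty)) {
      # Evaluate FUN once on the empty set; any error or warning from FUN
      # reaches the caller, which is the concern raised about sd().
      out[empty] <- FUN(X[0], ...)
    }
  }
  out
}

aq <- datasets::airquality
aq$Month <- factor(aq$Month, levels = 4:9)  # April has no rows

res_default <- tapply_empty(aq$Ozone, aq$Month, sum, na.rm = TRUE)
res_filled  <- tapply_empty(aq$Ozone, aq$Month, sum, na.rm = TRUE,
                            handle.empty.levels = TRUE)
print(res_default)
print(res_filled)
```

Because handle.empty.levels comes after ... in the signature, it has to be spelled out in full by the caller; anything else is passed on to FUN, which is why a long, specific name was recommended.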