I have now committed the new split(x, f, drop = FALSE) to R-devel --- entailing non-backward compatible behavior, but consistency with factor indexing (and with S-plus) --- split() and "split<-" and unsplit() functions and methods to R-devel.
This does automatically fix the original posters "boxplot by factor" bug. >>>>> "MM" == Martin Maechler <[EMAIL PROTECTED]> >>>>> on Mon, 4 Jul 2005 09:15:59 +0200 writes: >>>>> "PD" == Peter Dalgaard <[EMAIL PROTECTED]> >>>>> on 28 Jun 2005 14:57:42 +0200 writes: PD> "Liaw, Andy" <[EMAIL PROTECTED]> writes: >>>> The issue is not with boxplot, but with split. boxplot.formula() >>>> calls boxplot(split(split(mf[[response]], mf[-response]), ...), >>>> but look at what split() returns when there are empty levels in >>>> the factor: >>>> >>>> > f <- factor(gl(3, 6), levels=1:5) >>>> > y <- rnorm(f) >>>> > split(y, f) >>>> $"1" >>>> [1] 0.4832124 1.1924811 0.3657797 1.7400198 0.5577356 0.9889520 >>>> >>>> $"2" >>>> [1] -1.1296642 -0.4808355 -0.2789933 0.1220718 0.1287742 -0.7573801 >>>> >>>> $"3" >>>> [1] 1.2320902 0.5090700 -1.5508074 2.1373780 1.1681297 -0.7151561 >>>> >>>> The "culprit" is the following in split.default(): >>>> >>>> f <- factor(f) >>>> >>>> which drops empty levels in f, if there are any. BTW, ?split doesn't >>>> mention what it does in such situation. Perhaps it should? >>>> >>>> If this is to be "fixed", I suppose an additional argument, e.g., >>>> drop=TRUE, can be added, and the corresponding line mentioned >>>> above changed to something like: >>>> >>>> if (drop || !is.factor(f)) f <- factor(f) >>>> >>>> Then this additional argument can be pass on from boxplot.formula() to >>>> split(). PD> Alternatively, I suspect that the intention was as.factor() rather PD> than factor(). MM> at first I thought Peter was right; but the real source of MM> split.default contains a comment (!) and that line is MM> f <- factor(f) # drop extraneous levels MM> so it seems, this was done there very much on purpose. MM> OTOH, S(-plus) has implemented it quite a bit differently, and actually MM> does keep the empty levels in the example MM> f <- factor(rep(1:3, each=6), levels=1:5); y <- rnorm(f); split(y, f) PD> It does require a bit of care to fix it that way, PD> though. There could be problems with empty levels popping up in PD> unexpected places. MM> Indeed! MM> Given the new facts, I think we want to go in Andy's direction MM> with a new argument, 'drop' MM> A Peter mentioned, the real question is about its default. MM> "drop = TRUE" would be fully compatible with previous versions of R. MM> "drop = FALSE" would be compatible with S and S-plus. MM> I'm going to implement it, and try to see if 'drop = FALSE' MM> gives changes for R and its standard packages; if 'yes', that MM> would be an indication that such a R-back-compatibility breaking MM> change was not a good idea. If 'no', I could commit it and see MM> if it has an effect on the CRAN packages.... MM> Of course, since split() and split()<- are S3 generics, and MM> since there's also unsplit(), this entails a whole slew of MM> changes {adding a "drop = FALSE" argument everywhere!} MM> and I presume will break everyone's code who has written own MM> split.foobar methods.... MM> great... MM> Martin MM> The change doesn't seem to affect the "standard" packages at all MM> which is good. On CRAN, it seems there are two packages only that MM> have split() or split()<- methods, namely 'spatstat' and 'compositions'. MM> If we introduced the extra argument 'drop', MM> these and every other user code defining split methods would MM> have to be updated to be compatible with the changed (S3) MM> generic having an extra argument 'drop'. MM> With this in mind, after more thought, I think that Peter's MM> initial proposal ---just replacing 'factor()' by 'as.factor()' MM> inside split--- seems to be nicer than introducing 'drop' and MM> *change* the default behavior to 'drop = FALSE' for the MM> following reasons : MM> 1) people who rely on the current behavior would have to change MM> their calls to split() anyway; MM> 2) instead of calling MM> split(x, f, drop=TRUE) MM> they can as well go for MM> split(x, factor(f)) MM> which has identical effect but does not introduce an extra MM> argument 'drop'. MM> 3) advantage of slightly higher compatibility with S MM> --- MM> I intend to change this in R-devel MM> {with appropriate notes in NEWS !} during this week, unless MM> someone finds good reasons for a different (or no) change. MM> Martin MM> ______________________________________________ MM> R-devel@r-project.org mailing list MM> https://stat.ethz.ch/mailman/listinfo/r-devel MM> !DSPAM:42c8e272288132092019954! ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel