On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote: [snip] > Let me quickly expand the tasks we have wanted to address, when > I started changing factor() for R-devel. > > 1) R-core had unanimously decided that R 2.10.0 should not allow > duplicated levels in factors anymore. > > When working on that, I had realized that quite a few bits of code > were implicitly relying on duplicated levels (or something > related), see below, so the current version of R-devel only > *warns* in some cases where duplicated levels are produced > instead of giving an error. > > What I had also found was that basically, even our own (!) code > and quite a bit of user code has more or less relied on other > things that were not true (even though "almost always" fulfilled): > > 2) if x contains no duplicated values, then factor(x) should neither > > 3) factor(x) constructs a factor object with *unique* levels > > {This is what our decision "1)" implies and now enforces} > > 4) as.numeric(names(table(x))) should be identical to unique(x) > > where "4)" is basically ensured by "3)" as table() calls > factor() for non-factor args. > > As mentioned the bad thing is that "2) - 4)" are typically > fulfilled in all tests package writers would use. > > Concerning '3)' [and '1)'], as you know, inside R-core we have > proposed to at least ensure that `levels<-` > should not allow duplicated levels, > and I had concluded that > a) factor() really should use `levels<-` instead of the low-level > attr(., "levels") <- .... > b) factor() itself must make sure that the default levels became unique. > > --- > > Given Petr's (and more) examples and the strong requirement of > "user convenience" and back-compatibility, > I now tend to agree (with Peter) that we cannot ensure all of 2) > and 4) still allow factor() to behave as it did for "rounded > decimal numbers", > and consequently would have to (continue to) not ensuring > properties (2) and (4). > Something quite unfortunate, since, as I said, much useR code > implicitly relies on these, and so that code is buggy even > though the bug will only show in exceptional cases.
Let me suggest to consider also the following algorithm: determine the number of digits needed to preserve the double value exactly for each number separately. An R code prototype demonstrating the algorithm could be as follows convert <- function(x) # x should be a single number { for (d in 1:16) { y <- sprintf(paste("%.", d, "g", sep=""), x) if (x == as.numeric(y)) { return(y) } } return(sprintf("%.17g", x)) } For this, we get > convert(0.3) [1] "0.3" > convert(1/3) [1] "0.3333333333333333" # 16 digits suffice > convert(0.12345) [1] "0.12345" > convert(0.12345678901234567) [1] "0.12345678901234566" > 0.12345678901234567 == as.numeric("0.12345678901234566") [1] TRUE This algorithm is slower than a single call to sprintf("%.17g", x), but it produces nicer numbers, if possible, and guarantees that the value is always preserved. Petr. ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel