Re: [Rd] suggestion for extending ?as.factor
On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote:
> Petr Savicky wrote:
> > For this, we get
> >
> >   > convert(0.3)
> >   [1] "0.3"
> >   > convert(1/3)
> >   [1] "0.3333333333333333"   # 16 digits suffice
> >   > convert(0.12345)
> >   [1] "0.12345"
> >   > convert(0.12345678901234567)
> >   [1] "0.12345678901234566"
> >   > 0.12345678901234567 == as.numeric("0.12345678901234566")
> >   [1] TRUE
> >
> > This algorithm is slower than a single call to sprintf("%.17g", x), but
> > it produces nicer numbers, if possible, and guarantees that the value is
> > always preserved.
>
> Yes, but
>
>   > convert(0.2+0.1)
>   [1] "0.30000000000000004"

I am not sure whether this should be considered an error. Computing with
decimal numbers requires some care. If we want to get the result which is
the closest to 0.3, it is better to use round(0.1+0.2, digits=1).

> I think that the real issue is that we actually do want almost-equal
> numbers to be folded together.

Yes, this is frequently needed. A related question of an approximate match
in a numeric sequence was discussed in the R-devel thread "Match .3 in a
sequence" from March.

In order to fold almost-equal numbers together, we need to specify a
tolerance. The tolerance depends on the application. In my opinion, it is
hard to choose a tolerance which could be a good default.

> The most relevant case I can conjure up is this (permutation testing):
>
>   > zz <- replicate(20000, sum(sample(sleep$extra, 10)))
>   > length(table(zz))
>   [1] 427
>   > length(table(signif(zz, 7)))
>   [1] 281

In order to obtain the correct result in this example, it is possible to
use zz <- signif(zz, 7), as you suggest, or zz <- round(zz, digits=1), and
use the resulting zz for all further calculations.

> Notice that the discrepancy comes from sums that really are identical
> values (in decimal arithmetic), but where the binary FP inaccuracy makes
> them slightly different.
>
> [for a nice picture, continue the example with
>
>   > tt <- table(signif(zz, 7))
>   > plot(as.numeric(names(tt)), tt, type="h")

The form of this picture is not due to rounding errors. The picture may be
obtained even within integer arithmetic as follows.

  ss <- round(10*sleep$extra)
  zz <- replicate(20000, sum(sample(ss, 10)))
  tt <- table(zz)
  plot(as.numeric(names(tt)), tt, type="h")

The variation of the frequencies is due to two effects.

First, each individual value of the sum occurs with low probability, so
20000 replications do not suffice to get a low variance of the individual
frequencies. Using 1000 repetitions of the code above, I obtained
estimates of some of the probabilities. The most frequent sums have
probability approximately p = 0.0089 for a single sample. With n = 20000
replications, we get the mean frequency p*n = 178 and standard deviation
sqrt(p*(1-p)*n) = 13.28216.

The other cause of variation of the frequencies is that even the true
distribution of the sums has a lot of local minima and maxima. The mean of
1000 repetitions of the above table, restricted to values of the sum in
the interval 140:168, produced the estimates

  value   mean frequency (over 1000 tables)
  140     172.411
  141     172.090
  142     174.297
  143     166.039
  144     159.260
  145     163.891
  146     162.317
  147     165.460
  148     177.870
  149     177.971
  150     177.754
  151     178.525   local maximum
  152     169.851
  153     164.443   local minimum
  154     168.488   the mean value of the sum
  155     164.816   local minimum
  156     169.297
  157     179.248   local maximum
  158     177.799
  159     176.743
  160     177.777
  161     164.173
  162     162.585
  163     164.641
  164     159.913
  165     165.932
  166     173.014
  167     172.276
  168     171.612

The local minima and maxima are visible here. The mean value 154 is
approximately the center of the histogram.
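As a minimal sketch of the calculation above (only the quoted numbers
p = 0.0089 and n = 20000 are used; nothing else is assumed):

  p <- 0.0089             # estimated probability of the most frequent sum
  n <- 20000              # replications per table
  p * n                   # expected frequency: 178
  sqrt(p * (1 - p) * n)   # standard deviation: 13.28216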
Petr.
Re: [Rd] suggestion for extending ?as.factor
> "PD" == Peter Dalgaard > on Mon, 04 May 2009 19:28:06 +0200 writes: PD> Petr Savicky wrote: >> On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote: >> [snip] >>> Let me quickly expand the tasks we have wanted to address, when >>> I started changing factor() for R-devel. >>> >>> 1) R-core had unanimously decided that R 2.10.0 should not allow >>> duplicated levels in factors anymore. >>> >>> When working on that, I had realized that quite a few bits of code >>> were implicitly relying on duplicated levels (or something >>> related), see below, so the current version of R-devel only >>> *warns* in some cases where duplicated levels are produced >>> instead of giving an error. >>> >>> What I had also found was that basically, even our own (!) code >>> and quite a bit of user code has more or less relied on other >>> things that were not true (even though "almost always" fulfilled): >>> >>> 2) if x contains no duplicated values, then factor(x) should neither >>> >>> 3) factor(x) constructs a factor object with *unique* levels >>> >>> {This is what our decision "1)" implies and now enforces} >>> >>> 4) as.numeric(names(table(x))) should be identical to unique(x) >>> >>> where "4)" is basically ensured by "3)" as table() calls >>> factor() for non-factor args. >>> >>> As mentioned the bad thing is that "2) - 4)" are typically >>> fulfilled in all tests package writers would use. >>> >>> Concerning '3)' [and '1)'], as you know, inside R-core we have >>> proposed to at least ensure that `levels<-` >>> should not allow duplicated levels, >>> and I had concluded that >>> a) factor() really should use `levels<-` instead of the low-level >>> attr(., "levels") <- >>> b) factor() itself must make sure that the default levels became unique. >>> >>> --- >>> >>> Given Petr's (and more) examples and the strong requirement of >>> "user convenience" and back-compatibility, >>> I now tend to agree (with Peter) that we cannot ensure all of 2) >>> and 4) still allow factor() to behave as it did for "rounded >>> decimal numbers", >>> and consequently would have to (continue to) not ensuring >>> properties (2) and (4). >>> Something quite unfortunate, since, as I said, much useR code >>> implicitly relies on these, and so that code is buggy even >>> though the bug will only show in exceptional cases. >> >> Let me suggest to consider also the following algorithm: determine >> the number of digits needed to preserve the double value exactly for >> each number separately. An R code prototype demonstrating the >> algorithm could be as follows >> >> convert <- function(x) # x should be a single number >> { >> for (d in 1:16) { >> y <- sprintf(paste("%.", d, "g", sep=""), x) >> if (x == as.numeric(y)) { >> return(y) >> } >> } >> return(sprintf("%.17g", x)) >> } >> >> For this, we get >> >> > convert(0.3) >> [1] "0.3" >> > convert(1/3) >> [1] "0." # 16 digits suffice >> > convert(0.12345) >> [1] "0.12345" >> > convert(0.12345678901234567) >> [1] "0.12345678901234566" >> > 0.12345678901234567 == as.numeric("0.12345678901234566") >> [1] TRUE >> >> This algorithm is slower than a single call to sprintf("%.17g", x), but it >> produces nicer numbers, if possible, and guarantees that the value is >> always preserved. PD> Yes, but >> convert(0.2+0.1) PD> [1] "0.30004" PD> I think that the real issue is that we actually do want almost-equal PD> numbers to be folded together. in most cases, but not all {*when* levels is not specified}, but useR's code sometimes *does* rely on factor() / table() using exact values. 
Also, what should happen when the user explicitly calls

  factor(x, levels = sort(unique(x)))

At least in that case we really should *not* fold almost-equal values;
the "old" code (<= R 2.9.0) did fold them in border cases and led to
non-unique levels.

Can we agree that any rounding etc. - if needed - will only happen when

  1) missing(levels)
  2) is.numeric(x) || is.complex(x)

I'm also thinking of at least keeping the current behavior as an option,
e.g. by

  factor(x, ..., keepUniqueness = TRUE, ...)

where the default would be keepUniqueness = FALSE.

  PD> The most relevant case I can conjure up is this (permutation testing):

  >>   > zz <- replicate(20000, sum(sample(sleep$extra, 10)))
  >>   > length(table(zz))
  PD> [1] 427
  >>   > length(table(signif(zz, 7)))
  PD> [1] 281

  PD> Notice that the discrepancy comes from sums that really are identical
  PD> values (in decimal arithmetic), but where the binary FP inaccuracy
  PD> makes them slightly different.
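A minimal sketch of the reliance in question (whether the last call folds
the two values depends on the factor() implementation, so its result
differs between R versions):

  x <- c(0.3, 0.1 + 0.2)   # two distinct doubles that print identically
  x[1] == x[2]             # FALSE: the binary values differ
  length(unique(x))        # 2
  length(table(x))         # 2 if factor() keeps exact values, 1 if it folds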
Re: [Rd] Rd parsing
On 05/05/2009 5:04 AM, robin hankin wrote:

> Hi Duncan
>
> I asked this question to R-devel, and there was no reply, so I thought
> I'd ask you directly. Any ideas?
>
> best wishes
> Robin
>
> I am having difficulty including a LaTeX formula in an Rd file. The
> example given in section 2.7 of 'Parsing Rd files' is:
>
>   \deqn{ f(x) = \left\{
>       \begin{array}{ll}
>           0 & x<0  \\
>           1 & x\ge 0
>       \end{array}
>     \right. }{non latex}
>
> For me, this gives:
>
>   \deqn{ f(x) = \left\{
>       \begin{array}{ll}
>           0 \& x<0  \bsl{}
>           1 \& x\ge 0
>       \end{array}
>     \right. }{}
>
> in the tex file, which is not desired because the ampersand is escaped:
> the '\&' symbol appears in the dvi file, and I want an ampersand to
> indicate alignment. Also, the '\\' appears as \bsl{}, which is
> undesired; the resulting dvi file (made by R CMD Rd2dvi) looks wrong.
>
> How do I write the Rd file so as to produce non-escaped ampersands?

I think the way the docs show it should work, but there's a bug in the
conversion. If you use the new converter tools::Rd2latex, it comes out as
expected. So there's a bug in the Perl version of Rdconv, presumably in
share/perl/R/Rdconv.pm, and you might want to try to track it down and
fix it if you know Perl well enough.

An alternative might be some kludge involving \input{}: put the equation
in its own file where Rdconv will ignore it.

I've cc'd R-devel on this again; maybe someone else can jump in with a
better idea.

Duncan Murdoch
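As a quick check of the suggested workaround (the Rd file name here is
hypothetical):

  # convert a single Rd file with the new parser-based converter
  tools::Rd2latex("man/myfun.Rd", out = "myfun.tex")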
Re: [Rd] suggestion for extending ?as.factor
Petr Savicky wrote:
>> Notice that the discrepancy comes from sums that really are identical
>> values (in decimal arithmetic), but where the binary FP inaccuracy makes
>> them slightly different.
>>
>> [for a nice picture, continue the example with
>>
>>   > tt <- table(signif(zz,7))
>>   > plot(as.numeric(names(tt)), tt, type="h")
>
> The form of this picture is not due to rounding errors. The picture may
> be obtained even within integer arithmetic as follows.
>
>   ss <- round(10*sleep$extra)
>   zz <- replicate(20000, sum(sample(ss, 10)))
>   tt <- table(zz)
>   plot(as.numeric(names(tt)), tt, type="h")

I know. The point was rather that if you are not careful with rounding,
you get some of the bars wrong (you get 2 or 3 small bars very close to
each other instead of one longer one). Computed p values from permutation
tests (as in mean(sim >= obs)) also need care for the same reason.

> The variation of the frequencies is due to two effects.
>
> First, each individual value of the sum occurs with low probability, so
> 20000 replications do not suffice [...]
>
> The other cause of variation of the frequencies is that even the true
> distribution of the sums has a lot of local minima and maxima.

Yes. You can actually generate the exact distribution easily using

  d <- combn(sleep$extra, 10, sum)
  d <- signif(d, 7)
  tt <- table(d)
  plot(as.numeric(names(tt)), tt, type="h")

and if you omit the signif() bit (not with R-devel):

  > table(table(names(table(d))))
    1   2   3
  137 161  17

i.e. 315 distinct values, but over half occur in duplicate or triplicate
versions.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907
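A small sketch of the p-value point (illustrative only: obs, taken here
as the sum of the first ten values, is a made-up statistic, and
signif(., 7) is the folding used earlier in the thread):

  obs <- sum(sleep$extra[1:10])              # hypothetical observed sum
  sim <- replicate(20000, sum(sample(sleep$extra, 10)))
  mean(sim >= obs)                           # FP noise can misclassify ties
  mean(signif(sim, 7) >= signif(obs, 7))     # fold almost-equal sums first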
Re: [Rd] unexpected behavior of rpart 3.1-43, loss matrix
Just expressing MHO:

The algorithm cannot give predictions in classes that never appear in the
training data, so any entries in the loss matrix related to such classes
are irrelevant w.r.t. the training data. They should be removed before
feeding to rpart (or any other algorithm that can make use of a loss
matrix). As I see it, it's the responsibility of the data analyst to take
care of such things. The current error message may not make it obvious
what the problem is, but if I were the developer, I would not write the
code to accept such disparate input without issuing an error.

Andy

> -----Original Message-----
> From: r-devel-boun...@r-project.org
> [mailto:r-devel-boun...@r-project.org] On Behalf Of Lars
> Sent: Thursday, April 30, 2009 12:43 PM
> To: r-devel@r-project.org
> Subject: [Rd] unexpected behavior of rpart 3.1-43, loss matrix
>
> Hi,
>
> I just noticed that rpart behaves unexpectedly when performing
> classification learning and specifying a loss matrix: if the response
> variable y is a factor and not all levels of the factor occur in the
> observations, rpart exits with an error.
>
>   > df = data.frame(attr=1:5, class=factor(c(2,3,1,5,3), levels=1:6))
>   > rpart(class~attr, df, parms=list(loss=matrix(0,6,6)))
>   Error in (get(paste("rpart", method, sep = ".")))(Y, offset, parms, wt) :
>     Wrong length for loss matrix
>
> Note that while the levels of the factor range over 1:6, only levels
> 1, 2, 3 and 5 occur in the concrete observation data.
>
> The error is caused by this code in rpart.class:
>
>   fy <- as.factor(y)
>   y <- as.integer(fy)
>   numclass <- max(y[!is.na(y)])
>   ...
>   temp2 <- parms$loss
>   if (length(temp2) != numclass^2)
>       stop("Wrong length for loss matrix")
>
> For the example, numclass is set to 5 instead of 6.
>
> While for this small example it may be debatable whether or not
> numclass should be 6, consider a set of data for which the response
> variable has a certain range. Then it may be the case that for some
> data, not all levels of the response variable occur, while at the same
> time it is desirable to use the same loss matrix when training a
> decision tree from the data.
>
> Having said that, I am very happy with the rpart package and with its
> high configurability.
>
> best regards
> lars
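A sketch of the clean-up Andy recommends, applied to Lars' example (the
unit loss matrix is made up for illustration; the original report used
matrix(0,6,6)):

  library(rpart)
  df <- data.frame(attr = 1:5, class = factor(c(2,3,1,5,3), levels = 1:6))
  loss6 <- matrix(1, 6, 6) - diag(6)        # unit loss off the diagonal
  present <- which(table(df$class) > 0)     # classes that occur: 1, 2, 3, 5
  df$class <- factor(df$class)              # drop the unused levels
  fit <- rpart(class ~ attr, data = df,
               parms = list(loss = loss6[present, present]))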
Re: [Rd] script window background and text color (PR#13446)
4dscape.com> writes:

> Full_Name: alexios galanos
> Version: 2.8.1
> OS: windows/vista
> Submission from: (NULL) (81.100.160.71)
>
> While the script editor now respects user preferences for the
> background color in 2.8.1, it does not do so for the user text color,
> defaulting to black. So my preference of having, for example, a black
> background with green text fails (it is all black) in the script
> editor (the console is ok).

I believe this is not a bug, at least not in R 2.9.0. I did not check it
in 2.8.1, to which you refer. Perhaps it is just a lack of documentation:
I could not find any documentation for the options below.

The colors in the Windows GUI console, pager, data editor and syntax
editor can be set under Windows both using the GUI and with appropriate
entries in the Rconsole file. There you can use the settings:

  Console:        background, normaltext, usertext
  Pager:          pagerbg, pagertext, highlight
  Data editor:    dataeditbg, dataedittext, dataedituser
  Syntax editor:  editorbg, editortext

For example, you can get a black background and green text in the syntax
editor by adding the following:

  editorbg = black
  editortext = green

I'm not sure whether all but the initial three were available in earlier
versions of R. Perhaps they were, but they inherited the values from the
first three...

Perhaps this should be documented somewhere? In the Windows Rd file for
file.show, data.entry?

Best,
Michal

___________________________
Michal Bojanowski
Department of Sociology, Utrecht University
m.j.bojanow...@uu.nl
http://bojan.3e.pl/weblog
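For reference, a fuller hypothetical fragment combining the settings named
above (the color choices are examples only):

  ## Rconsole fragment
  background = black
  normaltext = green
  usertext = yellow
  pagerbg = black
  pagertext = green
  editorbg = black
  editortext = green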
Re: [Rd] suggestion for extending ?as.factor
On Tue, May 05, 2009 at 11:27:36AM +0200, Peter Dalgaard wrote:
> I know. The point was rather that if you are not careful with rounding,
> you get some of the bars wrong (you get 2 or 3 small bars very close to
> each other instead of one longer one). Computed p values from
> permutation tests (as in mean(sim>=obs)) also need care for the same
> reason.

OK. Now I understand the point of the example.

I think that it is the responsibility of the user to find the right way
to eliminate the influence of the rounding errors, since this may require
a different approach in different situations. However, I can also accept
the point of view that as.factor() should do this to some extent by
default. For example, we may require that as.factor() is consistent with
as.character() in the way it maps different numbers to the same string.

At first glance, one could expect that to implement this, it is
sufficient if as.factor(x) performs

  x <- as.numeric(as.character(x))
  levels <- as.character(sort(unique(x)))

Unfortunately, on some platforms (tested on Intel with SSE, R-2.10.0,
2009-05-02 r48453), this may produce repeated levels:

  x <- c(0.6807853176681814000304, 0.6807853176681809559412)
  x <- as.numeric(as.character(x))
  levels <- as.character(sort(unique(x)))
  levels                  # "0.68078531766818" "0.68078531766818"
  levels[1] == levels[2]  # TRUE

Using the default Intel arithmetic, we get a single level, namely
"0.680785317668181".

Petr.
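A minimal sketch of one way to inspect this on a given platform; printing
with 17 significant digits always distinguishes distinct doubles, which is
the round-trip guarantee the convert() prototype earlier in the thread
falls back on:

  x <- c(0.6807853176681814000304, 0.6807853176681809559412)
  x[1] == x[2]          # FALSE: two distinct doubles
  as.character(x)       # 15 digits; may yield the same string twice
  sprintf("%.17g", x)   # 17 digits: always distinct for distinct doubles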