Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Petr Savicky
On Mon, May 04, 2009 at 07:28:06PM +0200, Peter Dalgaard wrote:
> Petr Savicky wrote:
> > For this, we get
> > 
> >   > convert(0.3)
> >   [1] "0.3"
> >   > convert(1/3)
> >   [1] "0.3333333333333333" # 16 digits suffice
> >   > convert(0.12345)
> >   [1] "0.12345"
> >   > convert(0.12345678901234567)
> >   [1] "0.12345678901234566"
> >   > 0.12345678901234567 == as.numeric("0.12345678901234566")
> >   [1] TRUE
> > 
> > This algorithm is slower than a single call to sprintf("%.17g", x), but it
> > produces nicer numbers, if possible, and guarantees that the value is
> > always preserved.
> 
> Yes, but
> 
> > convert(0.2+0.1)
> [1] "0.30004"

I am not sure whether this should be considered an error. Computing with decimal
numbers requires some care. If we want the result closest to 0.3, it is better
to use round(0.1+0.2, digits=1).
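
For illustration, here is a quick check in R (the expected output is shown in
the comments and follows directly from IEEE double arithmetic):

  sprintf("%.17g", 0.1 + 0.2)                   # "0.30000000000000004"
  sprintf("%.17g", round(0.1 + 0.2, digits=1))  # "0.29999999999999999", the double closest to 0.3
  (0.1 + 0.2) == 0.3                            # FALSE
  round(0.1 + 0.2, digits=1) == 0.3             # TRUE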
 
> I think that the real issue is that we actually do want almost-equal
> numbers to be folded together.

Yes, this is frequently needed. A related question, approximate matching in a
numeric sequence, was discussed in the R-devel thread "Match .3 in a sequence"
from March. In order to fold almost-equal numbers together, we need to specify
a tolerance, and the right tolerance depends on the application. In my opinion,
it is hard to choose a tolerance that would make a good default.
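
As a minimal sketch of what such tolerance-based folding could look like (the
function name fold and the default tol are purely illustrative, not a proposal
for factor() itself):

  fold <- function(x, tol = 1e-8) {
    xs  <- sort(unique(x))
    grp <- cumsum(c(TRUE, diff(xs) > tol))  # start a new group when the gap exceeds tol
    rv  <- tapply(xs, grp, min)             # one representative value per group
    rv[grp[match(x, xs)]]                   # map each x to its group representative
  }
  fold(c(0.3, 0.1 + 0.2, 1/3))              # folds 0.3 and 0.1+0.2 together, keeps 1/3 apart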

> The most relevant case I can conjure up
> is this (permutation testing):
> 
> > zz <- replicate(20000,sum(sample(sleep$extra,10)))
> > length(table(zz))
> [1] 427
> > length(table(signif(zz,7)))
> [1] 281

In order to obtain the correct result in this example, it is possible to use
  zz <- signif(zz,7)
as you suggest or
  zz <- round(zz, digits=1)
and use the resulting zz for all further calculations.

> Notice that the discrepancy comes from sums that really are identical
> values (in decimal arithmetic), but where the binary FP inaccuracy makes
> them slightly different.
> 
> [for a nice picture, continue the example with
> 
> > tt <- table(signif(zz,7))
> > plot(as.numeric(names(tt)),tt, type="h")

The form of this picture is not due to rounding errors. The picture may be
obtained even in integer arithmetic as follows.

  ss <- round(10*sleep$extra)
  zz <- replicate(20000,sum(sample(ss,10)))
  tt <- table(zz)
  plot(as.numeric(names(tt)),tt, type="h")

The variation of the frequencies is due to two effects.

First, each individual value of the sum occurs with low probability, so 20000
replications do not suffice to get a low variance of the individual frequencies.
Using 1000 repetitions of the code above, I obtained estimates of some of the
probabilities. The most frequent sums have probability approximately p=0.0089
for a single sample. With n=20000 replications, we get the mean frequency
p*n = 178 and standard deviation sqrt(p*(1-p)*n) = 13.28216.
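
These two figures can be checked directly from the stated p and n:

  p <- 0.0089; n <- 20000
  p * n                   # 178, the expected frequency
  sqrt(p * (1 - p) * n)   # 13.28216, its standard deviation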

The other cause of variation of the frequencies is that even the true
distribution of the sums has a lot of local minima and maxima. The mean of
1000 repetitions of the above table, restricted to values of the sum in the
interval 140:168, produced the estimates

  value mean frequency (over 1000 tables)
  140   172.411
  141   172.090
  142   174.297
  143   166.039
  144   159.260
  145   163.891
  146   162.317
  147   165.460
  148   177.870
  149   177.971
  150   177.754
  151   178.525 local maximum
  152   169.851
  153   164.443 local minimum
  154   168.488 the mean value of the sum
  155   164.816 local minimum
  156   169.297
  157   179.248 local maximum
  158   177.799
  159   176.743
  160   177.777
  161   164.173
  162   162.585
  163   164.641
  164   159.913
  165   165.932
  166   173.014
  167   172.276
  168   171.612
The local minima and maxima are visible here. The mean value 154 is
approximately the center of the histogram.
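
A sketch of how this table of means can be reproduced (the restriction to the
interval 140:168 and the 1000 repetitions follow the description above; this
takes a few minutes to run):

  ss   <- round(10 * sleep$extra)
  freq <- replicate(1000, {
    zz <- replicate(20000, sum(sample(ss, 10)))
    tabulate(factor(zz, levels = 140:168), nbins = 29)
  })
  round(rowMeans(freq), 3)   # mean frequency for each sum value 140..168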

Petr.



Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Martin Maechler
> "PD" == Peter Dalgaard 
> on Mon, 04 May 2009 19:28:06 +0200 writes:

PD> Petr Savicky wrote:
>> On Mon, May 04, 2009 at 05:39:52PM +0200, Martin Maechler wrote:
>> [snip]
>>> Let me quickly expand the tasks we have wanted to address, when
>>> I started changing factor() for R-devel.
>>> 
>>> 1) R-core had unanimously decided that R 2.10.0 should not allow
>>> duplicated levels in factors anymore.
>>> 
>>> When working on that, I had realized that quite a few bits of code
>>> were implicitly relying on duplicated levels (or something
>>> related), see below, so the current version of R-devel only
>>> *warns* in some cases where duplicated levels are produced
>>> instead of giving an error.
>>> 
>>> What I had also found was that basically, even our own (!) code
>>> and quite a bit of user code has more or less relied on other
>>> things that were not true (even though "almost always" fulfilled):
>>> 
>>> 2) if x contains no duplicated values, then  factor(x) should neither
>>> 
>>> 3) factor(x) constructs a factor object with *unique* levels
>>> 
>>> {This is what our decision "1)" implies and now enforces}
>>> 
>>> 4) as.numeric(names(table(x))) should be  identical to unique(x)
>>> 
>>> where "4)" is basically ensured by "3)" as table() calls
>>> factor() for non-factor args.
>>> 
>>> As mentioned the bad thing is that "2) - 4)" are typically
>>> fulfilled in all tests package writers would use.
>>> 
>>> Concerning '3)' [and '1)'], as you know, inside R-core we have
>>> proposed to at least ensure that  `levels<-` 
>>> should not allow duplicated levels, 
>>> and I had concluded that
>>> a) factor() really should use  `levels<-` instead of the low-level  
>>> attr(., "levels") <- 
>>> b) factor() itself must make sure that the default levels became unique.
>>> 
>>> ---
>>> 
>>> Given Petr's (and more) examples and the strong requirement of
>>> "user convenience" and back-compatibility,
>>> I now tend to agree (with Peter) that we cannot ensure all of 2)
>>> and 4) and still allow factor() to behave as it did for "rounded
>>> decimal numbers",
>>> and consequently would have to (continue to) not ensure
>>> properties (2) and (4).
>>> Something quite unfortunate, since, as I said, much useR code
>>> implicitly relies on these, and so that code is buggy even
>>> though the bug will only show in exceptional cases.
>> 
>> Let me suggest to consider also the following algorithm: determine
>> the number of digits needed to preserve the double value exactly for
>> each number separately. An R code prototype demonstrating the 
>> algorithm could be as follows
>> 
>> convert <- function(x) # x should be a single number
>> {
>> for (d in 1:16) {
>> y <- sprintf(paste("%.", d, "g", sep=""), x)
>> if (x == as.numeric(y)) {
>> return(y)
>> }
>> }
>> return(sprintf("%.17g", x))
>> }
>> 
>> For this, we get
>> 
>> > convert(0.3)
>> [1] "0.3"
>> > convert(1/3)
>> [1] "0." # 16 digits suffice
>> > convert(0.12345)
>> [1] "0.12345"
>> > convert(0.12345678901234567)
>> [1] "0.12345678901234566"
>> > 0.12345678901234567 == as.numeric("0.12345678901234566")
>> [1] TRUE
>> 
>> This algorithm is slower than a single call to sprintf("%.17g", x), but it
>> produces nicer numbers, if possible, and guarantees that the value is
>> always preserved.

PD> Yes, but

>> convert(0.2+0.1)
PD> [1] "0.30004"


PD> I think that the real issue is that we actually do want almost-equal
PD> numbers to be folded together. 

in most cases, but not all {*when*  levels is not specified},
but useR's code sometimes *does* rely on  factor()  /  table()
using exact values.

Also, what should happen when the user explicitly calls

  factor(x, levels = sort(unique(x)))

At least in that case we really should *not* fold almost-equal values,
yet the "old" code (<= R 2.9.0) did fold them in border cases,
which led to non-unique levels.

Can we agree that any rounding etc - if needed - will only
happen when
  1) missing(levels)
  2) is.numeric(x) || is.complex(x)

I'm also thinking of at least keeping the current behavior as an
option, e.g. by  factor(x, ..., keepUniqueness = TRUE, ...)
where the default would be keepUniqueness = FALSE.
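
A rough sketch of how such a guard could look, written as a wrapper here for
illustration only (the name keepUniqueness and the use of signif() are
assumptions, not an agreed design):

  factor2 <- function(x, levels, ..., keepUniqueness = FALSE) {
    if (missing(levels) && (is.numeric(x) || is.complex(x)) && !keepUniqueness)
      x <- signif(x, 15)   # fold almost-equal numbers before levels are formed
    if (missing(levels)) factor(x, ...) else factor(x, levels = levels, ...)
  }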

PD> The most relevant case I can conjure up is this (permutation testing):

>> zz <- replicate(20000,sum(sample(sleep$extra,10)))
>> length(table(zz))
PD> [1] 427
>> length(table(signif(zz,7)))
PD> [1] 281

PD> Notice that the discrepancy comes from sums that really are identical
PD> values (in decimal arithmetic), but where the binary FP inaccuracy makes
PD> them slightly different.


Re: [Rd] Rd parsing

2009-05-05 Thread Duncan Murdoch

On 05/05/2009 5:04 AM, robin hankin wrote:

Hi Duncan

I asked this question on R-devel, and there was no reply, so I thought
I'd ask you directly. Any ideas?

best wishes

Robin




I am having difficulty including a LaTeX formula in an Rd
file.

The example given in section 2.7 in 'Parsing Rd files' is:


\deqn{ f(x) = \left\{
  \begin{array}{ll}
  0 & x<0 \\
  1 & x\ge 0
  \end{array}
  \right. }{non latex}


For me, this gives:

\deqn{ f(x) = \left\{
\begin{array}{ll}
0 \& x<0 \bsl{}
1 \& x\ge 0
\end{array}
\right. }{}

in the tex file, which is not what I want because the ampersand is escaped:
a literal '&' symbol appears in the dvi file, whereas I want the ampersand
to indicate alignment.

Also, the '\\' appears as \bsl{}, which is undesired; the resulting dvi file
(made by R CMD Rd2dvi) looks wrong.

How do I write the Rd file so as to produce non-escaped
ampersands?


I think it should work the way the docs show it, but there's a bug in the
conversion.  If you use the new converter tools::Rd2latex, it comes out as
expected.  So there's a bug in the Perl version of Rdconv, presumably in
share/perl/R/Rdconv.pm, and you might want to try to track it down and fix
it if you know Perl well enough.  An alternative might be some kludge
involving \input{}: put the equation in its own file where Rdconv will
ignore it.
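
For reference, the new converter can be called directly from R; the file
names here are just an example:

  library(tools)
  Rd2latex("man/myfun.Rd", out = "myfun.tex")  # the converter mentioned above;
                                               # it renders the \deqn example as intended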


I've cc'd R-devel on this again; maybe someone else can jump in with a 
better idea.


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Peter Dalgaard
Petr Savicky wrote:


> 
>> Notice that the discrepancy comes from sums that really are identical
>> values (in decimal arithmetic), but where the binary FP inaccuracy makes
>> them slightly different.
>>
>> [for a nice picture, continue the example with
>>
>>> tt <- table(signif(zz,7))
>>> plot(as.numeric(names(tt)),tt, type="h")
> 
> The form of this picture is not due to rounding errors. The picture may be
> obtained even within an integer arithmetic as follows.
> 
>   ss <- round(10*sleep$extra)
>   zz <- replicate(2,sum(sample(ss,10)))
>   tt <- table(zz)
>   plot(as.numeric(names(tt)),tt, type="h")

I know. The point was rather that if you are not careful with rounding,
you get some of the bars wrong (you get 2 or 3 small bars very close
to each other instead of one longer one). Computed p values from
permutation tests (as in mean(sim>=obs)) also need care for the same reason.
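
A toy illustration of the p value point, with values chosen so that the tie
is exact in decimal arithmetic but not in binary:

  obs <- 0.1 + 0.2                        # "0.3" computed one way
  sim <- c(0.3, 0.5, 0.25)                # "0.3" written directly: a tie in decimal terms
  mean(sim >= obs)                        # 1/3 -- the tie is lost to FP noise
  mean(signif(sim, 7) >= signif(obs, 7))  # 2/3 -- rounding first restores the tie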

> 
> The variation of the frequencies is due to two effects.
> 
> First, each individual value of the sum occurs with low probability, so 2


> 
> The other cause of variation of the frequencies is that even the true 
> distribution of
> the sums has a lot of local minima and maxima. 

Yes. You can actually generate the exact distribution easily using

d <- combn(sleep$extra, 10, sum)
d <- signif(d,7)
tt <- table(d)
plot(as.numeric(names(tt)),tt, type="h")

and if you omit the signif() bit (not with R-devel):

> table(table(names(table(d))))

  1   2   3
137 161  17

i.e. 315 distinct values but over half occur in duplicate or triplicate
versions.


-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907



Re: [Rd] unexpected behavior of rpart 3.1-43, loss matrix

2009-05-05 Thread Liaw, Andy
Just expressing MHO:  The algorithm cannot give predictions in classes
that never appear in the training data, so any entries in the loss
matrix related to such classes are irrelevant w.r.t. the training data.
They should be removed before feeding the data to rpart (or any other
algorithm that can make use of a loss matrix).  As I see it, it's the
responsibility of the data analyst to take care of such things.  The
current error message may not make it obvious what the problem is, but
if I were the developer, I would not write the code to accept such
disparate input without issuing an error.
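
A sketch of that kind of pre-processing, using the data from the quoted
example below (the 6x6 loss matrix here is made up; any valid one with a
zero diagonal would do):

  library(rpart)
  df   <- data.frame(attr = 1:5, class = factor(c(2,3,1,5,3), levels = 1:6))
  loss <- matrix(1, 6, 6); diag(loss) <- 0        # loss for the full set of 6 levels
  present  <- sort(unique(as.integer(df$class)))  # level indices that actually occur: 1 2 3 5
  df$class <- factor(df$class)                    # refactor: drops the unused levels 4 and 6
  rpart(class ~ attr, data = df,
        parms = list(loss = loss[present, present]))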

Andy 

> -Original Message-
> From: r-devel-boun...@r-project.org 
> [mailto:r-devel-boun...@r-project.org] On Behalf Of Lars
> Sent: Thursday, April 30, 2009 12:43 PM
> To: r-devel@r-project.org
> Subject: [Rd] unexpected behavior of rpart 3.1-43, loss matrix
> 
> Hi,
> 
> I just noticed that rpart behaves unexpectedly when performing
> classification learning and specifying a loss matrix.
> If the response variable y is a factor and not all levels of the
> factor occur in the observations, rpart exits with an error:
> 
> 
> > df=data.frame(attr=1:5,class=factor(c(2,3,1,5,3),levels=1:6))
> > rpart(class~attr,df,parms=list(loss=matrix(0,6,6)))
> Error in (get(paste("rpart", method, sep = ".")))(Y, offset, parms, wt) :
>   Wrong length for loss matrix
> 
> 
> Note that while the levels of the factor range over 1:6, in the
> concrete observation data only levels 1, 2, 3, and 5 occur.
> 
> the error is caused by the code of rpart.class:
> 
>  fy <- as.factor(y)
>  y <- as.integer(fy)
>  numclass <- max(y[!is.na(y)])
> ...
> 
> temp2 <- parms$loss
> if (length(temp2) != numclass^2)
>   stop("Wrong length for loss matrix")
> 
> 
> For the example, numclass is set to 5 instead of 6.
> 
> 
> While for that small example it may be debatable whether or not
> numclass should be 6, consider a set of data for which the response
> variable has a certain range. Then it may be the case that for some
> data, not all levels of the response variable occur. At the same
> time, it is desirable to use the same loss matrix when training a
> decision tree from the data.
> 
> 
> Having said that, I am very happy with the rpart package and with its
> high configurability.
> 
> best regards
> lars
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 



Re: [Rd] script window background and text color (PR#13446)

2009-05-05 Thread Michal Bojanowski
 4dscape.com> writes:
> Full_Name: alexios galanos
> Version: 2.8.1
> OS: windows/vista
> Submission from: (NULL) (81.100.160.71)
> 
> While the script editor now respects user preferences for the background color
> in 2.8.1, it does not do so for the user text color, which defaults to black. So my
> preference of having, for example, a black background with green text fails (it is
> all black) in the script editor (the console is ok).

I believe this is not a bug, at least not in R 2.9.0. I did not check it
in 2.8.1 to which you refer. Perhaps it is just the lack of
documentation: I could not find any documentation for the options below.

The colors in the Windows GUI console, pager, data editor and syntax
editor can be set under Windows both using the GUI and with appropriate
entries in Rprofile. 

In Rprofile you can use settings:

For the console:
background
normaltext
usertext

For pager:
pagerbg
pagertext
highlight

Data editor:
dataeditbg
dataedittext
dataedituser

Syntax editor:
editorbg
editortext

For example you can get black background and green text in the syntax
editor by adding the following to your Rprofile:

editorbg = black
editortext = green

I'm not sure whether all but the initial three were available in earlier
versions of R. Perhaps they were but they inherited the values from the
first three...

Perhaps this should be documented somewhere? In the Windows Rd file for
file.show, data.entry?

Best,
Michal

___
Michal Bojanowski
Department of Sociology, Utrecht University
m.j.bojanow...@uu.nl
http://bojan.3e.pl/weblog



Re: [Rd] suggestion for extending ?as.factor

2009-05-05 Thread Petr Savicky
On Tue, May 05, 2009 at 11:27:36AM +0200, Peter Dalgaard wrote:
> I know. The point was rather that if you are not careful with rounding,
> you get some of the bars wrong (you get 2 or 3 small bars very close
> to each other instead of one longer one). Computed p values from
> permutation tests (as in mean(sim>=obs)) also need care for the same reason.

OK. Now I understand the point of the example. I think that it is
the responsibility of the user to find the right way to eliminate the
influence of the rounding errors, since this may require a different
approach in different situations. However, I can also accept the
point of view that as.factor() should do this to some extent by default.

For example, we may require that as.factor() is consistent with
as.character() in the way it maps different numbers to the same
string.

At first glance, one could expect that, to implement this, it is
sufficient for as.factor(x) to perform
  x <- as.numeric(as.character(x))
  levels <- as.character(sort(unique(x)))

Unfortunately, on some platforms (tested on Intel with SSE, R-2.10.0,
2009-05-02 r48453), this may produce repeated levels.

  x <- c(0.6807853176681814000304, 0.6807853176681809559412)
  x <- as.numeric(as.character(x))
  levels <- as.character(sort(unique(x)))
  levels # "0.68078531766818" "0.68078531766818"
  levels[1] == levels[2] # TRUE

Using the default Intel arithmetic, we get a single level, namely 
"0.680785317668181".

Petr.
