[Rd] opening graphics device with a given number

2010-11-13 Thread Petr Savicky
I am preparing a script that uses three graphics windows for different
types of plots. Before each plot, the device can be chosen by dev.set(k)
for k = 2, 3, 4. The function dev.set(k) does not create the required
device if it is not open already. So, instead, I use dev.add(k),
defined as

  dev.add <- function(k)
  {
      stopifnot(k >= 2 && k <= 63)
      while (all(dev.list() != k)) {
          dev.new()
      }
      dev.set(k)
  }

If device k is not open, then all the devices with numbers
smaller than k, which were not open already, are created as well.
Usually, I use an interval of device numbers, but sometimes
it could be useful to open, say, device 4, even if device 3
is not needed. Is there a way to open a device with a
given number without opening the devices with smaller numbers?

Petr Savicky.



Re: [Rd] opening graphics device with a given number

2010-11-13 Thread Petr Savicky
On Sat, Nov 13, 2010 at 05:09:56AM -0800, Henrik Bengtsson wrote:
> See devSet() in R.utils, e.g.
> 
> > library("R.utils");
> > graphics.off();
> > devSet(6);
> > devSet("myFigure");  # Also possible
> > print(devList());
> myFigure Device 6 
>        2        6

The function devSet() works fine and I will use it. Thank you
very much for this information.

Petr Savicky.



Re: [Rd] Wait for user input with readline()

2010-12-07 Thread Petr Savicky
On Mon, Dec 06, 2010 at 08:25:39AM -0800, Alexandre wrote:
> 
> Hi,
> 
> I have a similar problem to the one of Nate. The point is that I want to
> design an interactive script that needs the values of two variables (x and y).
> 
> So my script, as designed for the moment, is:
> 
> x <- as.numeric (readline(prompt="What is the value of x? "))
> y <- as.numeric (readline(prompt="What is the value of y? "))
> 
> x
> y 
> 
> But the problem is that if I run this script, the values returned for x and y
> will be "NA", as you can see below:

How do you call your code? The function readline() does not wait for
user input in a non-interactive session, for example under R CMD BATCH
or Rscript.

Another situation where readline() does not wait is when you copy
a block of code and paste it into a running session, even an
interactive one. If the readline() call is not the last line of the
pasted code, then the next line of code is consumed as the user input.
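
A minimal sketch of a guard for this situation (the helper name and the
default values are only illustrative): wrapping readline() in a check of
interactive() lets the script still run under Rscript.

  read.num <- function(prompt, default) {
      if (interactive()) {
          as.numeric(readline(prompt = prompt))
      } else {
          default  # readline() cannot wait here, so use a fallback value
      }
  }
  x <- read.num("What is the value of x? ", 1)
  y <- read.num("What is the value of y? ", 2)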

Petr Savicky.



Re: [Rd] print(...,digits=2) behavior

2011-02-07 Thread Petr Savicky
On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:
> >>>>> Ben Bolker 
> >>>>> on Sat, 5 Feb 2011 15:58:09 -0500 writes:
> 
> >   A bug was recently posted to the R bug database (which
> > probably would better have been posted as a query here) as
> > to why this happens:
> 
> >> print(7.921,digits=2)
> > [1] 8
> >> print(7.92,digits=2)
> > [1] 7.9
> 
> >   Two things I *haven't* done to help make sense of this
> > for myself are (1) writing out the binary representations
> > to see if something obvious pops out about why this would
> > be a breakpoint and (2) poking in the source code (I did a
> > little bit of this but gave up).
> 
> >   I know that confusion over rounding etc. is very common,
> > and my first reaction to this sort of question is always
> > that there must be some sort of confusion (either (1) in a
> > FAQ 7.31-ish sort of way that floating point values have
> > finite precision in the first place, or (2) a confusion
> > over the difference between the value and the
> > representation of the number, or (3) more subtly, about
> > the differences among various choices of rounding
> > conventions).
> 
> >   However, in this case I am a bit stumped: I don't see
> > that any of the standard confusions apply.  I grepped the
> > R manuals for "rounding" and didn't find anything useful
> > ...  I have a very strong prior that such a core part of R
> > must be correct, and that therefore I (and the original
> > bug reporter) must be misunderstanding something.
> 
> >   Thoughts/references?
> 
> I had started to delve into this after you've posted the bug
> report. It is clearly a bug(let),
> caused by code that has been in  R  from its very
> beginning, at least in the first source code I've seen in 1994.
> 
> The problem is not related to digits=2,
> but using such a low number of digits shows it more
> dramatically, e.g., also
> 
>  > print(5.9994, digits=4)
>  [1] 5.999
>  > print(5.9994001, digits=4)
>  [1] 6

I think that this may be a consequence of the following
analysis, which I did some time ago using the source code
of the scientific() function in src/main/format.c. The
conclusion was the following.

Printing to k digits looks for the smallest number of
digits l <= k so that the relative error of the printed
mantissa (significand) is at most 10^-k. If the mantissa
begins with a digit less than 5, then this condition is
stronger than having absolute error at most 5*10^-k. So,
in this case, we cannot get lower accuracy than rounding
to k digits.

If the mantissa begins with 5 or more, then having relative
error at most 10^-k is a weaker condition than having absolute
error at most 5*10^-k. In this case, the chosen l may be
smaller than k even if the printed number has larger error
than the number rounded to k digits.
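
The rule may be written as a short R sketch (an illustration of this
analysis, not the actual C code):

  ## smallest l <= digits such that rounding to l significant digits
  ## keeps the relative error within 10^-digits
  nsig <- function(x, digits) {
      for (l in 1:digits) {
          if (abs(signif(x, l) - x) / abs(x) <= 10^-digits) return(l)
      }
      digits
  }
  nsig(7.921, 2)  # 1, so print(7.921, digits=2) gives 8
  nsig(7.92, 2)   # 2, so print(7.92, digits=2) gives 7.9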

This may be seen in the following example

  xx <- 8 + (6:10)*10^-7
  for (x in xx) print(x)

  [1] 8
  [1] 8
  [1] 8
  [1] 8.000001
  [1] 8.000001

where all the printed numbers are 8.000001 if rounded
to 7 digits.

The cases 

  print(7.921,digits=2)
  [1] 8

  print(7.92,digits=2)
  [1] 7.9

seem to have the same explanation. The relative errors of the
numbers rounded to a single digit are

  (8 - 7.921)/7.921
  [1] 0.009973488

  (8 - 7.92)/7.92
  [1] 0.01010101

In the first case, this is less than 10^-2 and so l = 1
is used. In the second case, the relative error for l = 1
is larger than 10^-2 and so l = 2 is chosen.

In the cases

  print(5.9994, digits=4)
  [1] 5.999
  print(5.9994001, digits=4)
  [1] 6

the relative errors of one digit output are

  (6 - 5.9994)/5.9994
  [1] 0.00010001
  
  (6 - 5.9994001)/5.9994001
  [1] 9.999333e-05

Here, one digit output is chosen if the relative error is
less than 10^-4 and not otherwise.

Petr Savicky.



Re: [Rd] print(...,digits=2) behavior

2011-02-09 Thread Petr Savicky
On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:
> >>>>> Ben Bolker 
> >>>>> on Sat, 5 Feb 2011 15:58:09 -0500 writes:
> 
> >   A bug was recently posted to the R bug database (which
> > probably would better have been posted as a query here) as
> > to why this happens:
> 
> >> print(7.921,digits=2)
> > [1] 8
> >> print(7.92,digits=2)
> > [1] 7.9
> 
[...]

> I had started to delve into this after you've posted the bug
> report. It is clearly a bug(let),
> caused by code that has been in  R  from its very
> beginning, at least in the first source code I've seen in 1994.
> 
> The problem is not related to digits=2,
> but using such a low number of digits shows it more
> dramatically, e.g., also
> 
>  > print(5.9994, digits=4)
>  [1] 5.999
>  > print(5.9994001, digits=4)
>  [1] 6
> 
> Interestingly, the problem seems *not* to be present for
> digits = 1 (only).
> 
> I haven't found time to mathematically "analyze" it for
> determining a correct solution though.
> Note that fixing this bug(let) will probably (very slightly)
> change a huge number of R outputs .. so there is a caveat,
> but nonetheless, we must approach it.
> 
> The responsible code is the C function  scientific()
> in src/main/format.c 

I inspected the source of scientific() and formatReal() (2.13.0, revision
2011-02-08 r54284). Let me point out an example of the difference between
the output of print() and rounding to "digits" significant digits, which
is slightly different from the previous ones, since the exponent also
changes. Namely,

  print(9.991, digits=3)
  [1] 10

while rounding to 3 digits yields 9.99. The reason is in scientific(),
where the situation that rounding increases the exponent is tested using

    /* make sure that alpha is in [1,10) AFTER rounding */

    if (10.0 - alpha < eps*alpha) {
        alpha /= 10.0;
        kp += 1;
    }

Here, eps is determined in formatReal() as

double eps = pow(10.0, -(double)R_print.digits);

so we have eps = 10^-digits. The above condition on alpha is equivalent to

  alpha > 10.0/(1 + 10^-digits)

For digits=3, this is

  alpha > 9.99000999000999

This bound may be verified as follows

  print(9.9900099900, digits=3)
  [1] 9.99

  print(9.9900099901, digits=3)
  [1] 10

The existing algorithm for choosing the number of digits is designed to
predict the format suitable for all numbers in a vector before the actual
call of sprintf() or snprintf(). For speed, this algorithm should use
the standard double precision, so it is necessarily inaccurate for precision
15 and more and there may be some rare such cases also for smaller
precisions. For smaller precisions, say below 7, the algorithm can be made
more precise. This would change the output in rare cases and mainly for
printing single numbers. If a vector is printed, then the format is typically
not determined by the problematic numbers.

Changing the default behavior may be undesirable for backward compatibility
reasons. If this is the case, then a possible solution is to make the
documentation more precise on this and include pointers to possible
solutions. The functions sprintf(), formatC() and signif() may be used. In
particular, if signif() is used to round the numbers before printing, then we
get the correct output

  print(signif(7.921, digits=2))  
  [1] 7.9

  print(signif(9.9900099901, digits=3))
  [1] 9.99

The current ?print.default contains

  digits: a non-null value for ‘digits’ specifies the minimum number of
  significant digits to be printed in values.

The word "minimum" here means that all numbers in the vector will have
at least the chosen number of digits, but some may have more. I suggest
to add "See 'options' for more detail.".

The current ?options contains

‘digits’: controls the number of digits to print when printing
  numeric values.  It is a suggestion only.

I suggest extending this to

   It is a suggestion only and the actual number of printed digits
   may be smaller, if the relative error of the output number is
   less than 10^-digits. Use 'signif(, digits)' before printing to get
   the exact number of the printed significant digits.

I would appreciate the opinion of R developers on this.

Petr Savicky.



Re: [Rd] R 2.12.1 Windows 32bit and 64bit - are numerical differences expected?

2011-02-10 Thread Petr Savicky
On Thu, Feb 10, 2011 at 10:37:09PM +1100, Graham Williams wrote:
> Should one expect minor numerical differences between 64bit and 32bit R on
> Windows? Hunting around the lists I've not been able to find a definitive
> answer yet. Seems plausible using different precision arithmetic, but wanted
> to confirm from those who might know for sure.

One of the sources of differences between platforms is different
settings of the compiler. On Intel processors, the options may influence
whether the registers use an 80-bit or a 64-bit representation of floating
point numbers. In memory, the representation is always 64-bit. Testing
whether there is a difference between registers and memory may be done,
for example, using the code

  #include <stdio.h>
  #define n 3
  int main(int argc, char *argv[])
  {
      double x[n];
      int i;
      /* the loop bodies were lost in the archive; this reconstruction
         compares values stored to memory with the same values recomputed,
         i.e. possibly kept in 80-bit registers */
      for (i = 0; i < n; i++)
          x[i] = 1.0/(i + 3);
      for (i = 0; i < n; i++)
          if (x[i] != 1.0/(i + 3))
              printf("x[%d] differs\n", i);
      return 0;
  }

Petr Savicky.


Re: [Rd] print(...,digits=2) behavior

2011-02-13 Thread Petr Savicky
On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:
[...]
> Namely, one important purpose of the test is to ensure that e.g.
> 
>   print(3.597, digits = 3)
> 
> is printed as  3.6   and not 3.60
> 
> Now I have had -- since 1997 at least -- an R version of
> scientific() for more easy experimentation.
> Here's an updated version of that:

[...]
> 
> Now, I'm waiting for comments/suggestions ..

In a previous email, I discussed the case print(9.991, digits=3).
This case is already handled in your R-level experimental function
scientific(). I noticed this only after sending the previous email.

I would like to suggest a modification of the algorithm. The purpose
is to determine the minimum number of digits such that the printed number
has the same value as the number rounded to exactly getOption("digits")
digits, but a shorter representation.

Algorithm. First, the mantissa of the number rounded to "digits" digits
is computed as an integer by the formula

  a <- floor(alpha*10^(digits - 1) + 0.5)

The obtained integer number a satisfies

  10^(digits-1) <= a <= 10^digits

The case a = 10^digits corresponds to the situation that we have to
increase the exponent. This may be tested either before the remaining
part of the code or as the last step as below.

For example, if alpha = 3.597 and digits = 3, we get a = 360. The
question, whether (digits - 1) digits are sufficient for printing the
number, is equivalent to the condition that "a" is divisible by 10.
Similarly, (digits - 2) digits are sufficient if and only if "a" is
divisible by 100. This may be tested using the following code

    nsig <- digits
    for (j in 1:nsig) {
        a <- a / 10
        if (a == floor(a)) {
            nsig <- nsig - 1
        } else {
            break
        }
    }
    ## nsig == 0 if and only if we started with a == 10^digits
    if (nsig == 0) {
        nsig <- 1
        kpower <- kpower + 1
    }

This code uses double precision for a, since values up to 10^digits
may occur and digits may be 15 or more. The algorithm is not exact
for digits = 15, but works reasonably. I suggest using a different,
slower and more accurate algorithm if getOption("digits") >= 16.

Please find in the attachment two versions of the above algorithm. One
is a modification of your R-level function from scientific.R and the
other is a patch against src/main/format.c, which I tested.

I suggest to consider the following test cases. The presented output
is obtained using the patch.

  example1 <- c(
  7.94,
  7.950001,
  8.04,
  8.050001)
  for (x in example1) print(x, digits=2)
  
  [1] 7.9
  [1] 8
  [1] 8
  [1] 8.1
  
  example2 <- c(
  3.599499,
  3.599501,
  3.600499,
  3.600501)
  for (x in example2) print(x, digits=4)
  
  [1] 3.599
  [1] 3.6
  [1] 3.6
  [1] 3.601
  
  example3 <- c(
  1.004999,
  1.005001)
  for (x in example3) print(x, digits=7)
  
  [1] 1
  [1] 1.01
  
  example4 <- c(
  9.994999,
  9.995001)
  for (x in example4) print(x, digits=7)
  
  [1] 9.99
  [1] 10

I appreciate comments.

Petr Savicky.

###--- R function that does the same as 'scientific' in

###--- /usr/local/R-0.50/src/main/format.c
###--- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

scientific1 <- function(x, digits = getOption('digits')) ## eps = 10^-(digits)
{
  ##-   /* for 1 number  x , return
  ##-*  sgn= 1_{x < 0}  {0/1}
  ##-*  kpower = Exponent of 10;
  ##-*  nsig   = min(digits, #{significant digits of alpha})
  ##-*
  ##-* where  |x| = alpha * 10^kpower   and  1 <= alpha < 10
  ##-*/

  eps <- 10 ^(-digits)

  if (x == 0) {
kpower <- 0
nsig <- 1
sgn <- 0
  } else {
if(x < 0) {
  sgn <- 1; r <- -x
} else {
  sgn <- 0; r <- x
}
kpower <- floor(log10(r));##-->  r = |x| ;  10^k <= r

if (kpower <= -308) ## close to denormalization -- added for R 2.x.y
alpha <- (r * 1e+30) / 10^(kpower+30)
else
alpha <- r / 10^kpower

## "a" integer, 10^(digits-1) <= a <= 10^digits

    a <- floor(alpha*10^(digits - 1) + 0.5)
    nsig <- digits
    for (j in 1:nsig) {
        a <- a / 10
        if (a == floor(a)) {
            nsig <- nsig - 1
        } else {
            break
        }
    }
    if (nsig == 0) {
        nsig <- 1
        kpower <- kpower + 1
    }
  }
  left <- kpower + 1
  c(sgn = sgn, kpower = kpower, nsig = nsig,
left = left, right = nsig - left,
sleft = sgn + max(1,left))
}

diff --minimal -U 5 -r R-devel/src/main/format.c R-devel-print/src/main/format.c
--- R-devel/src/main/format.c   2010-04-27 17:52:24.0 +0200
+++ R-devel-print/src/main/format.c 201

Re: [Rd] R command line and pipe using in Linux?

2011-02-14 Thread Petr Savicky
On Mon, Feb 14, 2011 at 05:40:29PM +, Hang PHAN wrote:
> Hi,
> I have a very large data file (GB) from which I only want to extract one
> column to draw a histogram. This would be done several times, so I would like
> to ask if there is any way to plot this using R from the Linux command line,
> something that looks like this
> 
> cut -f1 xxx.txt | RplotHist

Hi Hang:

Can you use something like the following?

  x <- as.numeric(system("cut -f1 xxx.txt", intern=TRUE))

According to ?system, long lines will be split; however, no limit
on the number of lines of the output is stated there.
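
If a standalone command is preferred, a small Rscript wrapper can read
the piped column from stdin (a sketch; the script and output file names
are only illustrative):

  #!/usr/bin/env Rscript
  ## usage: cut -f1 xxx.txt | Rscript RplotHist.R
  x <- as.numeric(readLines(file("stdin")))
  png("hist.png")
  hist(x)
  dev.off()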

Petr Savicky.



[Rd] error in memCompress() in make check

2011-02-22 Thread Petr Savicky
Dear R developers:

When I run "make check" in R-patched_2011-02-12 and R-devel_2011-02-22
on a specific machine, the test fails and the file

  tests/Examples/base-Ex.Rout.fail

ends with

  > txt.xz <- memCompress(txt, "x")
  Error in memCompress(txt, "x") : internal error 5 in memCompress
  Execution halted

The error may be reproduced using commands

  txt <- readLines(file.path(R.home("doc"), "COPYING"))
  txt.xz <- memCompress(txt, "x")

in both the above installed versions. The machine is CentOS
release 5.4 (Final) under VMware.

I did not observe this error on other machines, some of which are
also CentOS.

For the development version, sessionInfo() is

  R version 2.13.0 Under development (unstable) (2011-02-22 r54523)
  Platform: i686-pc-linux-gnu (32-bit)
  
  locale:
   [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C  
   [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=C  LC_MESSAGES=en_US.UTF-8   
   [7] LC_PAPER=en_US.UTF-8   LC_NAME=C 
   [9] LC_ADDRESS=C   LC_TELEPHONE=C
  [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C   
  
  attached base packages:
  [1] stats graphics  grDevices utils datasets  methods   base     

gcc --version

  gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46)

Petr Savicky.



Re: [Rd] error in memCompress() in make check

2011-02-22 Thread Petr Savicky
On Tue, Feb 22, 2011 at 11:25:50AM +, Prof Brian Ripley wrote:
> So it seems that there is something wrong with the liblzma library 
> used on that machine.  Did it use the version supplied with R or an 
> external library (which is the default if one is found)?  My first 
> step would be to force the internal version via --with-system-xz=no.

Thank you for your reply.

I tried 

  ./configure --with-x=no --with-system-xz=no

in a clean R-devel_2011-02-22 and the result of make check is the same.

The commands

  txt <- readLines(file.path(R.home("doc"), "COPYING"))
  txt.xz <- memCompress(txt, "x")

do not produce an error if the compiled R runs in the same shell
where "make check" was run. However, they produce the error if R is
started in a new shell.

The command

  find /usr -name "liblzma*"

has empty output.

Petr Savicky.



Re: [Rd] error in memCompress() in make check

2011-02-22 Thread Petr Savicky
On Tue, Feb 22, 2011 at 01:38:07PM +0100, Petr Savicky wrote:
...
> The commands
> 
>   txt <- readLines(file.path(R.home("doc"), "COPYING"))
>   txt.xz <- memCompress(txt, "x")
> 
> do not produce an error, if the compiled R runs in the same shell,
> where "make check" was run. However, they produce the error, if R is
> started in a new shell.

Although I did see the above two lines produce no error on my screen, the
change of the behavior is not reproducible. I am sorry; I probably
mixed up windows or something.

According to a repeated test, the above two lines produce the error

  Error in memCompress(txt, "x") : internal error 5 in memCompress

on the machine described in the first email, even if --with-system-xz=no
was used for configuration.

Petr Savicky.



Re: [Rd] Anomaly with unique and match

2011-03-09 Thread Petr Savicky
On Wed, Mar 09, 2011 at 08:48:10AM -0600, Terry Therneau wrote:
> I stumbled onto this working on an update to coxph.  The last 6 lines
> below are the question, the rest create a test data set.
> 
> tmt585% R
> R version 2.12.2 (2011-02-25)
> Copyright (C) 2011 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> # Lines of code from survival/tests/singtest.R
> > library(survival)
> Loading required package: splines
> > test1 <- data.frame(time=  c(4, 3,1,1,2,2,3),
> + status=c(1,NA,1,0,1,1,0),
> + x= c(0, 2,1,1,1,0,0))
> > 
> > temp <- rep(0:3, rep(7,4))
> > 
> > stest <- data.frame(start  = 10*temp,
> + stop   = 10*temp + test1$time,
> + status = rep(test1$status,4),
> + x  = c(test1$x+ 1:7, rep(test1$x,3)),
> + epoch  = rep(1:4, rep(7,4)))
> > 
> > fit1 <- coxph(Surv(start, stop, status) ~ x * factor(epoch), stest)
> 
> ## New lines
> > temp1 <- fit1$linear.predictor
> > temp2 <- as.matrix(temp1)
> > match(temp1, unique(temp1))
>  [1] 1 2 3 4 4 5 6 7 7 7 6 6 6 8 8 8 6 6 6 9 9 9 6 6
> > match(temp2, unique(temp2))
>  [1]  1  2  3  4  4  5  6  7  7  7  6  6  6 NA NA NA  6  6  6  8  8  8  6  6
> 
> ---
> 
>  I've solved it for my code by not calling match on a 1 column vector.  
>  In general, however, should I be using some other paradigm for this "map
> to unique" operation?  For example match(as.character(x),
> unique(as.character(x))) ?

Let me suggest an alternative, which is consistent with unique() on
numeric vectors and uses a transformation of the column using rank().
For example,

  temp3 <- as.matrix(rank(temp1, ties.method="max"))
  match(temp3, unique(temp3))

  [1] 1 2 3 4 4 5 6 7 7 7 6 6 6 8 8 8 6 6 6 9 9 9 6 6

Can this be used in your code?

Petr Savicky.



Re: [Rd] unique.matrix issue [Was: Anomaly with unique and match]

2011-03-10 Thread Petr Savicky
On Wed, Mar 09, 2011 at 02:11:49PM -0500, Simon Urbanek wrote:
> match() is a red herring here -- it is really a very specific thing that has 
> to do with the fact that you're running unique() on a matrix. Also it's much 
> easier to reproduce:
> 
> > x=c(1,1+0.2e-15)
> > x
> [1] 1 1
> > sprintf("%a",x)
> [1] "0x1p+0"   "0x1.1p+0"
> > unique(x)
> [1] 1 1
> > sprintf("%a",unique(x))
> [1] "0x1p+0"   "0x1.1p+0"
> > unique(matrix(x,2))
>  [,1]
> [1,]1
>  
> and this comes from the fact that unique.matrix uses string representation 
> since it has to take into account all values of a row/column so it pastes all 
> values into one string, but for the two numbers that is the same:
> > as.character(x)
> [1] "1" "1"

I understand the use of match() in the original message by Terry Therneau
as an example of a situation where the behavior of unique.matrix() becomes
visible even without looking at the last bits of the numbers.

Let me suggest to consider the following example.

  x <- 1 + c(1.1, 1.3, 1.7, 1.9)*1e-14
  a <- cbind(rep(x, each=2), 2)
  rownames(a) <- 1:nrow(a)

The correct set of rows may be obtained using

  unique(a - 1)

            [,1] [,2]
  1 1.110223e-14    1
  3 1.310063e-14    1
  5 1.709743e-14    1
  7 1.909584e-14    1

However, due to the use of paste(), which uses as.character(), in
unique.matrix(), we also have

  unique(a)

    [,1] [,2]
  1    1    2
  5    1    2

Let me suggest considering a transformation of the numeric columns
by rank() before the use of paste(). For example

  unique.mat <- function(a)
  {
      temp <- apply(a, 2, rank, ties.method="max")
      temp <- apply(temp, 1, function(x) paste(x, collapse = "\r"))
      a[!duplicated(temp), , drop=FALSE]
  }

  unique.mat(a)

    [,1] [,2]
  1    1    2
  3    1    2
  5    1    2
  7    1    2

Petr Savicky.



Re: [Rd] unique.matrix issue [Was: Anomaly with unique and match]

2011-03-12 Thread Petr Savicky
On Thu, Mar 10, 2011 at 01:19:48AM -0800, Henrik Bengtsson wrote:
> It should be possible to run unique()/duplicated() column by column
> and incrementally update the set of unique/duplicated rows.  This
> would avoid any coercing.  The benefit should be even greater for
> data.frame():s.

This is a good point. An implementation of this using sorting can
be done as follows

  Sort the data frame using function order().
  Determine the groups of consecutive equal rows in the sorted df.
  Map the first row of each group to the original order of the rows.
  Since sorting by the function order() is stable, we obtain the first
  in each group of equal rows also in the original order.

The coercion approach uses hashing for string comparison, but the
efficiency of hashing seems to be outweighed by the inefficiency
of the coercion. So, we get the following comparison.

  # sample size reconstructed (the original value was truncated in the archive)
  a <- matrix(sample(c(1234, 5678), 12*1e5, replace=TRUE), ncol=12)
  df <- data.frame(a)
  
  do.unique.sort <- function(df)
  {
i <- do.call(order, df)
n <- nrow(df)
u <- c(TRUE, rowSums(df[i[2:n], ] == df[i[1:(n-1)], ]) < ncol(df))
df[u[order(i)], ]
  }

  system.time(out1 <- do.unique.sort(df))
  system.time(out2 <- unique(df))
  identical(out1, out2)

The result may be, for example

 user  system elapsed 
0.279   0.000   0.273 
 user  system elapsed 
0.514   0.000   0.468 
  [1] TRUE

On another computer

 user  system elapsed 
0.058   0.000   0.058 
 user  system elapsed 
0.187   0.000   0.188 
  [1] TRUE

On Thu, Mar 10, 2011 at 01:39:56PM -0600, Terry Therneau wrote:
> Simon pointed out that the issue I observed was due to internal
> behaviour of unique.matrix.
> 
>   I had looked carefully at the manual pages before posting the question
> and this was not mentioned.  Perhaps an addition could be made?

According to the description of unique(), the user may expect that if
b is obtained using

  b <- unique(a)

then for every "i" there is "j", such that

  all(a[i, ] == b[j, ])

This is usually true, but not always, because among several numbers
in "a" with the same as.character() only one remains in "b". If this
is intended, then I support the suggestion to include a note in the
documentation.

Let me add an argument against using as.character() to determine
whether two numbers are close. The maximum relative difference between
numbers that have the same 15-digit decimal representation varies
by a factor of up to 10 in different ranges. Due to this, we have

  x <- 1 + c(1.1, 1.3, 1.7, 1.9)*1e-14

  unique(as.character(x))
  [1] "1.00000000000001" "1.00000000000002"

  unique(as.character(9*x))
  [1] "9.0000000000001"  "9.00000000000012" "9.00000000000015"
  [4] "9.00000000000017"

The relative differences between the components of 9*x are the same as the
relative differences in x, but if the mantissa begins with 9, then
a smaller relative difference is sufficient to change the 15th digit.

In terms of unique(), this implies

  nrow(unique(cbind(x)))
  [1] 2

  nrow(unique(cbind(9*x)))
  [1] 4
  
Petr Savicky.



Re: [Rd] round, unique and factor

2011-03-21 Thread Petr Savicky
On Mon, Mar 21, 2011 at 10:46:41AM -0500, Terry Therneau wrote:
>  Survfit had a bug in some prior releases due to the use of both
> unique(times) and table(times); I fixed it by rounding to 15 digits per
> the manual page for as.character.  Yes, I should ferret out all the
> usages instead, but this was fast and it cured the user's problem.
> The bug is back!  A data set from a local colleague triggers it.  
> I can send the rda file to anyone who wishes.  
> 
>  The current code has
>  digits <- floor((.Machine$double.digits) * 
> logb(.Machine$double.base,10)) #base 10 digits
>  Y[,1] <- signif(Y[,1], digits)
> 
> which gives 15 digits; should I subtract one more?
>   
> Should the documentation change?
> 
> In the meantime I'm looking at the more permanent fix of turning time
> into a factor, then back at the very end.  Because it is a bigger change
> the potential for breakage is higer, however.

Can you describe the error in more detail? Is it related to consistency
of converting a number to character and back?

Petr Savicky.



[Rd] testing presence of pdflatex in R CMD check

2011-03-25 Thread Petr Savicky
The nodes of our cluster do not have pdflatex installed. When
running R CMD check there on a package with no errors in documentation,
R-2.13.0-alpha and R-2.12.2 report a possible error in Rd files,
while R-2.11.1 did not. The platform is 64-bit CentOS.

The output of

  R CMD check tree_1.0-28.tar.gz

under the above three versions of R contains the following.

R-2.11.1

stderr

  sh: pdflatex: command not found
  sh: latex: command not found

stdout

  * checking for working pdflatex ... NO

R-2.12.2

stderr

  sh: pdflatex: command not found
  Error in texi2dvi("Rd2.tex", pdf = (out_ext == "pdf"), quiet = FALSE,  :
unable to run 'pdflatex' on 'Rd2.tex'
  Error in running tools::texi2dvi

stdout

  * checking PDF version of manual ... WARNING
  LaTeX errors when creating PDF version.
  This typically indicates Rd problems.
  * checking PDF version of manual without hyperrefs or index ... ERROR
  Re-running with no redirection of stdout/stderr.
  Hmm ... looks like a package
  You may want to clean up by 'rm -rf /tmp/RtmpgnWKRW/Rd2pdf3723367e'

2.13.0-alpha

stderr

  Error in texi2dvi("Rd2.tex", pdf = (out_ext == "pdf"), quiet = FALSE,  :
pdflatex is not available
  Error in running tools::texi2dvi

stdout

  * checking PDF version of manual ... WARNING
  LaTeX errors when creating PDF version.
  This typically indicates Rd problems.
  * checking PDF version of manual without hyperrefs or index ... ERROR
  Re-running with no redirection of stdout/stderr.
  Hmm ... looks like a package
  You may want to clean up by 'rm -rf /tmp/RtmpQ0WawT/Rd2pdf41481e1'

Is it intentional not to test the presence of pdflatex during R CMD check?

Petr Savicky.



Re: [Rd] duplicates() function

2011-04-09 Thread Petr Savicky
On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
> I need a function which is similar to duplicated(), but instead of 
> returning TRUE/FALSE, returns indices of which element was duplicated.  
> That is,
> 
> > x <- c(9,7,9,3,7)
> > duplicated(x)
> [1] FALSE FALSE  TRUE FALSE TRUE
> 
> > duplicates(x)
> [1] NA NA  1 NA  2
> 
> (so that I know that element 3 is a duplicate of element 1, and element 
> 5 is a duplicate of element 2, whereas the others were not duplicated 
> according to our definition.)
> 
> Is there a simple way to write this function?

A possible strategy is to use sorting. In a sorted matrix
or data frame, the elements that are duplicates of the
same element form consecutive blocks. These blocks may
be identified using !duplicated(), which determines the
first elements of these blocks. Since sorting is stable,
when we map these blocks back to the original order, the
first element of each block is mapped to the first occurrence
of the given row in the original order.
An implementation may be done as follows.

  duplicates <- function(dat)
  {
      s <- do.call("order", as.data.frame(dat))
      non.dup <- !duplicated(dat[s, ])
      orig.ind <- s[non.dup]
      first.occ <- orig.ind[cumsum(non.dup)]
      first.occ[non.dup] <- NA
      first.occ[order(s)]
  }
 
  x <- cbind(1, c(9,7,9,3,7))
  duplicates(x)
  [1] NA NA  1 NA  2

The line

  orig.ind <- s[non.dup]

creates a vector, whose length is the number of non-duplicated
rows in the sorted "dat". Its components are indices of the
corresponding first occurrences of these rows in the original
order. For this, the stability of the order is needed.

The lines

  first.occ <- orig.ind[cumsum(non.dup)]
  first.occ[non.dup] <- NA

expand orig.ind to a vector, which satisfies: If i-th row of the
sorted "dat" is duplicated, then first.occ[i] is the index of the
first row in the original "dat", which is equal to this row. So, the
values in first.occ are those, which are required for the output
of duplicates(), but they are in the order of the sorted "dat". The
last line 

  first.occ[order(s)]

reorders the vector to the original order of the rows.

Petr Savicky.



Re: [Rd] duplicates() function

2011-04-12 Thread Petr Savicky
On Mon, Apr 11, 2011 at 02:05:11PM -0400, Duncan Murdoch wrote:
> On 08/04/2011 11:39 AM, Joshua Ulrich wrote:
> >On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch
> >  wrote:
> >>  On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
> >>>
> >>>  How about:
> >>>
> >>>  y<- rep(NA,length(x))
> >>>  y[duplicated(x)]<- match(x[duplicated(x)] ,x)
> >>
> >>  That's a nice solution for vectors.  Unfortunately for me, I have a 
> >matrix
> >>  (which duplicated() handles by checking whole rows).  So a better 
> >example
> >>  that I should have posted would be
> >>
> >>  x<-  cbind(1, c(9,7,9,3,7) )
> >>
> >>  and I'd still like the same output
> >>
> >For a matrix, could you apply the same strategy used in duplicated()?
> >
> >y<- rep(NA,NROW(x))
> >temp<- apply(x, 1, function(x) paste(x, collapse="\r"))
> >y[duplicated(temp)]<- match(temp[duplicated(temp)], temp)
> 
> Since this thread hasn't ended, I will say that I think this solution is 
> the best I've seen for my specific problem.  I was actually surprised 
> that duplicated() did the string concatenation trick, but since it does, 
> it makes a lot of sense to do the same in duplicates().

Consistency with duplicated() is a good argument.

Let me point out, although it goes beyond the original question, that
sorting may be used to compute duplicated() in a way that is more
efficient than the paste() approach, according to the test below.

  duplicatedSort <- function(df)
  {
  n <- nrow(df)
  if (n == 1) {
  return(FALSE)
  } else {
  s <- do.call(order, as.data.frame(df))
  equal <- df[s[2:n], , drop=FALSE] == df[s[1:(n-1)], , drop=FALSE]
  dup <- c(FALSE, rowSums(equal) == ncol(df))
  return(dup[order(s)])
  }
  }

The following tests efficiency for a character matrix.
 
  m <- 1000
  n <- 4
  a <- matrix(as.character(sample(10, m*n, replace=TRUE)), nrow=m, ncol=n)
  system.time(out1 <- duplicatedSort(a))
  system.time(out2 <- duplicated(a))
  identical(out1, out2)
  table(out1)

I obtained, for example,

 user  system elapsed 
0.003   0.000   0.003 
  
 user  system elapsed 
0.012   0.000   0.011 
  
  [1] TRUE

  out1
  FALSE  TRUE 
    942    58 

For a numeric matrix, the ratio of the running times is larger in
the same direction.

Petr Savicky.



Re: [Rd] Dangerous Bug with IF function of R

2011-04-18 Thread Petr Savicky
On Mon, Apr 18, 2011 at 09:12:41AM -0700, salmajj wrote:
> hi!
> there is a bug with the IF operator that is really dangerous!
> please try the code below and if someone could explain to me why when (q is
> equal to 0.8, 0.9 or 1) R do not print it?
> 
> q=0
> for (j in 1:11){
> 
> if ((q==1)){
> print(q)
>   }   
> q=q+0.1
> }
> 
> so in this code q is incremented from 0 to 1.1. but R do not capture print
> the value 1, 0.8 and 0.9 !! try to change (q==0.4) it gonna print 0.4. but
> if you put q==0.8 or 0.9 or 1 it doesn't work!!!
> please try it it is amazing!!!

Incrementing a number by 0.1 produces numbers that are not exactly
representable in binary, so this operation involves a rounding error.
Try the following

  q = 0
  for (j in 1:11) {
      if (q == 1) {
          print(q)
      }
      q = round(q + 0.1, digits=7)
  }
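
The size of the accumulated error can be seen directly (a small
illustration; all.equal() compares with a tolerance):

  q <- 0
  for (j in 1:10) q <- q + 0.1
  q == 1           # FALSE
  q - 1            # about -1.1e-16, the accumulated rounding error
  all.equal(q, 1)  # TRUE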

Petr Savicky.



Re: [Rd] matrix not positive definite (while it should be)

2011-05-06 Thread Petr Savicky
On Thu, May 05, 2011 at 02:31:59PM -0400, Arthur Charpentier wrote:
> I do have some trouble with matrices. I want to build up a covariance matrix
> with a hierarchical structure). For instance, in dimension n=10, I have two
> subgroups (called REGION).
> 
> NR=2; n=10
> CORRELATION=matrix(c(0.4,-0.25,
>  -0.25,0.3),NR,NR)
> REGION=sample(1:NR,size=n,replace=TRUE)
> R1=REGION%*%t(rep(1,n))
> R2=rep(1,n)%*%t(REGION)
> SIGMA=matrix(NA,n,n)
> 
> for(i in 1:NR){
> for(j in 1:NR){
> SIGMA[(R1==i)&(R2==j)]=CORRELATION[i,j]
> }}
> 
> If I run quickly some simulations, I build up the following matrix
> 
> > CORRELATION
>   [,1]  [,2]
> [1,]  0.40 -0.25
> [2,] -0.25  0.30
> > REGION
>  [1] 2 2 1 1 2 1 2 1 1 2
> > SIGMA
>[,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10]
>  [1,]  0.30  0.30 -0.25 -0.25  0.30 -0.25  0.30 -0.25 -0.25  0.30
>  [2,]  0.30  0.30 -0.25 -0.25  0.30 -0.25  0.30 -0.25 -0.25  0.30
>  [3,] -0.25 -0.25  0.40  0.40 -0.25  0.40 -0.25  0.40  0.40 -0.25
>  [4,] -0.25 -0.25  0.40  0.40 -0.25  0.40 -0.25  0.40  0.40 -0.25
>  [5,]  0.30  0.30 -0.25 -0.25  0.30 -0.25  0.30 -0.25 -0.25  0.30
>  [6,] -0.25 -0.25  0.40  0.40 -0.25  0.40 -0.25  0.40  0.40 -0.25
>  [7,]  0.30  0.30 -0.25 -0.25  0.30 -0.25  0.30 -0.25 -0.25  0.30
>  [8,] -0.25 -0.25  0.40  0.40 -0.25  0.40 -0.25  0.40  0.40 -0.25
>  [9,] -0.25 -0.25  0.40  0.40 -0.25  0.40 -0.25  0.40  0.40 -0.25
> [10,]  0.30  0.30 -0.25 -0.25  0.30 -0.25  0.30 -0.25 -0.25  0.30

Hi.

If X is a random vector from the 2-dimensional normal distribution
with the covariance matrix

         [,1]  [,2]
  [1,]  0.40 -0.25
  [2,] -0.25  0.30

then the vector X[REGION], which consists of replicated components
of X, has the expanded n times n covariance matrix that you ask
for. Since the mean and the covariance matrix determine the distribution
uniquely, this is also a description of the required distribution.

The distribution is concentrated in a 2 dimensional subspace, since
the covariance matrix has rank 2.
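
A small simulation sketch of this construction (it assumes the MASS
package for mvrnorm(); REGION is the vector from your example):

  library(MASS)
  CORRELATION <- matrix(c(0.4, -0.25, -0.25, 0.3), 2, 2)
  REGION <- c(2, 2, 1, 1, 2, 1, 2, 1, 1, 2)
  X <- mvrnorm(10000, mu = c(0, 0), Sigma = CORRELATION)
  Y <- X[, REGION]   # replicate the components according to REGION
  round(cov(Y), 2)   # reproduces SIGMA up to sampling error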

Hope this helps.

Petr Savicky.



Re: [Rd] portable parallel seeds project: request for critiques

2012-02-17 Thread Petr Savicky
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
> I've got another edition of my simulation replication framework.  I'm
> attaching 2 R files and pasting in the readme.
> 
> I would especially like to know if I'm doing anything that breaks
> .Random.seed or other things that R's parallel uses in the
> environment.
> 
> In case you don't want to wrestle with attachments, the same files are
> online in our SVN
> 
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/
> 
> 
> ## Paul E. Johnson CRMDA 
> ## Portable Parallel Seeds Project.
> ## 2012-02-18
> 
> Portable Parallel Seeds Project
> 
> This is how I'm going to recommend we work with random number seeds in
> simulations. It enhances work that requires runs with random numbers,
> whether runs are in a cluster computing environment or in a single
> workstation.
> 
> It is a solution for two separate problems.
> 
> Problem 1. I scripted up 1000 R runs and need high quality,
> unique, replicable random streams for each one. Each simulation
> runs separately, but I need to be confident their streams are
> not correlated or overlapping. For replication, I need to be able to
> select any run, say 667, and restart it exactly as it was.
> 
> Problem 2. I've written a Parallel MPI (Message Passing Interface)
> routine that launches 1000 runs and I need to assure each has
> a unique, replicatable, random stream. I need to be able to
> select any run, say 667, and restart it exactly as it was.
> 
> This project develops one approach to create replicable simulations.
> It blends ideas about seed management from John M. Chambers
> Software for Data Analysis (2008) with ideas from the snowFT
> package by Hana Sevcikova and Tony R. Rossini.
> 
> 
> Here's my proposal.
> 
> 1. Run a preliminary program to generate an array of seeds
> 
> run1:   seed1.1   seed1.2   seed1.3
> run2:   seed2.1   seed2.2   seed2.3
> run3:   seed3.1   seed3.2   seed3.3
> ...  ...   ...
> run1000   seed1000.1  seed1000.2   seed1000.3
> 
> This example provides 3 separate streams of random numbers within each
> run. Because we will use the L'Ecuyer "many separate streams"
> approach, we are confident that there is no correlation or overlap
> between any of the runs.
> 
> The projSeeds has to have one row per project, but it is not a huge
> file. I created seeds for 2000 runs of a project that requires 2 seeds
> per run.  The saved size of the file 104443kb, which is very small. By
> comparison, a 1400x1050 jpg image would usually be twice that size.
> If you save 10,000 runs-worth of seeds, the size rises to 521,993kb,
> still pretty small.
> 
> Because the seeds are saved in a file, we are sure each
> run can be replicated. We just have to teach each program
> how to use the seeds. That is step two.

Hi.

Some of the random number generators allow a vector as a seed,
not only a single number. This can simplify generating the seeds.
There can be one seed for each of the 1000 runs and then
the rows of the seed matrix can be

  c(seed1, 1), c(seed1, 2), ...
  c(seed2, 1), c(seed2, 2), ...
  c(seed3, 1), c(seed3, 2), ...
  ...

There could be even only one seed and the matrix can be generated as

  c(seed, 1, 1), c(seed, 1, 2), ...
  c(seed, 2, 1), c(seed, 2, 2), ...
  c(seed, 3, 1), c(seed, 3, 2), ...

If the initialization using the vector c(seed, i, j) is done
with a good quality hash function, the runs will be independent.

What is your opinion on this?

An advantage of seeding with a vector is also that the seed can
select among significantly more initial states of the generator
than 2^32, which is the maximum for a single integer seed.
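
A minimal sketch of this scheme, using setVectorSeed() from the
rngSetSeed package as the vector-seeding interface (the numeric values
are only illustrative):

  library(rngSetSeed)
  seed <- 12345
  run <- 667
  stream <- 2
  setVectorSeed(c(seed, run, stream))  # reproducible without a seed file
  runif(3)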

Petr Savicky.



Re: [Rd] portable parallel seeds project: request for critiques

2012-02-18 Thread Petr Savicky
On Fri, Feb 17, 2012 at 09:33:33PM -0600, Paul Johnson wrote:
[...]
> The seed things I'm using are the 6 integer values from the L'Ecuyer.
> If you run the example script, the verbose option causes some to print
> out.  The first 3 seeds in a saved project seeds file looks like:
> 
> > projSeeds[[1]]
> [[1]]
> [1] 407   376488316  1939487821  1433925148 -1040698333   579503880
> [7]  -624878918
> 
> [[2]]
> [1] 407 -1107332181   854177397  1773099324  1774170776  -266687360
> [7]   816955059
> 
> [[3]]
> [1] 407   936506900 -1924631332 -1380363206  2109234517  1585239833
> [7] -1559304513
> 
> The 407 in the first position is an integer R uses to note the type of
> stream for which the seed is intended, in this case R'Lecuyer.

The parallel streams use the MRG32k3a generator by P. L'Ecuyer, whose
original implementation stores its internal state as double precision
floating point numbers. The vector .Random.seed consists of integers. Do
you know whether a modified implementation of MRG32k3a is used, which works
with integers internally, or whether the conversion between double and
integer is done whenever the state is stored to .Random.seed?
 
> I don't have any formal proof that a "good quality hash function"
> would truly create seeds from which independent streams will be drawn.

One of the suggested ways to generate, say, 2^16 cryptographically secure
random numbers is to use a counter producing the sequence 1, 2, 3, ..., 2^16
and to transform these values by a hash function. An example is the Fortuna
generator

  http://en.wikipedia.org/wiki/Fortuna_(PRNG)

which is also available on CRAN as "randaes". The length of the sequence
is limited, since the hashed counter values contain no collisions, while a
sufficiently long truly random sequence contains collisions with noticeable
probability; a sequence that is too long could therefore be distinguished
from truly random numbers. After generating 2^16 numbers, the seed is
recomputed and another 2^16 numbers are generated.

Such a generator is a very good one. It is not used for simulations, since it
is slower (say by a factor of 5) than the linear generators like Mersenne
Twister, WELL or MRG family of generators. However, if it is used only for
initialization, then the speed is less important.

> There is, however, the proof in the L'Ecuyer paper that one can take
> the long stream and divide it into sections.  That's the approach I'm
> taking here. Its the same approach the a parallel package in R
> follows, and parallel frameworks like snow.

According to

  http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf

the sectioning of the stream into substreams is done by a jump-ahead
algorithm. The new seed is far enough ahead in the sequence of the previous
seed, so it is guaranteed that the sequence generated from the previous seed
is shorter than the jump and the subsequences do not overlap. However, the
new seed is computable as a function of the previous seed. If we use this
strategy to produce a matrix of seeds

  s_{1,1}, s_{1,2}, ...
  s_{2,1}, s_{2,2}, ...
  s_{3,1}, s_{3,2}, ...

then each s_{i,j} may be computed from s_{1,1} and i, j. In this case, it is
sufficient to store s_{1,1}. If we know for each run the indices i,j, then
we can compute s_{i,j} by an efficient algorithm.

> The different thing in my approach is that I'm saving one row of seeds
> per simulation "run".  So each run can be replicated exactly.
> 
> I hope.

Saving .Random.seed should be a safe strategy.
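
For illustration, a minimal sketch of saving and restoring .Random.seed
at the top level to replay a stream exactly:

  set.seed(1)
  saved <- .Random.seed   # snapshot of the complete generator state
  x <- runif(5)
  .Random.seed <- saved   # restore the snapshot
  y <- runif(5)
  identical(x, y)         # TRUE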

Petr Savicky.



Re: [Rd] portable parallel seeds project: request for critiques

2012-02-21 Thread Petr Savicky
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
> I've got another edition of my simulation replication framework.  I'm
> attaching 2 R files and pasting in the readme.
> 
> I would especially like to know if I'm doing anything that breaks
> .Random.seed or other things that R's parallel uses in the
> environment.
> 
> In case you don't want to wrestle with attachments, the same files are
> online in our SVN
> 
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/

Hi.

In the description of your project in the file

  
http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/README

you argue as follows

  Question: Why is this better than the simple old approach of
  setting the seeds within each run with a formula like
   
  set.seed(2345 + 10 * run)
   
  Answer: That does allow replication, but it does not assure
  that each run uses non-overlapping random number streams. It
  offers absolutely no assurance whatsoever that the runs are
  actually non-redundant.

The following demonstrates that the function set.seed() for
the default generator indeed allows correlated streams.

  ## one step of the linear congruential scrambling (69069 * x + 1
  ## mod 2^32) that set.seed() uses internally, on signed 32-bit integers
  step <- function(x)
  {
      x[x < 0] <- x[x < 0] + 2^32
      x <- (69069 * x + 1) %% 2^32
      x[x > 2^31] <- x[x > 2^31] - 2^32
      x
  }

  n <- 1000
  seed1 <- 124370417 # any seed
  seed2 <- step(seed1)

  set.seed(seed1)
  x <- runif(n)
  set.seed(seed2)
  y <- runif(n)

  rbind(seed1, seed2)
  table(x[-1] == y[-n])

The output is

             [,1]
  seed1 124370417
  seed2 205739774

  FALSE  TRUE 
      5   994 

This means that if the streams x, y are generated from the two
seeds above, then y is almost exactly equal to x shifted by 1.

What is the current state of your project?

Petr Savicky.



Re: [Rd] OpenMP and random number generation

2012-02-22 Thread Petr Savicky
Hi.

> Now that R has OpenMP facilities, I'm trying to use it for my own package but 
> I'm still wondering if it is safe to use random number generation within a 
> OpenMP block. I looked at the R writing extension document  both on the 
> OpenMP and Random number generation but didn't find any information about 
> that.

Package CORElearn (I am a coauthor) uses random number generation
under OpenMP in C++. This requires a separate copy of the generator,
with its own memory, for each thread.

Do you want to use it in C or C++?

Petr Savicky.



Re: [Rd] portable parallel seeds project: request for critiques

2012-02-22 Thread Petr Savicky
On Wed, Feb 22, 2012 at 12:17:25PM -0600, Paul Johnson wrote:
[...]
> In order for this to be easy for users, I need to put the init streams
> and set current stream functions into a package, and then streamline
> the process of creating the seed array.  My opinion is that CRAN is
> now overflowed with too many "one function" packages, I don't want to
> create another one just for these two little functions, and I may roll
> it into my general purpose regression package called "rockchalk".

I am also preparing a solution to the problem. One approach is based on
AES, used for the initialization of the R base Mersenne-Twister generator,
so it only replaces the set.seed() function. Another is based on the
"rlecuyer" package. I suggest discussing the possible solutions
off-list before submitting to CRAN.

> One technical issue that has been raised to me is that R parallel's
> implementation of the L'Ecuyer generator is based on integer valued
> variables, whereas the original L'Ecuyer code uses double real
> variables.  But I'm trusting the R Core on this, if they code the
> generator in a way that is good enough for R itself, it is good enough
> for me. (And if that's wrong, we'll all find out together :) ).

I do not know of any L'Ecuyer generator in base R. You probably
mean the authors of the extension packages with these generators.

> Josef Leydold (the rstream package author) has argued that R's
> implementation runs more slowly than it ought to. We had some
> correspondence and I tracked a few threads in forums. It appears the
> approach suggested there is roadblocked by some characteristics deep
> down in R and the way random streams are managed.  Packages have only
> a limited, tenuous method to replace R's generators with their own
> generators.

In order to connect a user defined generator to R, there are two
obligatory entry points "user_unif_rand" and "user_unif_init".
The first allows the generator to be called from runif() and similar
functions. The second connects the generator to the set.seed() function.
If there is only one extension package with a generator loaded
to an R session, then these entry points are good enough. If the
package provides several generators, like "randtoolbox", it is
possible to change between them easily using functions provided
by the package for this purpose. I think that having several
packages with generators simultaneously can be good for their
development, but this is not needed for their use in applications.
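
At the R level, the generator registered through these entry points is
selected with RNGkind() (a short illustration; it assumes a package
providing the entry points is already loaded):

  RNGkind("user-supplied")  # route runif() etc. to user_unif_rand
  runif(3)
  RNGkind("default")        # switch back to the default generator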

There are also two other entry points, "user_unif_nseed" and
"user_unif_seedloc", which allow supporting the variable ".Random.seed".
A problem with this is that R converts the internal state of the
generator to ".Random.seed" by reading a specific memory location,
but does not alert the package about this event. So, if the state
requires a transformation to integer before being stored to ".Random.seed",
it is not possible to do this only when needed.

In the package "rngwell19937", i included some code that tries to
determine, whether the user changed ".Random.seed" or not. The reason
is that most of the state is integer and is stored to ".Random.seed",
but the state contains also a function pointer, which is not stored.
It can be recomputed from ".Random.seed" and this recomputing is done,
if the package detects a change of ".Random.seed". This is not a nice
solution. So in "randtoolbox" we decided not to support ".Random.seed".

I understand that in the area of parallel computing, the work
with ".Random.seed" is a good paradigm, but if the generator
provides other tools for converting the state to an R object
and putting it back as the active state, then ".Random.seed" is not
strictly necessary.

> Parallel Random Number Generation in C++ and R Using RngStream
> Andrew Karl · Randy Eubank · Dennis Young
> http://math.la.asu.edu/~eubank/webpage/rngStreamPaper.pdf

Thank you very much for this link.

All the best, Petr.



Re: [Rd] portable parallel seeds project: request for critiques

2012-03-02 Thread Petr Savicky
On Fri, Mar 02, 2012 at 10:36:14AM +0100, Karl Forner wrote:
[...]
> Hello,
> I would be also in favor for using multiple seeds based on (seed,
> task_number) for convenience (i.e. avoiding storing the seeds)
> and with the possibility of having a dynamic number of tasks, but I am not
> sure it is theoretically correct.
> But I can refer you to this article:
> http://www.agner.org/random/ran-instructions.pdf , section 6.1
> where the author states:
> 
> For example, if we make 100 streams of 10^10 random numbers each from an
> > SFMT
> > generator with cycle length ρ = 2^11213, we have a probability of overlap
> > p ≈ 10^3362.
> >
> 
> What do you think ? I am very concerned by the correctness of this approach
> so would appreciate any advice on that matter.

Hi.

If correctness is crucial, then the literature on cryptography provides
the best answers. At the current state of knowledge, it is not possible
to prove that a generator is indistinguishable from truly random numbers
in a mathematical sense. The problem is that if we could prove this for
any generator, then the proof would also imply P \not= NP. This is an open
problem, so proving that any generator is correct in the sense of
indistinguishability is also open.

The best we can have is a generator that cannot be distinguished
from truly random numbers using the known methods, although experts in
cryptography have tried intensively to find such a method. An example of
such a generator is the Fortuna generator using AES, which is available
on CRAN as "randaes".

  http://en.wikipedia.org/wiki/Fortuna_PRNG

The same can be said about initializations. An initialization is a random
number generator whose output is used as the initial state of some
other generator. There is no proof that a particular initialization cannot
be distinguished from truly random numbers in a mathematical sense, for
the same reason as above.

A possible strategy is to use a cryptographically strong hash function
for the initialization. This means transforming the seed into the initial
state of the generator using a function for which we have a good
guarantee that it produces output which is computationally hard to
distinguish from truly random numbers. For this purpose, I suggest
using the package rngSetSeed, currently provided at

  http://www.cs.cas.cz/~savicky/randomNumbers/

It is based on AES and Fortuna, similarly to "randaes", but these
components are used only for the initialization of the Mersenne-Twister.
Once the generator is initialized, it runs at its usual speed.

In the notation of

  http://www.agner.org/random/ran-instructions.pdf

using rngSetSeed for initialization of Mersenne-Twister is Method 4
in Section 6.1.

I appreciate comments.

Petr Savicky.

P.S. I included some more comments on the relationship between provably
good random number generators and the P ?= NP question at the end of the page

  http://www.cs.cas.cz/~savicky/randomNumbers/



Re: [Rd] portable parallel seeds project: request for critiques

2012-03-02 Thread Petr Savicky
On Fri, Mar 02, 2012 at 10:36:14AM +0100, Karl Forner wrote:
[...]
> Hello,
> I would be also in favor for using multiple seeds based on (seed,
> task_number) for convenience (i.e. avoiding storing the seeds)
> and with the possibility of having a dynamic number of tasks, but I am not
> sure it is theoretically correct.
> But I can refer you to this article:
> http://www.agner.org/random/ran-instructions.pdf , section 6.1
> where the author states:
> 
> For example, if we make 100 streams of 10^10 random numbers each from an
> > SFMT
> > generator with cycle length ρ = 2^11213, we have a probability of overlap
> > p ≈ 10^3362.
> >
> 
> What do you think ? I am very concerned by the correctness of this approach
> so would appreciate any advice on that matter.

Hi.

First, let us consider a theoretical scenario where the starting
points are chosen independently and from the uniform distribution.
Then the above should be OK.

SFMT supports several different periods; one of them is 2^11213 - 1.
If we choose 100 random starting points in a cycle of this
length and consider intervals of length 10^10 each, then a given
pair of the intervals overlaps with probability

  (2 * 10^10 - 1) / (2^11213 - 1)

which is approx 10^-3365. An upper bound on the probability that
any two of the 100 streams overlap is

  choose(100, 2) * (2 * 10^10 - 1) / (2^11213 - 1)

which is still negligible, approx 10^-3362 as you say, except for a
typo in the sign.
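
The arithmetic may be verified on the log10 scale (a small check of my
own), since the probabilities themselves underflow to zero in double
precision:

  log10.pair <- log10(2 * 10^10 - 1) - 11213 * log10(2)
  log10.pair                          # approx -3365.1
  log10.pair + log10(choose(100, 2))  # approx -3361.5, i.e. about 10^-3362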

The crucial assumption is that we choose the starting points
from the uniform distribution over the period. Every point
on the period cycle is represented by a different internal
state of the generator. So, uniform distribution on the period
is equivalent to the uniform distribution on admissible states. 
An admissible state is represented by 11213 bits, which are
not all 0. The probability of all 0 is extremely small. So, 
in order to generate one such state, it is sufficient to fill
the state array with uniform i.i.d. bits. This is suitable
only as a theoretical model, since this can guarantee
reproducibility only if we store the whole state.

In order to guarantee reproducibility with simpler means, we
cannot fill the initial state with uniform i.i.d. random bits.
So, we do not have a uniform distribution over the period, but we
can have different approximations to it. A cryptographically
strong hash function is such an approximation in the sense
of computational indistinguishability. It does not and also
cannot approximate the uniform distribution in the statistical
sense, since the set of possible outputs is much smaller than
the period. However, if someone gives us two initial states,
one from the hash function and the other created from uniform
i.i.d. bits, then there is no known computational method to
distinguish these two in moderate CPU time. So, you can safely
use the output of the hash function for simulations.

In fact, the requirements of simulations are weaker. For example,
MD5 cannot be used for cryptography any more, since there are known
algorithms to break it. However, if you use it for a simulation,
then the simulation will be biased only if it contains an algorithm
which breaks MD5. The probability that this happens just by chance
is small.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] portable parallel seeds project: request for critiques

2012-03-02 Thread Petr Savicky
On Fri, Mar 02, 2012 at 01:36:34PM +0100, Karl Forner wrote:
> Thanks for your quick reply.
> 
> About the rngSetSeed package: is it usable at c/c++ level ?

Not directly. The rngSetSeed package is meant to provide an R-level
alternative to set.seed() for Mersenne-Twister with a better guarantee
that different seeds provide unrelated streams. In particular, with
rngSetSeed, it should be safe to reseed quite often, for example
in each iteration of a loop like

  for (i in 1:n) {
  setVectorSeed(c(base.seed, i))
  # some code
  }

Otherwise everything remains the same as with set.seed(). In order
to maintain several streams, one has to maintain copies of .Random.seed
and use them, when required.
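
A minimal sketch (with a hypothetical base seed 101) of maintaining two
streams via saved copies of .Random.seed:

  library(rngSetSeed)
  setVectorSeed(c(101, 1)); s1 <- .Random.seed
  setVectorSeed(c(101, 2)); s2 <- .Random.seed
  .Random.seed <- s1; x1 <- runif(5); s1 <- .Random.seed # advance stream 1
  .Random.seed <- s2; x2 <- runif(5); s2 <- .Random.seed # advance stream 2
  .Random.seed <- s1; x1b <- runif(5) # stream 1 continues where it stopped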

> Hmm I had not paid attention to the last paragraph:
> 
> > The seeding procedure used in the
> > present software use*s a separate random number* generator of a different
> > design in order to
> > avoid any interference. An extra feature is the RandomInitByArray function
> > which makes
> > it possible to initialize the random number generator with multiple seeds.
> > We can make sure
> > that the streams have different starting points by using the thread id as
> > one of the seeds.
> >
> 
> So it means that I am already using this solution ! (in the RcppRandomSFTM,
> see other post).
> and that I should be reasonably safe.

If RcppRandomSFTM uses the initialization for SFMT provided by the
authors of SFMT, then you should be reasonably safe. I do not know the
SFMT initializations in detail, but for the original Mersenne-Twister,
the initializations from 2002, both from a single number seed and from a
vector seed, avoid known problems with the previous initializations. It
should be pointed out, however, that the initialization by array from
2002 is more careful than the single number initialization, so it is
better to use the array initialization even for a single number seed.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-07 Thread Petr Savicky
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
>  PD> I think that the real issue is that we actually do want almost-equal
>  PD> numbers to be folded together. 
> 
> yes, this now (revision 48469) will happen by default, using  signif(x, 15) 
> where '15' is the default for the new optional argument 'digitsLabels'

On some platforms, the function factor() in the current R 2.10.0
(2009-05-06 r48478) may produce duplicated levels. The examples are
in general platform dependent. The following one produces duplicated
(in fact triplicated) levels on both Intel default arithmetic and
on Intel with SSE.

  x <- 9.7738826945424 + c(-1, 0, 1) * 1e-14
  x <- signif(x, 15)
  factor(x)
  # [1] 9.7738826945424 9.7738826945424 9.7738826945424
  # Levels: 9.7738826945424 9.7738826945424 9.7738826945424
  # Warning message:
  # In `levels<-`(`*tmp*`, value = c("9.7738826945424", "9.7738826945424",  :
  #   duplicated levels will not be allowed in factors anymore

The reason is that the three numbers remain different in signif(x, 15),
but are mapped to the same string in as.character(x).

  length(unique(x)) # [1] 3
  length(unique(as.character(x))) # [1] 1

Further examples may be found using

  x <- as.character(9 + runif(5000))
  x <- as.numeric(x[nchar(x)==15]) # select numbers with 14 digits
  x <- signif(cbind(x - 1e-14, x, x + 1e-14), 15)
  y <- array(as.character(x), dim=dim(x))
  x <- x[which(y[,1] == y[,3]),]
  factor(x[1,])

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote:
>  PD> I think that the real issue is that we actually do want almost-equal
>  PD> numbers to be folded together. 
> 
> yes, this now (revision 48469) will happen by default, using  signif(x, 15) 
> where '15' is the default for the new optional argument 'digitsLabels'
> {better argument name? (but must not start with 'label')}

Let me analyze the current behavior of factor(x) for numeric x with
missing(levels) and missing(labels). In this situation, levels are computed
as sort(unique(x)) from possibly transformed x. Then, labels are constructed
by a conversion of the levels to strings.

I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows.

  If keepUnique is FALSE (the default), then
- values x are transformed by signif(x, digitsLabels)
- labels are computed using as.character(levels)
- digitsLabels defaults to 15, but may be set to any integer value

  If keepUnique is TRUE, then
- values x are preserved
- labels are computed using sprintf("%.*g", digitsLabels, levels)
- digitsLabels defaults to 17, but may be set to any integer value

There are several situations in which this approach produces duplicated levels.
Besides the one described in my previous email, there are others:
 factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15)
 factor(1 + 0:5 * 1e-16, digitsLabels=17)

I would like to suggest a modification. It eliminates most of the cases where
we get duplicated levels. It would eliminate all such cases if the function
signif() worked as expected. Unfortunately, with signif() as it works in the
current versions of R, we still get duplicated levels.

The suggested modification is as follows.

  If keepUnique is FALSE (the default), then
- values x are transformed by signif(x, digitsLabels)
- labels are computed using sprintf("%.*g", digitsLabels, levels)
- digitsLabels defaults to 15, but may be set to any integer value

  If keepUnique is TRUE, then
- values x are preserved
- labels are computed using sprintf("%.*g", 17, levels)
- digitsLabels is ignored

Arguments for the modification are the following.

1. If keepUnique is FALSE, then computing labels using as.character() leads
   to duplicated labels, as demonstrated in my previous email. So, I suggest
   using sprintf("%.*g", digitsLabels, levels) instead of as.character().

2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get
   duplicated labels. So, I suggest forcing digitsLabels=17 if
   keepUnique=TRUE.

If signif(,digitsLabels) works as expected, then the above approach should not
produce duplicated labels. Unfortunately, this is not the case.
There are numbers which remain different in signif(x, 16), but are mapped
to the same string in sprintf("%.*g", 16, x). Examples of this kind may be
found using the script

   for (i in 1:50) {
   x <- 10^runif(1, 38, 50)
   y <- x * (1 + 0:500 * 1e-16)
   y <- unique(signif(y, 16))
   z <- unique(sprintf("%.16g", y))
   stopifnot(length(y) == length(z))
   }

This script is tested on Intel default arithmetic and on Intel with SSE.

Perhaps digitsLabels = 16 could be forbidden if keepUnique is FALSE.

Unfortunately, a similar problem occurs even for digitsLabels = 15, although for
much larger numbers.

   for (i in 1:200) {
   x <- 10^runif(1, 250, 300)
   y <- x * (1 + 0:500 * 1e-16)
   y <- unique(signif(y, 15))
   z <- unique(sprintf("%.15g", y))
   stopifnot(length(y) == length(z))
   }

This script finds collisions, if SSE is enabled, on the two Intel computers
where I did the test. Without SSE, it finds collisions only on one of them.
Maybe it also depends on the compiler, which is different.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Fri, May 08, 2009 at 03:18:01PM +0200, Martin Maechler wrote:
> As long as we don't want to allow  factor() to fail --rarely -- 
> I think (and that actually has been a recurring daunting thought
> for quite a few days) that we probably need an
> extra step of checking for duplicate levels, and if we find
> some, recode "everything". This will blow up the body of the
> factor() function even more.
> 
> What alternatives do you (all R-devel readers!) see?

The command
  f <- match(x, levels)
in factor() uses the original values and not their string representations.
I think that the main reason to do so is that we lose the ordering if the
conversion to character is done before the levels are sorted.

Let me suggest considering the following modification, where match() is done
on the strings, not on the original values.
  levels <- unique(as.character(sort(unique(x))))
  x <- as.character(x)
  f <- match(x, levels)

Since unique() preserves the order, we will get the levels correctly
ordered. Due to using unique() twice, we will not have duplicated levels.

Is it correct?
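
A quick check of the construction on a small example of my own:

  x <- c(2.5, 1.5, 2.5)
  levels <- unique(as.character(sort(unique(x))))
  x <- as.character(x)
  f <- match(x, levels)
  levels # [1] "1.5" "2.5"
  f # [1] 2 1 2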

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote:
> Let me suggest considering the following modification, where match() is done
> on the strings, not on the original values.
>   levels <- unique(as.character(sort(unique(x))))
>   x <- as.character(x)
>   f <- match(x, levels)

An alternative solution is
  ind <- order(x)
  x <- as.character(x) # or any other conversion to character
  levels <- unique(x[ind]) # get unique levels ordered by the original values
  f <- match(x, levels)

The advantage of this over the suggestion from my previous email is that
the string conversion is applied only once. The conversion need not be only
as.character(); there may be other choices, specified by a parameter. I have
strong objections against the existing implementation of as.character(),
but still I think that as.character() should be the default for factor(),
for the sake of consistency of the R language.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-08 Thread Petr Savicky
On Fri, May 08, 2009 at 06:48:40PM +0200, Martin Maechler wrote:
> >>>>> "PS" == Petr Savicky 
> >>>>> on Fri, 8 May 2009 18:10:56 +0200 writes:
[...]
> PS> ... I have
> PS> strong objections against the existing implementation of 
> as.character(),
> 
> {(because it is not *accurate* enough, right ?)}

The problem is not exactly low accuracy. The problem is unpredictable
accuracy. If the accuracy were systematically 15 or 14 digits, it would be
fine and suitable for most purposes.

However, the accuracy ranges between 14 and 20 digits and may be different
on different platforms. For example, on my old Xeon computer, the same
numbers may be converted to strings representing different values:

  with SSE   without SSE

  "8459184.47742229" "8459184.4774223" 
  "84307700406756081664" "8.4307700406756e+19" 
  "9262815.27852281" "9262815.2785228" 
  "2.1006742758024e+19"  "21006742758023974912"
  "7.07078598983389e+25" "7.0707859898339e+25" 
  "8.0808066145628e+28"  "8.08080661456281e+28"
  "9180932974.85929" "9180932974.8593" 
  "72.4923408890729" "72.492340889073" 

Sometimes there are differences in trailing zeros.

  with SSE   without SSE

  "1.97765325859480e+25" "1.9776532585948e+25" 
  "21762633836.0360" "21762633836.036" 
  "2018960238339.80" "2018960238339.8" 
  "239567.78053486"  "239567.780534860"
  "2571116684765.50" "2571116684765.5" 
  "3989945.2102949"  "3989945.21029490"
  "1.1259245205867e+23"  "1.12592452058670e+23"
  "3.2867033904477e+29"  "3.28670339044770e+29"
  "2.8271117654895e+29"  "2.82711176548950e+29"
  "26854166.6173020" "26854166.617302" 
  "4.85247217360750" "4.8524721736075" 
  "345123.247838540" "345123.24783854" 

For random numbers in a sample generated as 10^runif(10^5, 0, 30),
from which I selected the first 20 examples above, the probability of
different results was almost 0.01 (978 differences among 10^5 numbers).
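
Within a single session, the rounding alone (independently of the platform
differences above) can be seen from the fraction of values that do not
survive a round trip through as.character(); this is my own check, not
part of the original experiment:

  x <- 10^runif(1e4, 0, 30)
  mean(as.numeric(as.character(x)) != x) # well above zero, since doubles
                                         # may need 17 digits to round-trip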

I think that the platform dependence even limits the advantage
of backward compatibility.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-10 Thread Petr Savicky
On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote:
[...]
> If'd revert to such a solution,
> we'd have to get back to Peter's point about the issue that
> he'd think  table(.) should be more tolerant than as.character()
> about "almost equality".
> For compatibility reasons, we could also return back to the
> reasoning that useR should use {something like}
> table(signif(x, 14)) 
> instead of
> table(x) 
> for numeric x in "typical" cases.

In the released versions 2.8.1 and 2.9.0, function factor() satisfies
  identical(as.character(factor(x)), as.character(x))(*)
for all numeric x. This follows from the code (levels are computed by
as.character() from unmodified input values) and may be verified
even for the problematic cases, for example
  x <- (0.3 + 2e-16 * c(-2,-1,1,2))
  factor(x)
  # [1] 0.300000000000000 0.3               0.3               0.300000000000000
  # Levels: 0.300000000000000 0.3 0.3 0.300000000000000
  as.character(x)
  # [1] "0.300000000000000" "0.3"               "0.3"
  # [4] "0.300000000000000"
  identical(as.character(factor(x)), as.character(x))
  # [1] TRUE

In my opinion, it is reasonable to require that (*) be preserved also in future
versions of R.

Function as.character(x) has disadvantages. Besides the platform dependence,
it also does not always perform the rounding needed to eliminate FP errors.
Usually, as.character(x) rounds to at most 15 digits, so we get, for example
  as.character(0.1 + 0.2) # [1] "0.3"
as required. However, there are also exceptions, for example
  as.character(1e19 + 1e5) # [1] "10000000000000100352"

Here, the number is printed exactly, so the resulting string contains the FP
error caused by the fact that 1e19 + 1e5 has more than 53 significant digits
in binary representation, namely 59.

  binary representation of 1e19 + 1e5 is
  1000101011000111001000110000010010001001111010011000011010100000

  binary representation of 10000000000000100352 is
  1000101011000111001000110000010010001001111010011000100000000000

However, as.character(x) seems to do enough rounding for most purposes,
otherwise it would not be suitable as the basic numeric to character
conversion. If table() needs factor() with a different conversion than
as.character(x), it may be done explicitly as discussed by Martin above.

So, I suggest using as.character() as the default conversion in factor(),
so that
  identical(as.character(factor(x)), as.character(x))
is satisfied for the default usage of factor().

Of course, I appreciate it if factor() has parameters which allow better
control of the underlying conversion, as is done in the current development
versions.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-11 Thread Petr Savicky
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote:
[...]
> The version I have committed a few hours ago is indeed a much
> re-simplified version, using  as.character(.) explicitly
> and consequently no longer providing the extra optional
> arguments that we have had for a couple of days.
> 
> Keeping such a basic function   factor()  as simple as possible 
> seems a good strategy to me.

OK. I understand the argument of simplicity. So, factor(x) is just
a compressed encoding of as.character(x), where each value is stored
only once. This sounds good to me.

Let me go back to the original purpose of this thread: suggestion for
extending ?as.factor

I think that somewhere in the help page, we could have something like

  Applying factor() to a numeric vector should be done with caution. The
  information in x is preserved to the extent to which it is preserved
  in as.character(x). If this leads to too many different levels due to minor
  differences among the input numbers, it is suggested to use something like
  factor(signif(x, digits)) or factor(round(x, digits)), where a number of
  decimal digits appropriate for the given application should be used.
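
For instance (my own example, not part of the proposed text):

  x <- 0.3 + 2e-16 * c(-2, -1, 1, 2)
  factor(signif(x, 12))
  # [1] 0.3 0.3 0.3 0.3
  # Levels: 0.3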

Let me point out that the following sentence from the Warning section is not
exactly correct as it stands in svn at the moment. So, I suggest adding the
word "approximately" at the place marked with square brackets and adding one
more sentence of explanation, also marked by square brackets.

  To transform a factor \code{f} to [approximately]
  its original numeric values, \code{as.numeric(levels(f))[f]} is
  recommended and slightly more efficient than
  \code{as.numeric(as.character(f))}.
  [Note that the original values may be extracted only to the precision
  used in as.character(x), which is typically 15 decimal digits.]

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] suggestion for extending ?as.factor

2009-05-12 Thread Petr Savicky
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote:
> The version I have committed a few hours ago is indeed a much
> re-simplified version, using  as.character(.) explicitly

The current development version (2009-05-11 r48528) contains
in ?factor a description of levels parametr

  levels: an optional vector of the values that 'x' might have taken. 
  The default is the unique set of values taken by
  'character(x)', sorted into increasing order _of 'x'_.  Note
  that this set can be smaller than 'sort(unique(x))'.

I think that 'character(x)' should be 'as.character(x)'.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] simple add error (PR#13699)

2009-05-13 Thread Petr Savicky
On Wed, May 13, 2009 at 02:35:12PM +0200, gos...@igmm.cnrs.fr wrote:
> I cannot explain why R seems to have problems adding two big numbers.
> 
> sprintf("%f",10^4+10^19) gives "10000000000000010240.000000"
> instead of "10000000000000010000.000000"
> 
> problems seems to arrive when i'm trying to add a big and a small number...

There are already two correct answers to your problem. If you are interested
in more detail, it is as follows. The number 10^4+10^19 is in binary
  100010101100011100100011010010001000101001110001
It has 60 significant binary digits. Machine representation (double precision)
rounds numbers to 53 significant digits, so in binary representation, it becomes
  10001010110001110010001101001000100010101000
which is 10010240 in decimal.

So, the problem is not specific to R, but to floating point numbers in general.
Floating point numbers with 53 binary digits are used for efficiency. If you
need more accurate arithmetic at the cost of a considerable slowdown, use a
computer algebra system. Some of them are even accessible from R, see
  http://wiki.r-project.org/rwiki/doku.php?id=misc:r_accuracy
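
The rounding step itself can be illustrated in R (my own addition): near
10^19 the spacing of representable doubles is 2^11 = 2048, so the added 10^4
is rounded to the nearest multiple of 2048, which is 5 * 2048 = 10240.

  2^(floor(log2(1e19)) - 52)  # [1] 2048, the spacing of doubles near 1e19
  1e19 + 1e4 == 1e19 + 10240  # [1] TRUE
  sprintf("%.0f", 1e19 + 1e4) # [1] "10000000000000010240"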

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Spearman's rank correlation test (PR#13574)

2009-05-15 Thread Petr Savicky
Bug report "Spearman's rank correlation test (PR#13574)" was moved
to the trashcan with an empty Notes field. I would like to learn what was
wrong with this bug report. Can I ask the developers to add a note to it?

Thank you in advance.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

2009-05-23 Thread Petr Savicky
Function factor() in the current development version (2009-05-22)
guarantees that levels are different character strings. However, they
may represent the same decimal number. The following example is derived
from a posting by Stavros Macrakis in thread "Match .3 in a sequence"
in March

  nums <- 0.3 + 2e-16 * c(-2,-1,1,2)
  f <- factor(nums)
  levels(f)
  # [1] "0.300000000000000" "0.3"

The levels differ in trailing zeros, but represent the same decimal number.
Besides the fact that this is not really meaningful, it may cause a problem
when using as.numeric(levels(f)).

In the above case, as.numeric() works fine and maps the two levels to the same
number. However, there are cases where the difference in trailing zeros
implies different values in as.numeric(levels(f)), and these values may even
form a decreasing sequence, although the levels were constructed from an
increasing sequence of numbers.

Examples are platform dependent, but may be found by the following code.
Tested on Intel under Linux (both with and without SSE) and also under Windows
with an older version of R.

  for (i in 1:10) {
  x <- 10^(floor(runif(1, 61, 63)) + runif(1)/2)
  x <- as.numeric(sprintf("%.14g", x))
  eps <- 2^(floor(log2(x)) - 52)
  k <- round(x * c(5e-16, 1e-15) / eps)
  if (x > 1e62) { k <- rev( - k) }
  y <- x + k[1]:k[2] * eps
  ind <- which(diff(as.numeric(as.character(y))) < 0)
  for (j in ind) {
  u1 <- y[c(j, j+1)]
  u2 <- factor(u1)
  print(levels(u2))
      print(diff(as.numeric(levels(u2))))
  aux <- readline("next")
  }
  }

An example of the output is

  [1] "1.2296427920313e+61"  "1.22964279203130e+61"
  [1] -1.427248e+45
  next
  [1] "1.82328862326830e+62" "1.8232886232683e+62" 
  [1] -2.283596e+46
  next

The negative number in diff(as.numeric(levels(u2))) demonstrates cases,
when as.numeric(levels(u2)) is decreasing. We can also see that the reason
is that the two strings in levels(u2) differ in the trailing zeros.

I did quite an intensive search for such examples for all possible exponents
(not only 61 and 62; it took a week of CPU time on three processors), and all
the obtained examples were caused by a difference in trailing zeros. So, I
believe that removing trailing zeros from the output of as.character(x) solves
the problem with the reversed order in as.numeric(levels(factor(x))) entirely.

A patch against R-devel_2009-05-22, which eliminates trailing zeros
from as.character(x), but makes no other changes to as.character(x),
is in an attachment. Using the patch, we obtain a better result also
in the following.

  nums <- 0.3 + 2e-16 * c(-2,-1,1,2)
  factor(nums)
  # [1] 0.3 0.3 0.3 0.3
  # Levels: 0.3

Petr.

--- R-devel/src/main/coerce.c   2009-04-17 17:53:35.0 +0200
+++ R-devel-elim-trailing/src/main/coerce.c 2009-05-23 08:39:03.914774176 +0200
@@ -294,12 +294,33 @@
 else return mkChar(EncodeInteger(x, w));
 }
 
+const char *elim_trailing(const char *s, char cdec)
+{
+const char *p;
+char *replace;
+for (p = s; *p; p++) {
+if (*p == cdec) {
+replace = (char *) p++;
+while ('0' <= *p & *p <= '9') {
+if (*(p++) != '0') {
+replace = (char *) p;
+}
+}
+while (*(replace++) = *(p++)) {
+;
+}
+break;
+}
+}
+return s;
+}
+
 SEXP attribute_hidden StringFromReal(double x, int *warn)
 {
 int w, d, e;
 formatReal(&x, 1, &w, &d, &e, 0);
 if (ISNA(x)) return NA_STRING;
-else return mkChar(EncodeReal(x, w, d, e, OutDec));
+else return mkChar(elim_trailing(EncodeReal(x, w, d, e, OutDec), OutDec));
 }
 
 SEXP attribute_hidden StringFromComplex(Rcomplex x, int *warn)
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] qbinom (PR#13711)

2009-05-23 Thread Petr Savicky
On Wed, May 20, 2009 at 11:10:11PM +0200, wolfgang.re...@gmail.com wrote:
...
> Strange behavior of qbinom:
> 
> > qbinom(0.01, 5016279, 1e-07)
> [1] 0
> > qbinom(0.01, 5016279, 2e-07)
> [1] 16
> > qbinom(0.01, 5016279, 3e-07)
> [1] 16
> > qbinom(0.01, 5016279, 4e-07)
> [1] 16
> > qbinom(0.01, 5016279, 5e-07)
> [1] 0
> 

There is a bug in function do_search() in file src/nmath/qbinom.c. This
function contains a cycle
    for(;;) {
        if(y == 0 ||
           (*z = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p)
            return y;
        y = fmax2(0, y - incr);
    }
When this cycle stops, *z contains pbinom(y - incr, ...), but is used
as if it is pbinom(y, ...) for the resulting y.

In the example qbinom(p=0.01, size=5016279, prob=4e-07), we get at 
some step y=15 as a result of a search left with incr=50, so we have
*z=pbinom(-35, ...)=0. Hence, y=15 is treated as too low and is increased
to 16. Since 16 is detected to be sufficient, the search stops with y=16,
which is wrong.

A possible correction is to replace the code above by
    double newz;
    for(;;) {
        if(y == 0 ||
           (newz = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p)
            return y;
        y = fmax2(0, y - incr);
        *z = newz;
    }

Patch against R-devel_2009-05-22 is in an attachment.

Verification may be done using the following examples.
  b1 <- qbinom(p=0.01, size=5016279, prob=(1:20)*1e-07)
  b2 <- qbinom(p=0.01, size=5006279, prob=(1:20)*1e-07)
  b3 <- qbinom(p=0.01, size=5000279, prob=(1:20)*1e-07)
  p1 <- qpois(p=0.01, lambda=5016279*(1:20)*1e-07)
  p2 <- qpois(p=0.01, lambda=5006279*(1:20)*1e-07)
  p3 <- qpois(p=0.01, lambda=5000279*(1:20)*1e-07)
  cbind(b1, b2, b3, p1, p2, p3)

Wrong
b1 b2 b3 p1 p2 p3
   [1,]  0  0  0  0  0  0
   [2,] 16  6 50  0  0  0
   [3,] 16  6 50  0  0  0
   [4,] 16  6 50  0  0  0
   [5,]  0  0  0  0  0  0
   [6,]  0  0  0  0  0  0
   [7,]  0  0  0  0  0  0
   [8,]  0  0  0  0  0  0
   [9,]  0  0  0  0  0  0
  [10,]  1  1  1  1  1  1
  [11,]  1  1  1  1  1  1
  [12,]  1  1  1  1  1  1
  [13,]  1  1  1  1  1  1
  [14,]  2  2  2  2  2  2
  [15,]  2  2  2  2  2  2
  [16,]  2  2  2  2  2  2
  [17,] 19  9 53  3  3  3
  [18,]  3  3  3  3  3  3
  [19,]  3  3  3  3  3  3
  [20,]  3  3  3  3  3  3

Correct
b1 b2 b3 p1 p2 p3
   [1,]  0  0  0  0  0  0
   [2,]  0  0  0  0  0  0
   [3,]  0  0  0  0  0  0
   [4,]  0  0  0  0  0  0
   [5,]  0  0  0  0  0  0
   [6,]  0  0  0  0  0  0
   [7,]  0  0  0  0  0  0
   [8,]  0  0  0  0  0  0
   [9,]  0  0  0  0  0  0
  [10,]  1  1  1  1  1  1
  [11,]  1  1  1  1  1  1
  [12,]  1  1  1  1  1  1
  [13,]  1  1  1  1  1  1
  [14,]  2  2  2  2  2  2
  [15,]  2  2  2  2  2  2
  [16,]  2  2  2  2  2  2
  [17,]  3  3  3  3  3  3
  [18,]  3  3  3  3  3  3
  [19,]  3  3  3  3  3  3
  [20,]  3  3  3  3  3  3

Petr.

--- R-devel/src/nmath/qbinom.c  2007-07-25 17:54:27.0 +0200
+++ R-devel-qbinom/src/nmath/qbinom.c   2009-05-23 17:22:43.538566976 +0200
@@ -36,6 +36,7 @@
 static double 
 do_search(double y, double *z, double p, double n, double pr, double incr)
 {
+double newz;
 if(*z >= p) {
/* search to the left */
 #ifdef DEBUG_qbinom
@@ -43,9 +44,10 @@
 #endif
for(;;) {
if(y == 0 ||
-	       (*z = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p)
+	       (newz = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p)
return y;
y = fmax2(0, y - incr);
+   *z = newz;
}
 }
 else { /* search to the right */
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] inconsistency in ?factor

2009-05-25 Thread Petr Savicky
In the almost current development version (2009-05-22 r48594) and also in 
  https://svn.r-project.org/R/trunk/src/library/base/man/factor.Rd
?factor contains (compare the formulations marked by ^^)

\section{Warning}{
  The interpretation of a factor depends on both the codes and the
  \code{"levels"} attribute.  Be careful only to compare factors with
  the same set of levels (in the same order).  
                         ^^^^^^^^^^^^^^^^^^^

\section{Comparison operators and group generic methods}{
  ...

  Only \code{==} and \code{!=} can be used for factors: a factor can
  only be compared to another factor with an identical set of levels
  (not necessarily in the same ordering) or to a character vector.
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the development version 2009-05-22 r48594, the latter formulation
"not necessarily in the same ordering" is correct.

  f1 <- factor(c("a", "b", "c", "c", "b", "a", "a"), levels=c("a", "b", "c"))
  f2 <- factor(c("a", "b", "c", "c", "b", "a", "c"), levels=c("c", "b", "a"))
  f1 == f2 # [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

The first formulation, "Be careful to compare ... levels in the same order",
may be just a warning against a potential problem if the levels have a
different order; however, this is not clear.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] inconsistency in ?factor

2009-05-25 Thread Petr Savicky
On Mon, May 25, 2009 at 03:58:06PM +0800, Berwin A Turlach wrote:
> Well, the first statement is a remark on comparison in general while
> the second statement is specific to "comparison operators and generic
> methods".  There are other ways of comparing objects; note:
> 
> R> f1 <- factor(c("a", "b", "c", "c", "b", "a"), levels=c("a", "b", "c"))
> R> f2 <- factor(c("a", "b", "c", "c", "b", "a"), levels=c("c", "b", "a"))
> R> f1==f2
> [1] TRUE TRUE TRUE TRUE TRUE TRUE
> R> identical(f1,f2)
> [1] FALSE
> R> all.equal(f1,f2)
> [1] "Attributes: < Component 2: 2 string mismatches >"

I see. We have to distinguish between comparing the objects and comparing
their components.

Let me propose the following formulation
  Two factors may be identical only if they have the same set of
  levels (in the same order).
instead of
  Be careful only to compare factors with the same set of
  levels (in the same order).

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

2009-05-28 Thread Petr Savicky
On Wed, May 27, 2009 at 10:51:38PM +0200, Martin Maechler wrote:
> I have very slightly  modified the changes (to get rid of -Wall
> warnings) and also exported the function as Rf_dropTrailing0(),
> and tested the result with 'make check-all' .

Thank you very much for considering the patch. -Wall indeed requires adding
parentheses:
  warning: suggest parentheses around comparison in operand of &
  warning: suggest parentheses around assignment used as truth value

If there are also other changes, I would like to ask you to make your
modification available, mainly because of possible further discussion.

Let me also suggest a modification of my original proposal. It contains a cycle
   while (*(replace++) = *(p++)) {
   ;
   }
If the number has no trailing zeros but contains an exponent, this cycle
shifts the exponent by 0 positions, which means that it copies each of its
characters onto itself. This may be eliminated as follows
   if (replace != p) {
   while (*(replace++) = *(p++)) {
   ;
   }
   }

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

2009-05-29 Thread Petr Savicky
On Fri, May 29, 2009 at 03:53:02PM +0200, Martin Maechler wrote:
> my version of *using* the function was
> 
> 1 SEXP attribute_hidden StringFromReal(double x, int *warn)
> 2 {
> 3   int w, d, e;
> 4   formatReal(&x, 1, &w, &d, &e, 0);
> 5   if (ISNA(x)) return NA_STRING;
> 6   else return mkChar(dropTrailing0(EncodeReal(x, w, d, e, OutDec), OutDec));
> 7 }
> 
> where you need to consider that mkChar() expects a 'const char*' 
> and EncodeReal(.) returns one, and I am pretty sure this was the
> main reason why Petr had used the two 'const char*' in (the
> now-named) dropTrailing0() definition. 

Yes, the goal was to accept the output of EncodeReal() with exactly the
same type that EncodeReal() produces. A question is whether the output
type of EncodeReal() could be changed to (char *). Then, changing the
output string could be done without casting const to non-const.

This solution may be in conflict with the structure of the rest of the R
code, so I cannot evaluate whether this is possible.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence

2009-05-31 Thread Petr Savicky
On Sat, May 30, 2009 at 07:32:52PM +0200, Martin Maechler wrote:
> > "vQ" == Wacek Kusnierczyk 
> > on Sat, 30 May 2009 11:16:43 +0200 writes:

[...]

> vQ> one simple way to improve the code is as follows;  instead of 
> (simplified)
> 
> vQ> const char* dropTrailing(const char* s, ...) {
> vQ> const char *p = s;
> vQ> char *replace;
> vQ> ...
> vQ> replace = (char*) p;
> vQ> ...
> vQ> return s; }
> 
> vQ> ...mkChar(dropTrailing(EncodeReal(...), ...) ...
> 
> vQ> you can have something like
> 
> vQ> const char* dropTrailing(char* s, ...) {
> vQ> char *p = s, *replace;
> vQ> ...
> vQ> replace = p;
> vQ> ...
> vQ> return s; }
> 
> vQ> ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ...
>   
> vQ> where it is clear, from DT's signature, that it may (as it 
> purposefully
> vQ> does, in fact) modify the content of s.  that is, you drop the
> vQ> promise-not-to-modify contract in DT, and move the need for
> vQ> deconstifying ER's return out of DT, making it more explicit.

[...]

> vQ> (3) modify petr's solution along the lines above, i.e., have the input
> vQ> in the signature non-const and deconst-cast the output from ER outside
> vQ> of the call to DT.
> 
> that's what I have adopted, as I'm sure you've noticed when you
> saw the code above.

I appreciate the current version, which contains
  static const char* dropTrailing0(char *s, char cdec)
  ...
  mkChar(dropTrailing0((char *)EncodeReal(x, w, d, e, OutDec), ...

Here, it is easier to see that the cast (char *) is used than if it were
hidden inside dropTrailing0(). Also, it makes dropTrailing0() more consistent.

I would like to recall the already discussed modification
if (replace != p)
while((*(replace++) = *(p++)))
;
which saves a few instructions in the more frequent case that there are no
trailing zeros.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] read.csv

2009-06-16 Thread Petr Savicky
On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote:
> On 14-Jun-09 18:56:01, Gabor Grothendieck wrote:
> > If read.csv's colClasses= argument is NOT used then read.csv accepts
> > double quoted numerics:
> > 
> > 1: > read.csv(stdin())
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> >   A B
> > 1 1 1
> > 2 2 2
> > 
> > However, if colClasses is used then it seems that it does not:
> > 
> >> read.csv(stdin(), colClasses = "numeric")
> > 0: A,B
> > 1: "1",1
> > 2: "2",2
> > 3:
> > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines,
> > na.strings,  :
> >   scan() expected 'a real', got '"1"'
> > 
> > Is this really intended?  I would have expected that a csv file
> > in which each field is surrounded with double quotes is acceptable
> > in both cases. This may be documented as is yet seems undesirable
> > from both a consistency viewpoint and the viewpoint that it should
> > be possible to double quote fields in a csv file.
> 
> Well, the default for colClasses is NA, for which ?read.csv says:
>   [...]
>   Possible values are 'NA' (when 'type.convert' is used),
>   [...]
> and then ?type.convert says:
>   This is principally a helper function for 'read.table'. Given a
>   character vector, it attempts to convert it to logical, integer,
>   numeric or complex, and failing that converts it to factor unless
>   'as.is = TRUE'.  The first type that can accept all the non-missing
>   values is chosen.
> 
> It would seem that type 'logical' won't accept integer (naively one
> might expect 1 --> TRUE, but see experiment below), so the first
> acceptable type for "1" is integer, and that is what happens.
> So it is indeed documented (in the R[ecursive] sense of "documented" :))
> 
> However, presumably when colClasses is used then type.convert() is
> not called, in which case R sees itself being asked to assign a
> character entity to a destination which it has been told shall be
> integer, and therefore, since the default for as.is is
>   as.is = !stringsAsFactors
> but for this ?read.csv says that stringsAsFactors "is overridden
> bu [sic] 'as.is' and 'colClasses', both of which allow finer
> control.", so that wouldn't come to the rescue either.
> 
> Experiment:
>   X <-logical(10)
>   class(X)
>   # [1] "logical"
>   X[1]<-1
>   X
>   # [1] 1 0 0 0 0 0 0 0 0 0
>   class(X)
>   # [1] "numeric"
> so R has converted X from class 'logical' to class 'numeric'
> on being asked to assign a number to a logical; but in this
> case its hands were not tied by colClasses.
> 
> Or am I missing something?!!

In my opinion, you explain how it happens that there is a difference
in the behavior between
  read.csv(stdin(), colClasses = "numeric")
and
  read.csv(stdin())
but not why it is so.

The algorithm "use the smallest type which accepts all non-missing values"
may well be applied to the input values either literally or after removing
the quotes. Is there a reason why
  read.csv(stdin(), colClasses = "numeric")
removes quotes from the input values and
  read.csv(stdin())
does not?
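
The mechanism can be seen directly at the scan() level (a minimal
illustration of my own):

  scan(textConnection('"1" 2'), what = "")       # quotes are stripped
  # [1] "1" "2"
  scan(textConnection('"1" 2'), what = double())
  # Error: scan() expected 'a real', got '"1"'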

Using double-quote characters is part of the definition of a CSV file; see,
for example,
  http://en.wikipedia.org/wiki/Comma_separated_values
where one may find:
  Fields may always be enclosed within double-quote characters, whether
  necessary or not.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] read.csv

2009-06-25 Thread Petr Savicky
On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote:
> If read.csv's colClasses= argument is NOT used then read.csv accepts
> double quoted numerics:
> 
> 1: > read.csv(stdin())
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
>   A B
> 1 1 1
> 2 2 2
> 
> However, if colClasses is used then it seems that it does not:
> 
> > read.csv(stdin(), colClasses = "numeric")
> 0: A,B
> 1: "1",1
> 2: "2",2
> 3:
> Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
>   scan() expected 'a real', got '"1"'
> 
> Is this really intended?  I would have expected that a csv file in which
> each field is surrounded with double quotes is acceptable in both
> cases.   This may be documented as is yet seems undesirable from
> both a consistency viewpoint and the viewpoint that it should be
> possible to double quote fields in a csv file.

The problem is not specific to read.csv(). The same difference appears
for read.table().
  read.table(stdin())
  "1" 1
  2 "2"
  
  #   V1 V2
  # 1  1  1
  # 2  2  2
but
  read.table(stdin(), colClasses = "numeric")
  "1" 1
  2 "2"
  
  Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '"1"'

The error occurs in the call of scan() at line 152 in
src/library/utils/R/readtable.R, which is
  data <- scan(file = file, what = what, sep = sep, quote = quote, ...
(This is the third call of scan() in the source code of read.table().)

In this call, scan() gets the types of the columns in the "what" argument.
If the type is specified, scan() performs the conversion itself and fails if
a numeric field is quoted. If the type is not specified, the output of scan()
is of type character, but with the quotes eliminated, if there are some in
the input file. Columns with unknown type are then converted using
type.convert(), which receives the data already without quotes.

The call of type.convert() is contained in a cycle
    for (i in (1L:cols)[do]) {
        data[[i]] <-
            if (is.na(colClasses[i]))
                type.convert(data[[i]], as.is = as.is[i], dec = dec,
                             na.strings = character(0L))
                ## as na.strings have already been converted to <NA>
            else if (colClasses[i] == "factor") as.factor(data[[i]])
            else if (colClasses[i] == "Date") as.Date(data[[i]])
            else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
            else methods::as(data[[i]], colClasses[i])
    }
which also contains lines that could perform the conversion for columns with
a specified type; but these lines are not used, since the vector "do"
is defined as
  do <- keep & !known
where "known" determines for which columns the type is known.

It is possible to modify the code so that scan() is called with all types
unspecified and to leave the conversion to the lines
    else if (colClasses[i] == "factor") as.factor(data[[i]])
    else if (colClasses[i] == "Date") as.Date(data[[i]])
    else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]])
    else methods::as(data[[i]], colClasses[i])
above. Since this solution is already prepared in the code, the patch is very
simple
  --- R-devel/src/library/utils/R/readtable.R 2009-05-18 17:53:08.0 +0200
  +++ R-devel-readtable/src/library/utils/R/readtable.R 2009-06-25 10:20:06.0 +0200
  @@ -143,9 +143,6 @@
   names(what) <- col.names
   
   colClasses[colClasses %in% c("real", "double")] <- "numeric"
  -known <- colClasses %in%
  -c("logical", "integer", "numeric", "complex", "character")
  -what[known] <- sapply(colClasses[known], do.call, list(0))
   what[colClasses %in% "NULL"] <- list(NULL)
   keep <- !sapply(what, is.null)
   
  @@ -189,7 +186,7 @@
  stop(gettextf("'as.is' has the wrong length %d  != cols = %d",
length(as.is), cols), domain = NA)
   
  -do <- keep & !known # & !as.is
  +do <- keep & !as.is
   if(rlabp) do[1L] <- FALSE # don't convert "row.names"
   for (i in (1L:cols)[do]) {
   data[[i]] <-
(Also in attachment)

I did a test as follows
  d1 <- read.table(stdin())
  "1" TRUE   3.5
  2   NA "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d1, typeof)
  #V1V2V3 
  # "integer" "logical"  "double" 
  is.na(d1)
  # V1V2V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE
  
  d2 <- read.table(stdin(), colClasses=c("integer", "logical", "double"))
  "1" TRUE   3.5
  2   NA "0.1"
  NA  FALSE  0.1
  3   "TRUE" NA

  sapply(d2, typeof)
  #V1V2V3 
  # "integer" "logical"  "double" 
  is.na(d2)
  # V1V2V3
  # [1,] FALSE FALSE FALSE
  # [2,] FALSE  TRUE FALSE
  # [3,]  TRUE FALSE FALSE
  # [4,] FALSE FALSE  TRUE

I think there was a reason to let scan() perform the type conversion; for
example, it may be more efficient. So, if correct,
Re: [Rd] read.csv

2009-06-25 Thread Petr Savicky
I am sorry for not including the attachment mentioned in my
previous email. Attached now. Petr.
--- R-devel/src/library/utils/R/readtable.R 2009-05-18 17:53:08.0 +0200
+++ R-devel-readtable/src/library/utils/R/readtable.R 2009-06-25 10:20:06.0 +0200
@@ -143,9 +143,6 @@
 names(what) <- col.names
 
 colClasses[colClasses %in% c("real", "double")] <- "numeric"
-known <- colClasses %in%
-c("logical", "integer", "numeric", "complex", "character")
-what[known] <- sapply(colClasses[known], do.call, list(0))
 what[colClasses %in% "NULL"] <- list(NULL)
 keep <- !sapply(what, is.null)
 
@@ -189,7 +186,7 @@
stop(gettextf("'as.is' has the wrong length %d  != cols = %d",
  length(as.is), cols), domain = NA)
 
-do <- keep & !known # & !as.is
+do <- keep & !as.is
 if(rlabp) do[1L] <- FALSE # don't convert "row.names"
 for (i in (1L:cols)[do]) {
 data[[i]] <-
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] factor() calls sort.list unnecessarily?

2009-07-05 Thread Petr Savicky
On Fri, Jul 03, 2009 at 07:57:49AM -0700, Martin Morgan wrote:
[...]
> sort.list is always called but used only to determine the order of
> levels, so unnecessary when levels are provided.

I think this is correct. Replacing
  ind <- sort.list(x)
by
  if (missing(levels))
      ind <- sort.list(x)
makes factor() more efficient when the levels parameter is not missing, and
since the variable ind is not needed in this case, I think the modification
in the above form (without unique()) is correct.
The levels parameter is not used between the two tests of missing(levels),
so we get the same result in both cases, as needed.
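
A rough way to see the cost in question (my own sketch; exact timings are
of course machine dependent):

  x <- sample(letters, 1e6, replace = TRUE)
  system.time(factor(x))                   # calls sort.list(x)
  system.time(factor(x, levels = letters)) # with the patch, sort.list is skipped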

> In addition, order of
> levels is for unique values of x only. Perhaps these issues are
> addressed in the patch below? It does require unique() on the original
> argument x, rather than only on as.character(x)

Computing the order of the levels is sufficient for unique levels. However,
if x is numeric, then the operation
  x <- as.character(x)
performs rounding, due to which different but very close numbers
are mapped to the same value. So, the length of unique(x) may change.

A possible solution could be to keep unique(x) for the original values
and refer to it, when constructing the levels. For example, as follows.
  y <- unique(x)
  ind <- sort.list(y)
  y <- as.character(y)
  levels <- unique(y[ind])
  x <- as.character(x)
  f <- match(x, levels)
This is more efficient if the length of unique(x) is significantly
smaller than the length of x. On the other hand, if their lengths are
similar, then computing as.character() on both x and y incurs some
slowdown.

> At the least, perhaps
> sort.list can be called only when levels are not provided?

I support this part of the patch without unique().

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] incorrect result (41/10-1/10)%%1 (PR#13863)

2009-08-03 Thread Petr Savicky
> I get an incorrect result for
> 
> (41/10-1/10)%%1
> 
> [1] 1

Note that due to rounding errors, 41/10-1/10 is
  formatC(41/10-1/10, digits=20) # [1] "3.9999999999999995559"

Besides FAQ 7.31, related information may be found also at
  http://wiki.r-project.org/rwiki/doku.php?id=misc:r_accuracy
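
For completeness, the reported value 1 is just the default 7-digit printing
of a number slightly below 1:

  print((41/10 - 1/10) %% 1, digits=17)
  # [1] 0.99999999999999956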

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Accuracy (PR#13867)

2009-08-04 Thread Petr Savicky
On Tue, Aug 04, 2009 at 04:25:09PM +0200, lueth...@student.ethz.ch wrote:
> Hi
> 
> I created the following vectors:
> 
> p_1=c(0.2,0.2,0.2,0.2,0.1,0.25,0.4,0.1,0.25,0.4,0.1,0.25,0.4,0.1,0.25,0.4,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8)
> p_2=c(0,0,0,0,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.4,0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1)
> 
> As these are probabilities, I calculated the remainder as
> 
> p_3=1-p_1-p_2
> 
> There the values which ought to be 0.1 were lateron not recognised by 
> p_3==0.1,
> but only if I used p_3 <= 0.1.
> 
> The full calculation is actually bigger but the core issue remains: I used
> values input by hand, built a difference and it was wrong.
> 
> I know that exactly the value 0.1 is one that can not be represented using
> binary rep. Maybe that's it, maybe not.

Yes, this is the problem. In this case, one can obtain a correct
result using round()

  p1 <- c(0.2,0.2,0.2,0.2,0.1,0.25,0.4,0.1,0.25,0.4,0.1,0.25,0.4,0.1,
  0.25,0.4,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8)
  p2 <- c(0,0,0,0,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.4,
  0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1)
  p3 <- 1 - p1 - p2
  round(p3, 2) == 0.1
  #  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
  # [13]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
  # [25]  TRUE FALSE FALSE  TRUE

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical(0, -0)

2009-08-10 Thread Petr Savicky
On Sat, Aug 08, 2009 at 10:39:04AM -0400, Prof. John C Nash wrote:
> I'll save space and not include previous messages.
> 
> My 2 cents: At the very least the documentation needs a fix. If it is 
> easy to do, then Ted Harding's suggestion of a switch (default OFF) to 
> check for sign difference would be sensible.
> 
> I would urge inclusion in the documentation of the +0, -0 example(s) if 
> there is NOT a way in R to distinguish these.

It is possible to distinguish 0 and -0 in R, since 1/0 == Inf and
1/(-0) == -Inf.

I do not know whether there are other such situations. In particular
  (0)^(-1) == (-0)^(-1) # [1] TRUE
  log(0) == log(-0) # [1] TRUE
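
Based on the reciprocal trick above, a small helper (with a hypothetical
name is.negzero; my own sketch) can test for negative zero:

  is.negzero <- function(x) (x == 0) & (1/x < 0)
  is.negzero(c(0, -0, -1 * 0, 1e-300 * 1e-300))
  # [1] FALSE  TRUE  TRUE FALSE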

> There are occasions where 
> it is useful to be able to detect things like this (and NaN and Inf and 
> -Inf etc.). They are usually not of interest to users, but sometimes are 
> needed for developers to check edge effects. For those cases it may be 
> time to consider a package FPIEEE754 or some similar name to allow 
> testing and possibly setting of flags for some of the fancier features. 
> Likely used by just a few of us in extreme situations.

I think that distinguishing 0 and -0 may be useful even for nonexpert
users for debugging purposes, mainly because x == y does not imply
that x and y behave equally, as demonstrated above or by
  x <- 0
  y <- -0
  x == y # [1] TRUE
  1/x == 1/y # [1] FALSE

I would like to recall the suggestion
  On Sat, Aug 08, 2009 at 03:04:07PM +0200, Martin Maechler wrote:
  > Maybe we should introduce a function that's basically
  > isTRUE(all.equal(..., tol=0))  {but faster},  or
  > do you want a 3rd argument to identical, say 'method'
  > with default  c("oneNaN", "use.==", "strict")
  > 
  > oneNaN: my proposal of using  memcmp() on doubles as its used for
  >other types already  (and hence distinguishing +0 and -0;
  >  otherwise keeping the feature that there's just one NaN
  >  which differs from 'NA' (and there's just one 'NA').
  > 
  > use.==: the previous R behaviour, using '==' on doubles 
  >   (and the "oneNaN" behavior)
  > 
  > strict: be even stricter than oneNaN:  Use  memcmp()
  >   unconditionally for doubles.  This would be the fastest
  >   version of all three.

In my opinion, for debugging purposes, the option identical(x,y,method="strict"),
which implies that x and y behave equally, could be useful if it were
available in base R.

At the R interactive level, negative zero as the value of -0 could possibly
be avoided. However, negative zero may also occur in numerical calculations,
since it may be obtained as x * 0, where x is negative. So, I think, negative
zero cannot be eliminated from consideration as something too infrequent.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical(0, -0)

2009-08-10 Thread Petr Savicky
On Mon, Aug 10, 2009 at 05:47:57AM -0400, Duncan Murdoch wrote:
> I wouldn't mind a "strict" option.   It would compare bit patterns, so 
> would distinguish +0 from -0, and different NaN values.

I think that a logical option "strict" in the above meaning could be
useful for debugging. The default may be FALSE.

On Mon, Aug 10, 2009 at 10:20:39AM -0400, Duncan Murdoch wrote:
> +0 and -0 are exactly equal, which is what identical is documented to be 
> testing.  They are not indistinguishable, and not identical in the 
> English meaning of the word, but they are identical in the sense of what 
> the identical() function is documented to test.
> 
> The cases where you want to distinguish between them are rare.  They 
> should not be distinguished in the default identical() test, any more 
> than different values of NaN should be distinguished (and identical() is 
> explicitly documented *not* to distinguish those).
[...]

The question whether 0 and -0 are equal or not is not clear, since
they have different reciprocals. However, I agree that distinguishing
the signs of zero is rarely useful. From this point of view, the
default FALSE seems acceptable.

For completeness, let me also add an argument that it would not be too
harmful if the default were TRUE. I think that it is quite rare to have
two larger numerical structures which match up to the last bit in all
numbers, but have a different sign of some zero. Matching all bits
almost requires that the two structures are obtained using the same
expressions for all components. Then the signs of zeros will match as
well. However, I may be wrong.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical(0, -0)

2009-08-11 Thread Petr Savicky
On Tue, Aug 11, 2009 at 10:04:20AM +0200, Martin Maechler wrote:
> > "DM" == Duncan Murdoch 
> > on Mon, 10 Aug 2009 11:51:53 -0400 writes:
> 
> DM> For people who want to play with these, here are some functions that 
> let 
> DM> you get or set the "payload" value in a NaN.  NaN and NA, Inf and 
> -Inf 
> DM> are stored quite similarly; these functions don't distinguish which 
> of 
> DM> those you're working with.  Regular finite values give NA for the 
> DM> payload value, and elements of x are unchanged if you try to set 
> their 
> DM> payload to NA.
> 
> DM> By the way, this also shows that R *can* distinguish different NaN 
> DM> values, but you need some byte-level manipulations.
> 
> yes;  very nice code, indeed!
> 
> I propose a version of the  showBytes()  utility should be added 
> either as an example e.g. in  writeBin() or even an exported
> function in package 'utils'
> 
>  [.]
> 
> > Example:
> 
> >> x <- c(NA, NaN, 0, 1, Inf)
> >> NaNpayload(x)
> > [1]  0.5 -0.5   NA   NA  0.0
> 
> Interestingly, on 64-bit, I get a slightly different answer above, 
> (when all the following code gives exactly the same results,
>  and of course, that was your main point !), namely
> 4.338752e-13 instead of 0.5  for 'NA', 
> see below.
> 
> .. and your nice tools also let me detect an even simpler way
> to get *two* versions of NA, and NaN, each :  
> Conclusion:  Both  NaN  and  NA (well NA_real_) have a sign, too !
> 
> NaNpayload(NA_real_)
> ##[1] 4.338752e-13
> NaNpayload(-NA_real_)
> ##[1] -4.338752e-13   ## !! different
> 
> str(NApm <- c(1[2], -1[2]))
> t(sapply(NApm, showBytes))
> ## [1,]   a2   07   00   00   00   00   f0*   7f
> ## [2,]   a2   07   00   00   00   00   f0*   ff
> 
> ## or summarizing things :
> 
> ## Or, "in summary" -- Duncan's original example slightly extended:
> x <- c(NaN, -NaN, NA, -NA_real_, 0, 0.1, Inf, -Inf)
> x
> names(x) <- format(x)
> sapply(x, showBytes)
> ##   NaN  NaN   NA   NA  0.0  0.1  Inf -Inf
> ## [1,]   00   00   a2   a2   00   9a   00   00
> ## [2,]   00   00   07   07   00   99   00   00
> ## [3,]   00   00   00   00   00   99   00   00
> ## [4,]   00   00   00   00   00   99   00   00
> ## [5,]   00   00   00   00   00   99   00   00
> ## [6,]   00   00   00   00   00   99   00   00
> ## [7,]   f8   f8   f8*  f8*  00   b9   f0   f0
> ## [8,]   ff   7f   7f   ff   00   3f   7f   ff
> 
> ## (*) NOTE: the  'f0*' or 'f8*' above  are
> ## ---   'f8' on 32-bit,  'f0' on 64-bit
> 
> 
> 
> >> NaNpayload(x) <- -0.4
> >> x
> > [1] NaN NaN NaN NaN NaN
> >> y <- x
> >> NaNpayload(y) <- 0.6
> >> y
> > [1] NaN NaN NaN NaN NaN
> >> NaNpayload(x)
> > [1] -0.4 -0.4 -0.4 -0.4 -0.4
> >> NaNpayload(y)
> > [1] 0.6 0.6 0.6 0.6 0.6
> >> identical(x, y)
> > [1] TRUE
> 

The above examples convince me that the default behavior of identical()
should not be based on bit patterns, since the differences between
different NaN's or even different NA's are irrelevant unless we
use the bit manipulations explicitly.

Let me suggest the following short description in ?identical

  The safe and reliable way to test two objects for being equal in
  structure, types of components and their values. It returns 'TRUE' in
  this case, 'FALSE' in every other case.

and replacing the paragraph

   'identical' sees 'NaN' as different from 'NA_real_', but all
   'NaN's are equal (and all 'NA' of the same type are equal).

in ?identical by

  Comparison of objects of numeric type uses '==' for comparison of their
  components. This means that the values of the components rather
  than their machine representation is compared. In particular,
  '0' and '-0' are considered equal, all 'NA's of the same type are
  equal and all 'NaN's are equal, although their bit patterns may 
  differ in some cases. 'NA' and 'NaN' are always different. Note 
  also that 1/0 and 1/(-0) are different.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical(0, -0)

2009-08-12 Thread Petr Savicky
Let me add the following to the discussion of identical(0, -0).

I would like to suggest to replace the paragraph

  'identical' sees 'NaN' as different from 'NA_real_', but all
  'NaN's are equal (and all 'NA' of the same type are equal).

in ?identical by the following text, which is a correction of my previous
suggestion for the same paragraph

  Components of numerical objects are compared as follows. For non-missing
  values, "==" is used. In particular, '0' and '-0' are considered equal. 
  All 'NA's of the same type are equal and all 'NaN's are equal, although 
  their bit patterns may differ in some cases. 'NA' and 'NaN' are always 
  different. Note also that 1/0 and 1/(-0) are different.

The suggestion for the default of identical(0, -0) is TRUE, because the
negative zero is much less important than NA and NaN and, possibly,
distinguishing 0 and -0 could even be deprecated. Moreover, the argument
of the efficiency of memcmp cannot be used here, since there are different
variants of NaN and NA, which should not be distinguished by default.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical(0, -0)

2009-08-12 Thread Petr Savicky
On Wed, Aug 12, 2009 at 04:02:28PM +0200, Martin Maechler wrote:
> >>>>> "PS" == Petr Savicky 
> >>>>> on Wed, 12 Aug 2009 13:50:46 +0200 writes:
> 
> PS> Let me add the following to the discussion of identical(0, -0).
> PS> I would like to suggest to replace the paragraph
> 
> PS> 'identical' sees 'NaN' as different from 'NA_real_', but all
> PS> 'NaN's are equal (and all 'NA' of the same type are equal).
> 
> PS> in ?identical by the following text, which is a correction of my 
> previous
> PS> suggestion for the same paragraph
> 
> > Components of numerical objects are compared as follows. For non-missing
> > values, "==" is used. In particular, '0' and '-0' are considered equal. 
> > All 'NA's of the same type are equal and all 'NaN's are equal, although 
> > their bit patterns may differ in some cases. 'NA' and 'NaN' are always 
> > different. 
> > Note also that 1/0 and 1/(-0) are different.
> 
> the 'numerical' would have to be qualified ('double', 'complex'
> via double), as indeed,  memcmp() is used on integers
> 
> The last sentence is not necessary and probably even confusing:
> Of course, -Inf and Inf are different.

I agree.

> PS> The suggestion for the default of identical(0, -0) is TRUE, because 
> the
> PS> negative zero is much less important than NA na NaN and, possibly,
> PS> distinguishing 0 and -0 could even be deprecated.
> 
> What should that mean??  R *is* using the international floating
> point standards, and 0 and -0 exist there and they *are*
> different!

I am sorry for being too brief. In my opinion, distinguishing 0 and -0 is
not useful enough to make the default behavior of identical() different
from the behavior of == in this case.

> If  R  would start --- with a performance penalty, btw ! ---
> to explicitly map all internal '-0' into '+0'  we would
> explicitly move away from the international FP standards... 
> no way!

Yes, I agree. I did not mean this.

> PS> Moreover, the argument
> PS> of efficiency of memcmp cannot be used here, since there are different
> PS> variants of NaN and NA, which should not be distinguished by default.
> 
> your argument is only partly true... as memcmp() can still be
> used instead of '=='  *after* the NA-treatments  {my current
> patch does so},

OK. In this case, memcmp() could still be faster than ==, although 
this is beyond my knowledge.

> and even more as I have been proposing an option "strict"  which
> would only use memcmp()  {and hence also distinguish different
> NA, NaN's}.

I understand the previous messages in this thread to mean that there is
agreement that such an option would be very useful and would lead
to faster comparisons.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] user supplied random number generators

2009-08-18 Thread Petr Savicky
On Mon, Aug 17, 2009 at 12:25:57PM -0700, Ross Boylan wrote:
> > Seeding by a vector is also available in the initialization of Mersenne 
> > Twister
> > from 2002. See mt19937ar.c (ar for array) at
> >   http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
> > Unfortunately, seeding by a vector is not available in R base. R uses
> > Mersenne Twister, but with an initialization by a single number.
> I think that one could write to .Random.seed directly to set a vector
> for many of the generators.  ?.Random.seed does not recommend this and
> notes various limits and hazards of this strategy.

Initialization of a generator by writing the initial state into .Random.seed
is possible, but requires knowing the exact form of the admissible states
of the generator. An error leads to a corrupted sequence. So, this is indeed
not recommended.

By seeding by a vector, I mean something different. The user provides
a vector of arbitrary length and this vector is used as an input to 
a hash function, which produces a correct initial state. The advantage
over seeding by a single number is that the set of possible initial states
is much larger already with a seed of length 2 or 3.

Seeding by a vector is available, for example, in Mersenne Twister from 2002
using the C function init_by_array() in mt19937ar.c.

In the package rngwell19937, the R function set.vector.seed() may be used
for the same purpose.
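
A minimal usage sketch (assuming the rngwell19937 package is installed;
the exact range of admissible seed values is described in its
documentation):

  library(rngwell19937)
  set.vector.seed(c(2009, 8, 18))  # a length-3 seed hashed into a full state
  runif(3)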

I will reply to the rest of your message in a private email (as was also
the email you were replying to).

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical(0, -0)

2009-08-22 Thread Petr Savicky
On Sat, Aug 22, 2009 at 12:00:44AM +0200, Martin Maechler wrote:
> I have taken up the issue now,
> and after thinking, studying the source, trying to define a
> 'method = ' argument, came to the conclusion that both
> the implementation and documentation (and source code "self-explanation")
> are easiest to program, maintain, and understand,
> if I introduce explicit binary switches,
> so I now  propose the following R-level interface which keeps
> the current behavior the default:
> 
> >> Usage:
> >> 
> >>  identical(x, y, num.EQ = TRUE, one.NA = TRUE, attrib.asSet = TRUE)
> >> 
> >> Arguments:
> >> 
> >> x, y: any R objects.
> >> 
> >>   num.EQ: logical indicating if ('double' and 'complex' non-'NA')
> >>   numbers should be compared using '==', or by bitwise
> >>   comparison.  The latter (non-default) differentiates between
> >>   '-0' and '+0'.
> >> 
> >>   one.NA: logical indicating if there is conceptually just one numeric
> >>   'NA' and one 'NaN';  'one.NA = FALSE' differentiates bit
> >>   patterns.
> >> 
> >> attrib.asSet: logical indicating if 'attributes' of 'x' and 'y' should
> >>   be treated as _unordered_ tagged pairlists ("sets"); this
> >>   currently also applies to 'slot's of S4 objects.  It may well
> >>   be too strict to set 'attrib.asSet = FALSE'. 

I appreciate having several binary switches. Besides the arguments above,
this is also useful for interactive use of identical(), for example,
for debugging purposes. If there is a difference between the objects, then
the switches allow one to get more information about the type
of the difference.

> I'm open for better names of arguments, but will not accept "_"
> in the argument names {just my taste; no reason for argueing...}.

I would slightly prefer one.NaN instead of one.NA. In IEEE 754 terminology,
R's 'NA's are a subset of 'NaN's, so NaN is the more general notion,
although in R the sets of 'NA's and 'NaN's are disjoint. Moreover, the
name one.NaN makes it clearer that the issue matters only
for numeric types and not, for example, for integer.
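
With such an interface, the distinctions discussed in this thread could be
requested explicitly; a hypothetical sketch using the argument names
proposed above (as proposed, not necessarily as released):

  identical(0, -0)                      # TRUE under the default num.EQ = TRUE
  identical(0, -0, num.EQ = FALSE)      # FALSE: bitwise comparison sees the sign
  identical(NaN, -NaN, one.NA = FALSE)  # FALSE: different NaN bit patterns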

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Unexpected behaviour of seq(from,to,by) (PR#14057)

2009-11-11 Thread Petr Savicky
> # Hi there.
> # I'm not sure whether or not this is a bug.

No, this is not a bug. See FAQ 7.31 at
  http://cran.r-project.org/doc/FAQ/R-FAQ.html
and the comments below.

> # But it surely is an unexpected behaviour.
> 
> V <- seq(from=0,to=1,by=0.1)
> 
> # Should generate a sequence with a step of 0.1
> 
> V==0
> # [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # Ok!
> 
> V==0.1
> # [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # Ok!
> 
> V==0.6
> # [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> # None?
> 
> V[7]
> # [1] 0.6
> V[7]==0.6
> # [1] FALSE
> # Rounding!?

Yes. The sequence of operations which leads to V[7] produces a
slightly different rounding error than the direct conversion of 0.6
to binary representation. So, we have
  formatC(V[7], digits=20) # [1] "0.60000000000000008882"
  formatC(0.6, digits=20) # [1] "0.59999999999999997780"

See 
  http://wiki.r-project.org/rwiki/doku.php?id=misc:r_accuracy
for more examples and explanations.
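
If such sequence values have to be compared with a given number, a
tolerance-based comparison is safer than '=='; a small sketch:

  V <- seq(from=0, to=1, by=0.1)
  which(abs(V - 0.6) < 1e-9)    # [1] 7: finds the element that '==' misses
  isTRUE(all.equal(V[7], 0.6))  # TRUE: equality up to a small tolerance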

Petr Savicky. 

> V[8]
> # [1] 0.7
> V[8]==0.7
> # [1] FALSE
> # Rounding!?
> 
> V[9]
> # [1] 0.8
> V[9]==0.8
> # [1] TRUE
> # Back to normal
> 
> # The generated sequence is fine for all values except for 0.6
> # and 0.7, which appear to be somewhat off the expected value.
> 
> # According to the R manual the sequence is generated in the form:
> # from, from+by, ..., to
> # which is not the case.
> 
> # Either the function or the documentation lead to an unexpected
> # behaviour of seq(from,to,by).
> 
> # Thanks!
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] cor.test(method = spearman, exact = TRUE) not exact (PR#14095)

2009-11-30 Thread Petr Savicky
On Mon, Nov 30, 2009 at 04:00:12AM +0100, dsim...@gmail.com wrote:
> > a <- c(1:10)
> > b <- c(1:10)
> > cor.test(a, b, method = "spearman", alternative = "greater", exact = TRUE)
> 
> Spearman's rank correlation rho
> 
> data:  a and b 
> S = 0, p-value < 2.2e-16
> alternative hypothesis: true rho is greater than 0 
> sample estimates:
> rho 
>   1 
> 
> > 1 / factorial(10)
> [1] 2.755732e-07
> 
> Since we have perfect rank correlation and only one permutation out of 10! 
> could
> give this for N = 10, the p-value should be 1/10!.  Reading the code in 
> prho.c,
> it appears that the "exact" calculation uses the Edgeworth approximation for 
> N >
> 9.  This makes sense because, for similar examples with N <= 9, the results 
> are
> as expected (1 / N!).

The Edgeworth approximation is not exact, although the error is quite small.
Computing the exact values has time complexity roughly 2^n if Ryser's formula
for the permanent is used. In cases where one needs the exact values for small n,
it is possible to use the package pspearman at CRAN, which includes precomputed
values up to n=22. See also the paper
  M.A. van de Wiel and A. Di Bucchianico, Fast computation of the exact
  null distribution of Spearman's rho and Page's L statistic for samples
  with and without ties, J. Stat. Plann. Inf. 92 (2001), pp. 133-145.
which deals with this question.

  library(pspearman)
  a <- c(1:10)
  b <- c(1:10)
  out <- spearman.test(a, b, alternative = "greater", approximation="exact")
  out
  # Spearman's rank correlation rho
  # data:  a and b 
  # S = 0, p-value = 2.756e-07
  # alternative hypothesis: true rho is greater than 0 
  # sample estimates:
  # rho 
  #   1 
  out$p.value - 1/factorial(10) # [1] 0

PS.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] R CMD check may not detect a code/documentation mismatch

2009-12-13 Thread Petr Savicky
For the package at
  http://www.cs.cas.cz/~savicky/R-devel/something_0.0.0.tar.gz
which is a minor part of some other package, included only to demonstrate the
problem, I get (under R version 2.11.0 Under development 2009-12-12 r50714
and also under R-2.9.2, openSUSE 11.1 (x86_64) and CentOS release 5.2)

  R CMD check something_0.0.0.tar.gz

  ...
  * checking Rd files ... OK
  * checking Rd metadata ... OK
  * checking Rd cross-references ... OK
  * checking for missing documentation entries ... OK
  * checking for code/documentation mismatches ... OK
  * checking Rd \usage sections ... OK
  * checking examples ... NONE
  * checking PDF version of manual ... OK

although the package code contains

  testCoreNA <- function()

and the documentation contains

  \usage{
  testCoreClass(verbose=0)
  testCoreAttrEval(verbose=0)
  testCoreReg(verbose=0)
  testCoreNA(verbose=0)
  }

There is a mismatch between the code and the documentation of testCoreNA(). Is the
problem caused by having four entries in the \usage{} section?

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R CMD check may not detect a code/documentation mismatch

2009-12-14 Thread Petr Savicky
On Mon, Dec 14, 2009 at 09:24:12AM +0100, Kurt Hornik wrote:
> >>>>> Peter Dalgaard writes:
[...]
> > Hmm, looks more like a thinko in this code inside codoc():
> 
> >  functions_in_code <- Filter(function(f) {
> >  f <- get(f, envir = code_env)
> >  is.function(f) && (length(formals(f)) > 0L)
> >  }, objects_in_code)
> 
> > which, further down the line, causes functions with no formal arguments 
> > to be skipped when compared to the usage section.
> 
> > Browse[2]>
> > debug: ind <- (!functions %in% functions_to_be_ignored & functions %in%
> >  functions_in_code)
> > Browse[2]> functions
> > [1] "testCoreClass""testCoreAttrEval" "testCoreReg" 
> > "testCoreNA"
> > Browse[2]>
> > debug: bad_functions <- mapply(functions[ind], exprs[ind], FUN = 
> > function(x,
> >  y) check_codoc(x, as.pairlist(as.alist.call(y[-1L]))), SIMPLIFY = 
> > FALSE)
> > Browse[2]> ind
> > [1]  TRUE  TRUE  TRUE FALSE
> 
> > I.e. testCoreNA is never tested by check_codoc. There may of course be
> > a rationale for this, but it escapes me...
> 
> Well, I am sure I had good reasons when I wrote the code many years ago,
> but of course I no longer recall what they were.
> 
> Did you try the effect of removing the length(formals(f)) test?

I understand this as asking what the influence of the following patch is
(indented by two spaces below).

  diff --minimal -U 3 -r R-devel_2009-12-13/src/library/tools/R/QC.R 
R-devel_length_formals/src/library/tools/R/QC.R
  --- R-devel_2009-12-13/src/library/tools/R/QC.R   2009-10-27 
02:59:10.0 +0100
  +++ R-devel_length_formals/src/library/tools/R/QC.R   2009-12-14 
18:45:55.0 +0100
  @@ -398,7 +398,7 @@
   functions_in_code <-
   Filter(function(f) {
  f <- get(f, envir = code_env)
  -   is.function(f) && (length(formals(f)) > 0L)
  +   is.function(f)
  },
  objects_in_code)
   ## Sourcing all R code files in the package is a problem for base,
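
The effect of the length(formals(f)) test can be reproduced at the R level;
a minimal sketch (the function names are taken from the example package
above, the rest is illustrative):

  code_env <- new.env()
  assign("testCoreNA",  function() NULL,            envir = code_env)
  assign("testCoreReg", function(verbose = 0) NULL, envir = code_env)
  objects_in_code <- ls(code_env)
  Filter(function(f) {
      f <- get(f, envir = code_env)
      is.function(f) && (length(formals(f)) > 0L)
  }, objects_in_code)
  # [1] "testCoreReg" -- the zero-argument testCoreNA is filtered out,
  # so it is never compared against the \usage section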

Since I do not know this part of the R code, I simply applied the patch to the
current R-devel, ran "make" and "make check" (both OK), and compared
the output of

  R CMD check XML_2.6-0.tar.gz
  R CMD check randomForest_4.5-33.tar.gz
  R CMD check tree_1.0-27.tar.gz
  R CMD check survival_2.35-7.tar.gz

with R being both the original R-devel and the modified one. The results were
identical in all four cases (empty diff) on two Linux machines (CentOS and
openSUSE).

On the other hand, for the package
  http://www.cs.cas.cz/~savicky/R-devel/something_0.0.0.tar.gz
which demonstrates the problem, the modified R detected the code/doc mismatch.
The diff of the two outputs of R CMD check was

  [savi...@uivtx test]$ diff original modified 
  35c35,42
  < * checking for code/documentation mismatches ... OK
  ---
  > * checking for code/documentation mismatches ... WARNING
  > Codoc mismatches from documentation object 'testCore':
  > testCoreNA
  >   Code: function()
  >   Docs: function(verbose = 0)
  >   Argument names in docs not in code:
  > verbose
  > 
  39a47,50
  > WARNING: There was 1 warning, see
  >   /home/savicky/tmp/test/something.Rcheck/00check.log
  > for details
  > 

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Accuracy (PR#14139)

2009-12-14 Thread Petr Savicky
On Mon, Dec 14, 2009 at 06:10:16PM +0100, ber...@lycos.com wrote:
> 
> > pnorm(1.35,0,1)
> [1] 0.911492
> > pnorm(1.36,0,1)
> [1] 0.913085
> 
> > options(digits=4)
> 
> > pnorm(1.35,0,1)
> [1] 0.9115
> > pnorm(1.36,0,1)
> [1] 0.913  rounding error?

The technical explanation is as follows. If options(digits=k) is set,
then the number of significant digits for printing a single number x is 
determined as min(k, d), where d is the minimum number of digits
for which the relative error of the printed number is less than 10^-k.

If we have
  x <- 0.913085
  y <- 0.913
then the relative error of y as an approximation of x is
  abs(y - x)/x # [1] 9.3091e-05
Since this is less than 10^-4, the 3 digit precision is chosen
for printing x.

A safer way of rounding is to use functions round() and signif().
For example,
  round(x, digits=4) # [1] 0.9131

I do not know the history of the R printing algorithm. It is designed
primarily for printing vectors, where the rules are more complicated
in order to achieve a good unified format for all numbers. Maybe someone else
can say more about it. The above analysis may be obtained by inspecting
the R source code.
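
The rule can be checked directly; a small sketch using the values from the
report above:

  print(0.913085, digits=4)   # [1] 0.913  (3 digits: rel. error below 1e-4)
  print(0.911492, digits=4)   # [1] 0.9115 (here 4 digits are needed)
  round(0.913085, digits=4)   # [1] 0.9131 (round() keeps 4 decimal places)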

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] RFC: lchoose() vs lfactorial() etc

2009-12-15 Thread Petr Savicky
On Tue, Dec 15, 2009 at 09:49:28AM +0100, Martin Maechler wrote:
> lgamma(x) and lfactorial(x) are defined to return
> 
>  ln|Gamma(x)| {= log(abs(gamma(x)))} or  ln|Gamma(x+1)| respectively.
> 
> Unfortunately,  we haven't chosen the analogous definition for 
> lchoose().
> 
> So, currently 
> 
>   > lchoose(1/2, 1:10)
>    [1] -0.6931472 -2.0794415        NaN -3.2425924        NaN -3.8869494
>    [7]        NaN -4.3357508        NaN -4.6805913
>   Warning message:
>   In lchoose(n, k) : NaNs produced
>   > 
> 
> which (the NaN's) is not particularly useful.
> (I have use case where I really have to workaround those NaNs.)
> 
> I herebey propose to *amend* the definition of lchoose() such
> that it behaves analogously to  lgamma() and lfactorial(),
> i.e., to return
> 
>log(abs(choose(.,.))
> 
> Your comments are welcome.
> Martin Maechler, ETH Zurich

I do not have a strong opinion on whether lchoose() should be log(choose(.,.))
or log(abs(choose(.,.))), although I would slightly prefer the latter (with 
abs()).
One reason is that this would simplify the code of the function
lchoose() in src/nmath/choose.c: it produces the NaN by explicitly testing
the sign, so this test could simply be removed.
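
Until the C code is changed, the proposed behavior can be emulated at the
R level; an illustrative sketch (valid as long as choose() itself does not
overflow):

  lachoose <- function(n, k) log(abs(choose(n, k)))  # hypothetical helper
  lachoose(1/2, 1:10)  # finite values, no NaNs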

Let me also point out that the current implementation seems to contain
a bug which causes it to be neither log(choose(.,.)) nor
log(abs(choose(.,.))).

  x <- choose(1/2, 1:10)
  y <- lchoose(1/2, 1:10)
  cbind(choose=x, lchoose=y, log.choose=log(abs(x)))
  #             choose    lchoose log.choose
  #  [1,]  0.500000000 -0.6931472 -0.6931472
  #  [2,] -0.125000000 -2.0794415 -2.0794415
  #  [3,]  0.062500000        NaN -2.7725887
  #  [4,] -0.039062500 -3.2425924 -3.2425924
  #  [5,]  0.027343750        NaN -3.5992673
  #  [6,] -0.020507812 -3.8869494 -3.8869494
  #  [7,]  0.016113281        NaN -4.1281114
  #  [8,] -0.013092041 -4.3357508 -4.3357508
  #  [9,]  0.010910034        NaN -4.5180723
  # [10,] -0.009273529 -4.6805913 -4.6805913

So, except for x[1], we get NaN in those cases where choose() is positive.
The reason seems to be the test at line 89 of src/nmath/choose.c, which is

if (fmod(floor(n-k+1), 2.) == 0) /* choose() < 0 */
return ML_NAN;

but it should be negated. I tested it using

if (fmod(floor(n-k+1), 2.) != 0) /* choose() < 0 */
return ML_NAN;

when we get

              choose    lchoose log.choose
   [1,]  0.500000000 -0.6931472 -0.6931472
   [2,] -0.125000000        NaN -2.0794415
   [3,]  0.062500000 -2.7725887 -2.7725887
   [4,] -0.039062500        NaN -3.2425924
   [5,]  0.027343750 -3.5992673 -3.5992673
   [6,] -0.020507812        NaN -3.8869494
   [7,]  0.016113281 -4.1281114 -4.1281114
   [8,] -0.013092041        NaN -4.3357508
   [9,]  0.010910034 -4.5180723 -4.5180723
  [10,] -0.009273529        NaN -4.6805913

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] RFC: lchoose() vs lfactorial() etc

2009-12-17 Thread Petr Savicky
only for integers, since choose(n, k) is a polynomial of
degree k. So, for different values of k, we get different functions
on any interval of positive length.

The division at the line
r /= j;
always produces an exact integer, since during the whole cycle, r is always
either an integer binomial coefficient or its integer multiple. So, the
division has an integer result and there is no rounding error unless the
intermediate result does not fit into 53-bit precision.
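
An R transcription of the loop may make this concrete (a sketch only; it
assumes k >= 1, with smaller k handled separately in the C code):

  choose_loop <- function(n, k) {
      r <- n                       # r == choose(n, 1)
      for (j in seq_len(k)[-1]) {  # j = 2, ..., k
          r <- r * (n - j + 1)     # now r == j * choose(n, j), an integer
                                   # whenever n is an integer
          r <- r / j               # hence this division is exact
      }
      r
  }
  choose_loop(10, 3)  # 120, the same as choose(10, 3)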

Using the patched C code, we get

  options(digits=15)
  n <- 5 + (-15:15)*1e-8
  cbind(n, choose(n, 5))
  #               n   choose(n, 5)
  #  [1,] 4.9985 0.99657500042
  #  [2,] 4.9986 0.9968070
  #  [3,] 4.9987 0.99703166698
  #  [4,] 4.9988 0.9972627
  #  [5,] 4.9989 0.99748833356
  #  [6,] 4.9990 0.9977185
  #  [7,] 4.9991 0.99794500014
  #  [8,] 4.9992 0.9981745
  #  [9,] 4.9993 0.99840166677
  # [10,] 4.9994 0.9986308
  # [11,] 4.9995 0.9988589
  # [12,] 4.9996 0.9990870
  # [13,] 4.9997 0.9993152
  # [14,] 4.9998 0.9995435
  # [15,] 4. 0.9997717
  # [16,] 5. 1.000
  # [17,] 5.0001 1.0002284
  # [18,] 5.0002 1.0004567
  # [19,] 5.0003 1.0006851
  # [20,] 5.0004 1.0009136
  # [21,] 5.0005 1.00114166671
  # [22,] 5.0006 1.0013706
  # [23,] 5.0007 1.00159833342
  # [24,] 5.0008 1.0018280
  # [25,] 5.0009 1.00205500016
  # [26,] 5.0010 1.0022853
  # [27,] 5.0011 1.00251166690
  # [28,] 5.0012 1.0027427
  # [29,] 5.0013 1.00296833365
  # [30,] 5.0014 1.00319666704
  # [31,] 5.0015 1.00342500042

Within the chosen accuracy 1e-8, there is no visible jump.
With 1e-15, we get

  options(digits=15)
  n <- 5 + (-15:15)*1e-15
  cbind(n, choose(n, 5))
  #              n   choose(n, 5)
  #  [1,] 4.98 0.966
  #  [2,] 4.99 0.967
  #  [3,] 4.99 0.970
  #  [4,] 4.99 0.972
  #  [5,] 4.99 0.976
  #  [6,] 4.99 0.978
  #  [7,] 4.99 0.980
  #  [8,] 4.99 0.982
  #  [9,] 4.99 0.984
  # [10,] 4.99 0.986
  # [11,] 4.99 0.988
  # [12,] 5.00 0.990
  # [13,] 5.00 0.994
  # [14,] 5.00 0.996
  # [15,] 5.00 0.998
  # [16,] 5.00 1.000
  # [17,] 5.00 1.002
  # [18,] 5.00 1.004
  # [19,] 5.00 1.006
  # [20,] 5.00 1.010
  # [21,] 5.01 1.012
  # [22,] 5.01 1.014
  # [23,] 5.01 1.016
  # [24,] 5.01 1.018
  # [25,] 5.01 1.020
  # [26,] 5.01 1.022
  # [27,] 5.01 1.024
  # [28,] 5.01 1.028
  # [29,] 5.01 1.031
  # [30,] 5.01 1.032
  # [31,] 5.02 1.035

So, the jump at the exact integer value n[16] == 5 is comparable
to machine accuracy.

I would be pleased to provide more tests if some of the above solutions
or their modifications can be accepted.

Petr Savicky.

--- R-devel_2009-12-16/src/nmath/choose.c   2008-03-17 17:52:35.0 
+0100
+++ R-devel_less_conservative/src/nmath/choose.c2009-12-17 
22:43:54.0 +0100
@@ -107,19 +107,20 @@
 #endif
 if (fabs(k - k0) > 1e-7)
MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, 
k);
 if (k < k_small_max) {
int j;
-   if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */
+   if(n-k < k && n >= 0 && (n == floor(n))) k = n-k; /* <- Symmetry */
if (k <  0) return 0.;
if (k == 0) return 1.;
/* else: k >= 1 */
r = n;
-   for(j = 2; j <= k; j++)
-   r *= (n-j+1)/j;
-   return R_IS_INT(n) ? floor(r + 0.5) : r;
-   /* might have got rounding errors */
+   for(j = 2; j <= k; j++) {
+   r *= (n-j+1);
+   r /= j;
+   }
+   return r;
 }
 /* else: k >= k_small_max */
 if (n < 0) {
r = choose(-n+ k-1, k);
if (ODD(k)) r = -r;
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] choose(n,k) when n is almost integer

2009-12-22 Thread Petr Savicky
 the machine which I currently have available. I will
perform the whole test later.

Both patches and also the original implementation sometimes generate a warning
of the following type

  > choose(19 - 2e-7, 30)
  [1] -3.328339e-16
  Warning message:
  In choose(19 - 2e-07, 30) :
full precision may not have been achieved in 'lgamma'

These warnings are generated in functions called from choose() and the intention
of this patch is not to change this. Most of these warnings are eliminated
in the original implementation by treating numbers that differ from an
integer by less than 1e-7 as integers. However, there are some such cases
even if the difference is slightly larger than 1e-7, like the 2e-7 above.

The cases in which the above warning is generated are the same for both
patches A and B in the tests I did.

Patch B is significantly more accurate in cases where the result is more than 1.

In order to verify these two claims, I used the test script at
  http://www.cs.cas.cz/~savicky/R-devel/test_choose_0.R

Its outputs with A and B are presented below. The list of cases with warning 
   Cases with warning
[,1] [,2]
   [1,] 6715   31
   [2,]  152   33
   [3,]  152   34
   [4,] 6393   37
   [5,] 9388   39
   [6,] 6059   40
   [7,] 8773   40
   [8,] 5058   44
   [9,] 5531   44
  [10,] 6670   46
  [11,]  652   47
  [12,] 2172   49
  [13,] 3371   49
  [14,] 1913   50
is the same in all four tests. So, I only include the tables of errors, which
are different. Each table contains columns "bound" and "error". The error is
the maximum error achieved in cases where the compared values are at most the
bound on the same row and more than the bounds on the previous rows.

Patch A on Intel extended arithmetic

  minimum log2() of a wrong result for integer n : 53.95423 
  maximum error for real n :
        bound        error
  [1,] 0e+00 1.206026e-08
  [2,] 1e-20 2.699943e-08
  [3,] 1e-15 3.444711e-08
  [4,] 1e-10 2.806709e-08
  [5,] 1e-05 8.395916e-09
  [6,] 1e+00 3.695634e-07
  [7,]   Inf 2.990987e-07

Patch A on SSE arithmetic

  minimum log2() of a wrong result for integer n : 49.94785 
  maximum error for real n :
        bound        error
  [1,] 0e+00 1.206026e-08
  [2,] 1e-20 2.699943e-08
  [3,] 1e-15 3.444709e-08
  [4,] 1e-10 2.806707e-08
  [5,] 1e-05 8.395951e-09
  [6,] 1e+00 3.695634e-07
  [7,]   Inf 2.990987e-07

Patch B on Intel extended arithmetic

  minimum log2() of a wrong result for integer n : 53.95423 
  maximum error for real n :
        bound        error
  [1,] 0e+00 2.047988e-09
  [2,] 1e-20 2.699943e-08
  [3,] 1e-15 3.444711e-08
  [4,] 1e-10 2.806709e-08
  [5,] 1e-05 8.395916e-09
  [6,] 1e+00 1.207856e-11
  [7,]   Inf 1.509903e-14

Patch B on SSE arithmetic

  minimum log2() of a wrong result for integer n : 49.94785 
  maximum error for real n :
        bound        error
  [1,] 0e+00 2.047988e-09
  [2,] 1e-20 2.699943e-08
  [3,] 1e-15 3.444709e-08
  [4,] 1e-10 2.806707e-08
  [5,] 1e-05 8.395951e-09
  [6,] 1e+00 1.205369e-11
  [7,]   Inf 2.731149e-14

We can see that B has a smaller error if the output is more than 1.

For integer n, we can see that the smallest output value for which we do
not get the exact result is the same for A and B. There is a difference
between Intel extended arithmetic and SSE for clear reasons.

I suggest using B and I would appreciate comments.

Petr Savicky.

--- R-devel-orig-sse/src/nmath/choose.c 2009-12-17 17:52:39.0 +0100
+++ R-devel-work-copy-1-sse/src/nmath/choose.c  2009-12-22 09:06:32.0 
+0100
@@ -109,6 +109,7 @@
MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, 
k);
 if (k < k_small_max) {
int j;
+   if (R_IS_INT(n)) n = floor(n + 0.5);
if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */
if (k <  0) return 0.;
if (k == 0) return 1.;
@@ -126,6 +127,7 @@
return r;
 }
 else if (R_IS_INT(n)) {
+   n = floor(n + 0.5);
if(n < k) return 0.;
if(n - k < k_small_max) return choose(n, n-k); /* <- Symmetry */
return floor(exp(lfastchoose(n, k)) + 0.5);
--- R-devel-orig-sse/src/nmath/choose.c 2009-12-17 17:52:39.0 +0100
+++ R-devel-work-copy-2-sse/src/nmath/choose.c  2009-12-22 13:49:33.0 
+0100
@@ -93,13 +93,14 @@
 return lfastchoose(n, k);
 }
 
+#define IS_INT(x)((x) == floor(x))
 #define k_small_max 30
 /* 30 is somewhat arbitrary: it is on the *safe* side:
  * both speed and precision are clearly improved for k < 30.
 */
 double choose(double n, double k)
 {
-double r, k0 = k;
+double r, k0 = k, n1;
 k = floor(k + 0.5);
 #ifdef IEEE_754
 /* NaNs propagated correctly */
@@ -109,14 +110,14 @@
MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, 
k);
 if (k < k_small_max) {
int j;
-   if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */

Re: [Rd] choose(n,k) when n is almost integer

2009-12-22 Thread Petr Savicky
On Sat, Dec 19, 2009 at 10:02:49PM +0100, Martin Maechler wrote:
[...]
> Have you tried 'make check-devel' (or 'check-all') also with the 
> progressive change?

The patch-B.txt from my previous email passed "make check-all" with the
exception of the Tcl/Tk support, which is missing on my computer.
The test was run both with Intel extended and SSE arithmetic.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] choose(n,k) when n is almost integer

2009-12-23 Thread Petr Savicky
In a previous email, I suggested two patches A and B to choose(n, k), which
solve some of its problems but keep some of the inaccuracies of the original
implementation. I would like to suggest another patch, which I will call C
(patch-C.txt in an attachment), which eliminates the warnings sometimes obtained
from the original implementation and which is more accurate in all ranges of
the output.

For testing patch C, a simpler script is sufficient, since we need not take
care of the warnings. Namely
  http://www.cs.cas.cz/~savicky/R-devel/test_choose_1.R
which produces the output (the error is the relative error)

  > source("test_choose_1.R")
  k <= 0 max err = 0 
  k <= 10 max err = 1.332268e-15 
  k <= 20 max err = 2.442491e-15 
  k <= 30 max err = 3.774758e-15 
  k <= 40 max err = 2.553513e-14 
  k <= 50 max err = 2.88658e-14 
  k <= 60 max err = 3.197442e-14 
  k <= 70 max err = 4.396483e-14 
  k <= 80 max err = 4.685141e-14 
  k <= 90 max err = 4.907186e-14 
  k <= 100 max err = 5.084821e-14 
  k <= 110 max err = 5.373479e-14 
  k <= 120 max err = 5.551115e-14 
  k <= 130 max err = 7.41629e-14 
  k <= 140 max err = 9.592327e-14 
  k <= 150 max err = 9.636736e-14 
  k <= 160 max err = 9.725554e-14 
  k <= 170 max err = 9.947598e-14 
  k <= 180 max err = 1.04583e-13 
  k <= 190 max err = 1.088019e-13 
  k <= 200 max err = 1.090239e-13 
  minimum log2() of a wrong result for integer n : 53.32468 
  maximum error for real n : 1.090239e-13 

Increasing the accuracy of choose(n, k) for n almost an integer required
additional transformations beyond those already used in the code. I will
work out a description of these transformations and send a link to it.
Like patches A and B, patch C does not modify lchoose().

It should be pointed out that choose(n, k) for non-integer n is mainly
needed if n is a rational number like 1/2, 1/3, 2/3, etc. However, making
choose(n, k) accurate for all inputs seems not too complex, as patch C
and its test results show.

I appreciate comments on patch C.

Petr Savicky.

--- R-devel-orig-intel/src/nmath/choose.c   2009-12-17 17:52:39.0 
+0100
+++ R-devel-work-copy-3-intel/src/nmath/choose.c2009-12-23 
21:59:40.0 +0100
@@ -93,13 +93,14 @@
 return lfastchoose(n, k);
 }
 
+#define IS_INT(x)((x) == floor(x))
 #define k_small_max 30
 /* 30 is somewhat arbitrary: it is on the *safe* side:
  * both speed and precision are clearly improved for k < 30.
 */
 double choose(double n, double k)
 {
-double r, k0 = k;
+double r, k0 = k, l, aux;
 k = floor(k + 0.5);
 #ifdef IEEE_754
 /* NaNs propagated correctly */
@@ -109,36 +110,37 @@
MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, 
k);
 if (k < k_small_max) {
int j;
-   if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */
+   if(n-k < k && n >= 0 && IS_INT(n)) k = n-k; /* <- Symmetry */
if (k <  0) return 0.;
if (k == 0) return 1.;
/* else: k >= 1 */
r = n;
for(j = 2; j <= k; j++)
-   r *= (n-j+1)/j;
-   return R_IS_INT(n) ? floor(r + 0.5) : r;
+   r *= (n-(j-1))/j;
+   return IS_INT(n) ? floor(r + 0.5) : r;
/* might have got rounding errors */
 }
 /* else: k >= k_small_max */
 if (n < 0) {
-   r = choose(-n+ k-1, k);
-   if (ODD(k)) r = -r;
+   r = n / k * choose(k - 1.0 - n, k - 1.0);
+   if (ODD(k - 1.0)) r = -r;
return r;
 }
-else if (R_IS_INT(n)) {
+else if (IS_INT(n)) {
if(n < k) return 0.;
if(n - k < k_small_max) return choose(n, n-k); /* <- Symmetry */
return floor(exp(lfastchoose(n, k)) + 0.5);
 }
 /* else non-integer n >= 0 : */
-if (n < k-1) {
-   int s_choose;
-   r = lfastchoose2(n, k, /* -> */ &s_choose);
-   return s_choose * exp(r);
+l = floor(n + 0.5);
+if (l <= k-1) {
+   aux = lfastchoose(n, l) + lfastchoose(k - 1.0 - n, k - l - 1.0) - 
lfastchoose(k, l);
+   return exp(aux) * (n - l) / (k - l) * (ODD(k - l) ? 1.0 : - 1.0);
 }
 return exp(lfastchoose(n, k));
 }
 
 #undef ODD
+#undef IS_INT
 #undef R_IS_INT
 #undef k_small_max
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] seq.int broken (seq as well) (PR#14169)

2009-12-31 Thread Petr Savicky
On Mon, Dec 28, 2009 at 08:50:13PM +0100, oehl_l...@gmx.de wrote:
[...]
> # fine as expected from help page: 
> # "from+by, ..., up to the sequence value less than or equal to to"
> # thus 1+10=11 is not in
> > seq.int(1L, 10L, by=10L)
> [1] 1
> 
> # of course 1+1e7 should also not be in
> # but is: wrong
> > seq.int(1L, 1e7L, by=1e7L)
> [1] 1e+00 1e+07
> 
> # since we use seq.int AND are within integer range, rounding should not be an
> issue

In my opinion, this is a documented behavior. The Details section of the help
page says

 Note that the computed final value
 can go just beyond 'to' to allow for rounding error, but (as from
 R 2.9.0) is truncated to 'to'.

Since "by" is 1e7, going by 1 more than 'to' is "just beyond 'to'".

What can be a bit misleading is the following difference between the type
of seq() and seq.int(), which is only partially documented.

  x <- seq.int(from=1L, to=10000000L, by=10000000L); typeof(x); x
  # [1] "double"
  # [1] 1e+00 1e+07

  x <- seq(from=1L, to=10000000L, by=10000000L); typeof(x); x
  # [1] "integer"
  # [1]        1 10000000

The Value section of the help page says:

 Currently, 'seq.int' and the default method of 'seq' return a
 result of type '"integer"' (if it is representable as that and) if
 'from' is (numerically equal to an) integer and, e.g., only 'to'
 is specified, or also if only 'length' or only 'along.with' is
 specified.  *Note:* this may change in the future and programmers
 should not rely on it.

This suggests that we should get "double" in both cases, since all three
arguments "from", "to", and "by" are specified. I do not know whether
having an "integer" result from seq() in the above case is intended or not.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] missing R-devel/po

2010-01-06 Thread Petr Savicky
When I unpack R-devel_2010-01-05.tar.bz2 and run ./configure on
two Linux machines, I get the error message

  configure: creating ./config.status
  config.status: creating Makeconf
  config.status: creating Makefile
  config.status: creating doc/Makefile
  config.status: creating doc/html/Makefile
  config.status: creating doc/manual/Makefile
  config.status: creating etc/Makefile
  config.status: creating etc/Makeconf
  config.status: creating etc/Renviron
  config.status: creating etc/ldpaths
  config.status: creating m4/Makefile
  config.status: error: cannot find input file: po/Makefile.in.in

For R-devel_2010-01-04.tar.bz2, ./configure runs fine.

The reason could be that R-devel_2010-01-05.tar.bz2 does not
contain the directory "R-devel/po". This directory is present in
R-devel_2010-01-04.tar.bz2 and contains the file "R-devel/po/Makefile.in.in".

I see the directory "R-devel/po" at https://svn.r-project.org/R/trunk/,
so the problem could be just temporary.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] choose(n,k) when n is almost integer

2010-02-02 Thread Petr Savicky
I would like to add some more information concerning the patch C
to the function choose() proposed in the email
  https://stat.ethz.ch/pipermail/r-devel/2009-December/056177.html

The patch uses transformations of choose(n, k), which are described in
  http://www.cs.cas.cz/~savicky/R-devel/formulas_choose.pdf

The accuracy of the modified function choose(n, k) may be verified
on randomly generated examples using the Rmpfr package
  http://cran.at.r-project.org/web/packages/Rmpfr/index.html
and the script
  http://www.cs.cas.cz/~savicky/R-devel/test_choose_2.R

The output I obtained is 

  > source("test_choose_2.R")
  k <= 9  max rel err = 9.41734e-16 
  k <= 19  max rel err = 2.084412e-15 
  k <= 29  max rel err = 3.170754e-15 
  k <= 39  max rel err = 4.99284e-14 
  k <= 49  max rel err = 5.927749e-14 
  k <= 59  max rel err = 6.526895e-14 
  k <= 69  max rel err = 6.526895e-14 
  k <= 79  max rel err = 8.783232e-14 
  k <= 89  max rel err = 1.051950e-13 
  k <= 99  max rel err = 1.051950e-13 
  k <= 109  max rel err = 1.072878e-13 
  k <= 119  max rel err = 1.072878e-13 
  k <= 129  max rel err = 1.179829e-13 
  k <= 139  max rel err = 1.247080e-13 
  k <= 149  max rel err = 1.247080e-13 
  k <= 159  max rel err = 1.255064e-13 
  k <= 169  max rel err = 1.255064e-13 
  k <= 179  max rel err = 1.267402e-13 
  k <= 189  max rel err = 1.311689e-13 
  k <= 199  max rel err = 1.573155e-13 
  k <= 209  max rel err = 1.844756e-13 

Patch C also passes make check-all in the current development
version 2.11.0 (2010-02-01).

I appreciate comments.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] choose(n,k) when n is almost integer

2010-02-03 Thread Petr Savicky
On Tue, Feb 02, 2010 at 01:37:46PM +0100, Petr Savicky wrote:
> I would like to add some more information concerning the patch C
> to the function choose() proposed in the email
>   https://stat.ethz.ch/pipermail/r-devel/2009-December/056177.html
> 
> The patch uses transformations of choose(n, k), which are described in
>   http://www.cs.cas.cz/~savicky/R-devel/formulas_choose.pdf
> 
> The accuracy of the modified function choose(n, k) may be verified
> on randomly generated examples using Rmpfr package
>   http://cran.at.r-project.org/web/packages/Rmpfr/index.html
> and the script
>   http://www.cs.cas.cz/~savicky/R-devel/test_choose_2.R

Let me add an explanation of the script test_choose_2.R.

The script generates a vector of random real numbers n which come from
a continuous distribution but concentrate near integer values. The
original implementation of choose(m, k) considers m an integer if
abs(m - round(m)) <= 1e-7. The vector n is generated so that the
probability of abs(n[i] - round(n[i])) <= 1e-7 is approximately 0.1.
The distribution of n[i] - round(n[i]) is symmetric around 0, so we
get n[i] which are close to an integer both from below and from above.
On the other hand, the probability of abs(n[i] - round(n[i])) >= 0.3 is
approximately 0.1083404, so there are also numbers n[i] which are not
close to any integer.

The script calculates choose(n, k) for k in 0:209 (an ad hoc upper bound)
1. using the modified choose(n, k) from patch C
2. using the expression n(n-1)...(n-k+1)/(1 2 ... k) with Rmpfr numbers
   of precision 100 bits.
The relative difference of these two results is computed and its maximum
over all n[i] and k from 0 to a given bound is reported. The bounds on k are
chosen to be the numbers whose last digit is 9, since the algorithm in
choose(n, k) is different for k <= 29 and k >= 30.

A single operation with Rmpfr numbers of precision 100 bits has a relative
rounding error of at most a factor (1 + 2^-100). Hence, an upper bound
on the total relative error of n(n-1)...(n-k+1)/(1 2 ... k) is
(1 + 2^-100)^(2*209) \approx (1 + 2 * 209 * 2^-100) \approx 1 + 3.297e-28.
This is a negligible error compared to the errors reported by test_choose_2.R,
so the reported errors are the errors of the patched choose(n, k).
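
The reference value for a single pair (n, k) can be reproduced along these
lines (a sketch assuming the Rmpfr package, with 100-bit precision as in
the script):

  library(Rmpfr)
  n <- mpfr(4.9999999, precBits=100)
  k <- 5
  ref <- prod(n - mpfr(0:(k-1), precBits=100)) /
         factorial(mpfr(k, precBits=100))
  asNumeric(ref)  # high-precision value of choose(n, k)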

The errors reported by test_choose_2.R with patched choose(n,k) are in
a previous email.

Running test_choose_2.R with unpatched R version 2.11.0 (2010-02-01 r51089)
produces larger errors.

  > source("test_choose_2.R")
  k <= 9  max rel err = Inf 
  k <= 19  max rel err = Inf 

  > source("test_choose_2.R")
  k <= 9  max rel err = 0.111 
  k <= 19  max rel err = Inf 

  > source("test_choose_2.R")
  k <= 9  max rel err = Inf 
  k <= 19  max rel err = Inf 

  > source("test_choose_2.R")
  k <= 9  max rel err = 8.383718e-08 
  k <= 19  max rel err = 1.226306e-07 
  k <= 29  max rel err = 1.469050e-07 
  Error: segfault from C stack overflow

The Inf relative errors occur in cases where choose(n, k) calculates 0,
but the correct result is not 0.

The stack overflow error is sometimes generated due to an infinite sequence
of transformations 
  choose(n, k) -> choose(n, n-k) -> choose(n, round(n-k))
which occur if k = 30 and n = 60 - eps. The reason for the transformation
  choose(n, k) -> choose(n, n-k)
is that
  k >= k_small_max = 30
  n is treated as an integer in R_IS_INT(n)
  n-k < k_small_max
So, choose(n, n-k) is called. There, we determine that n-k is almost an integer 
and
since n-k is the second argument of choose(n,k), it is explicitly rounded to an 
integer.
Since n = 2*k - eps, we have round(n-k) = round(k - eps) = k. The result is that
we again call choose(n, k) and this repeats until the stack overflow.

For example
  > choose(60 - 1e-9, 30)
  Error: segfault from C stack overflow

Besides patch C, which corrects this stack overflow, the previous
patches A and B from
  https://stat.ethz.ch/pipermail/r-devel/2009-December/056154.html
also correct this, but they have lower accuracy.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Associative array?

2010-03-11 Thread Petr Savicky
On Thu, Mar 11, 2010 at 06:09:09PM -0600, Ben wrote:
[...]
> > I don't quite understand - characters are (after raw vectors) the
> > most expressive data type, so I'm not quite sure why that would be a
> > limitation .. You can cast anything (but raw vector with nulls) into
> > to a character.
> 
> It's no big problem, it's just that if the solution is to convert to
> character type, then there are some implementation details to worry
> about.  For instance, I assume that as.character(x) is a reversible
> 1-1 mapping if x is an integer (and not NA or NULL, etc).  But
> apparently that isn't exactly true for floats, and it would get more
> complicated for other data types.

Let me add a comment on the question of an invertible conversion of double
values to character type. The function as.character() intentionally does
not keep the exact value and performs some rounding. However, a conversion
of a double x to character type which preserves the number exactly may
be obtained, for example, using
  formatC(x, digits=17, width=-1)
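
A quick round-trip check (sketch):

  x <- exp(1)
  key <- formatC(x, digits=17, width=-1)
  as.double(key) == x              # TRUE: the string recovers x exactly
  as.double(as.character(x)) == x  # FALSE here: 15 digits are not enough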

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] .Call and .C arguments

2010-03-29 Thread Petr Savicky
On Mon, Mar 29, 2010 at 01:56:14PM +0200, Roger Bergande wrote:
...
> The passed values are not the same in C.  I 'm calling a function in C
> with the argument as.double(1204.245) but in the debug mode in C the
> value has changed to 1204.2449.

I do not see this value printed in the examples below. So, I think
you assume that 1204.245 was changed to 1204.2449 since
rounding it to 2 digits produces 1204.24. The problem is indeed a rounding
problem, but not one introduced when the number is passed to C. The number
1204.245 cannot be represented in double precision exactly. So, already at
the R level we have, actually,
  formatC(1204.245, digits=20) # [1] "1204.2449999999998909"

See
  http://rwiki.sciviews.org/doku.php?id=misc:r_accuracy
or FAQ 7.31 for more examples.
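
The same rounding is visible without any C code (a one-line sketch):

  round(1204.245, 2)  # [1] 1204.24: the stored value is slightly below 1204.245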

Petr Savicky.

> 
> Is there a way to pass the arguments differently?
> 
> I'm using Windows and Visual Studio C++ 2005 Express Edition and
> R-2.10.1.
> 
> Please see the two simple examples to understand the issue.
> 
> # C call from R
> .C("myroundC", as.double(1204.245))
> 
> // C Code
> void myroundC(double *Amount){
>     *Amount = Rf_fround(*Amount, 2);
> }
> 
> # Return value in R
> [[1]]
> [1] 1204.24
> 
> # C call from R
> .Call("myroundCall", as.double(1204.245))
> 
> // C Code
> SEXP myroundCall(SEXP a){
>     double *ap = REAL(a), *ansp;
>     SEXP ans;
>     PROTECT(ans = allocVector(REALSXP, 1));
>     ansp = REAL(ans);
>     *ansp = Rf_fround(*ap, 2);
>     UNPROTECT(1);
>     return(ans);
> }
> 
> # Return value in R
> [1] 1204.24
> 
> # expected value 1204.25
> 
> Thanks a lot for your help.
> 
> Best regards
> Roger Bergande
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] pure R code package for Windows

2007-04-28 Thread Petr Savicky
Dear R developers,

I am using R under Linux, but I would like to share
an extension package with some Windows users. The package
contains only data and .R scripts. There is no src directory.

So, I think I do not need a Windows machine with a C compiler,
"make", "sh" and "perl". If I am wrong, please tell me.

I tried the following approaches (after verifying the package
using R CMD check; one warning, concerning missing documentation
for some of the R functions).

1. Installing the source package (with no C, C++ or F files)
   directly on Windows XP. Installation complains that
   the "make" command is missing. OK, it is a source package.

2. Building a binary package using R CMD build --binary --use-zip
   on Linux and trying to install it under Windows XP. Installation
   complains that the "make" command is missing. (Why, if it is
   a binary package?)

3. Building the package from source on Windows XP using
   R CMD build. Installation complains that "sh"
   is missing. (Why is it looking for "sh", if it is a properly
   working R installation under Windows?)

4. Installing the package under Linux, zipping the directory
   library/ and unzipping it in the library directory
   on the Windows machine. This works. The package behaves
   correctly. However, I do not think that this is a suggested
   method.

Could you help me?

Thank you in advance. Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] pure R code package for Windows

2007-04-30 Thread Petr Savicky
> You can also use the Windows binary build and check service at 
> http://129.217.207.166 after reading its disclaimer...

Thank you very much. This worked, including generating the .chm file. The
resulting package is installable under the R GUI on Windows XP, which is
sufficient for me.

Let me point out that I did not succeed in installing any package
(including the standard ones like rpart) on Windows using R CMD INSTALL,
although I have Rtools and perl installed, running and in the path.

The error message for the Windows binary package rpart from CRAN is:
  Error in Rdinfo(RdFiles[i]) : missing/empty \name field in 
'E:/R/rpart/man/rpart.Rd.gz
  Rd files must have a non-empty \name.

> >4. Install the package under Linux and zip the directory
> >  library/ and unzip it in the library directory
> >  on Windows machine. This works. The package behaves
> >  correctly. However, I do not think that this is a suggested
> >  method.
> 
> Well, it _is_ a suggested method: see README.packages.

This also produced a package installable by the standard procedure under
the Windows GUI. Thank you for pointing this out. (Again, R CMD INSTALL does
not work for the resulting package.)

All the best, Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-07 Thread Petr Savicky
Dear R developers,

I suggest modifying the behaviour of the "grep" function with the fixed=TRUE option.

Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE,
if set by the user).

I suggest keeping ignore.case as set by the user even if fixed=TRUE. Since
the default of ignore.case is FALSE, this would not change the behaviour
of grep if the user does not set ignore.case explicitly.

In my opinion, fixed=TRUE is most useful for suppressing meta-character
expansion. On the other hand, for a simple word search, ignoring
case is sometimes useful.

If for some reason it is better to keep the current behavior of grep, then I
suggest extending the documentation as follows:

ORIGINAL:
   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
  is.  Overrides all conflicting arguments.

SUGGESTED:
   fixed: logical.  If 'TRUE', 'pattern' is a string to be matched as
  is.  Overrides all conflicting arguments including ignore.case.

All the best, Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-11 Thread Petr Savicky
On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote:
> I suggest you collaborate with the person who replied that he thought this 
> was a good idea to supply patches against the R-devel sources for 
> scrutiny.

A possible solution is to use strncasecmp instead of strncmp
in function fgrep_one in R-devel/src/main/character.c.

Corresponding modification of character.c is at
  http://www.cs.cas.cz/~savicky/ignore_case/character.c
and diff file w.r.t. the original character.c (downloaded today) is at
  http://www.cs.cas.cz/~savicky/ignore_case/diff.txt

This seems to work in my installation of R-devel:

  > x <- c("D.G cat", "d.g cat", "dog cat")
  > z <- "d.g"
  > grep(z, x, ignore.case = F, fixed = T)
  [1] 2
  > grep(z, x, ignore.case = T, fixed = T)  # this is the new behavior
  [1] 1 2
  > grep(z, x, ignore.case = T, fixed = F)
  [1] 1 2 3
  >

Since fgrep_one is used many times in character.c, adding igcase_opt as
an additional argument would imply extensive changes to the file.
So, I introduced a new function fgrep_one_igcase called only once in
the file. Another solution is possible.

I do not understand the handling of multibyte chars well, so I did not test
the function with real multibyte chars, although the code for
this option is used.

The ignore.case option is not meaningful in gsub. It could be meaningful
in regexpr; however, this function does not allow an ignore.case option,
so I made no changes to it.

All the best, Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Problem with rnorm ?

2007-05-12 Thread Petr Savicky
> #include <Rmath.h>
> #include <stdio.h>
> 
> int main(void){
>   double x[3];
>   for(int i=0;i<3;i++)  x[i]=rnorm(0,1);
>   printf("%lf,%lf,%lf",x[0],x[1],x[2]);
>   return 0;
> }
> 
> output :   -8.773321,-8.773321,-8.773321

You probably have to call GetRNGstate() before calling rnorm and
PutRNGstate() afterwards. At least for extension packages this is
necessary; see section 6.3 in R-exts.pdf.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE

2007-05-17 Thread Petr Savicky
> strncasecmp is not standard C (not even C99), but R does have a substitute 
> for it.  Unfortunately strncasecmp is not usable with multibyte charsets: 
> Linux systems have wcsncasecmp but that is not portable.  In these days of 
> widespread use of UTF-8 that is a blocking issue, I am afraid.

What could help are the functions mbrtowc and towctrans and simple
long integer comparison. Are the functions mbrtowc and towctrans
available under Windows? mbrtowc seems to be available as Rmbrtowc
in src/gnuwin32/extra.c.

I did not find towctrans defined in R sources, but it is in gnuwin32/Rdll.hide
and used in do_tolower. Does this mean that tolower is not usable
with utf-8 under Windows?

> In the case of grep I think all you need is
> 
> grep(tolower(pattern), tolower(x), fixed = TRUE)
> 
> and similarly for regexpr.

Yes, this is correct, but it has disadvantages. It needs more
space and, if value=TRUE, we would have to do something like
   x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)]
This is hard to implement in src/library/base/R/grep.R,
where the call to .Internal(grep(pattern,...)) is the last command
and I think this should be preserved.
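
At the R level, a small wrapper can still provide value=TRUE on top of the
tolower() trick; a sketch:

  igrep <- function(pattern, x, value = FALSE) {
      i <- grep(tolower(pattern), tolower(x), fixed = TRUE)
      if (value) x[i] else i
  }
  igrep("d.g", c("D.G cat", "d.g cat", "dog cat"), value = TRUE)
  # [1] "D.G cat" "d.g cat"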

> >Ignore case option is not meaningfull in gsub.
> 
> sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE)
> 
> is different from 'ignore.case=FALSE', and I see the meaning as clear.
> So what did you mean?  (Unfortunately the tolower trick does not work for 
> [g]sub.)

The meaning of ignore.case in [g]sub is problematic due to the following.
  sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE)
produces
  [1] "xyzD" "xyzd"
but the user may in fact need the following
  [1] "XYZD" "xyzd"

It is correct that "xyzD" "xyzd" is produced, but the user
should be aware of the fact that several substitutions like 
  x <- sub("abc", "xyz", c("ABCD", "abcd"))   # ignore.case=FALSE
  sub("ABC", "XYZ", x)  # ignore.case=FALSE
may be more useful.

I have another question concerning the speed of grep. I expected that
the fgrep_one function is slower than calling a library routine
for regular expressions. In particular, if the pattern has a lot of
long partial matches in the target string, I expected that it may be much
slower. A short example is
  y <- "aaaaaaaaab"
  x <- "aaaaaaaaaaaaaaaaaaab"
  grep(y, x)
which requires 110 comparisons (10 comparisons for each of 11 possible
beginnings of y in x). In general, the complexity in the worst case is
O(m*n), where m, n are the lengths of y, x respectively. I would expect that
the library function for matching regular expressions needs
time O(m+n) and so will be faster. However, the result obtained
on a larger example is

  > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "")
  > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "")
  > y <- paste(c(rep("a", times = 1), x2), collapse = "")
  > z <- rep(y, times = 100)

  > system.time(i <- grep(x1, z, fixed = T))
  [1] 1.970 0.000 1.985 0.000 0.000

  > system.time(i <- grep(x1, z, fixed = F))  # reg. expr. surprisingly slow (*)
  [1] 40.374  0.003 40.381  0.000  0.000

  > system.time(i <- grep(x2, z, fixed = T))
  [1] 0.113 0.000 0.113 0.000 0.000

  > system.time(i <- grep(x2, z, fixed = F))  # reg. expr. faster than fgrep_one
  [1] 0.019 0.000 0.019 0.000 0.000

Do you have an explanation of these results, in particular (*)?

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point values

2007-05-23 Thread Petr Savicky
> > Well, okay, now what about dump, write.table, save, etc?
> 
> save() uses the required precision. For exp(1) it stores 
> "2.718281828459045" and you will see that
> 
> exp(1) == 2.718281828459045  is TRUE
> 
save(..., ascii=TRUE) uses 16-digit precision, but this seems
not to be sufficient. In R-2.5.0, I obtained:

  > otab <- data.frame(val=1+1/(1:1000));
  > ntab <- otab;
  > save(otab,file="save.txt",ascii=TRUE);
  > rm(otab);
  > load("save.txt");
  
  > cbind(table(otab[,1]-ntab[,1]))
[,1]
  -4.44089209850063e-16  159
  -2.22044604925031e-16  220
  0  240
  2.22044604925031e-16   213
  4.44089209850063e-16   168

So, most of the numbers are changed. The changes are within the
last two bits. For numbers near 1, the last bit represents the difference
2^(-52) = 2.220446e-16.
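
With 17 significant digits the effect disappears; a small sketch:

  x <- 1 + 1/3
  as.double(sprintf("%.16g", x)) == x  # FALSE: 16 digits lose the last bit
  as.double(sprintf("%.17g", x)) == x  # TRUE:  17 digits round-trip exactly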

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point values

2007-05-23 Thread Petr Savicky
This is an addition to my previous message.

16 digits are indeed not sufficient to represent a double
value exactly. The next table was calculated in the bc calculator.
It shows that if we restrict (or round) double values
to 16 decimal digits, then from 4 to 5 consecutive distinct
double values get the same 16-digit decimal code.

1.234567890123456  a 16-digit number for comparison
1.0000000000000000000000000000 1 + 0*2^(-52)
1.0000000000000002220446049250 1 + 1*2^(-52)
1.0000000000000004440892098500 1 + 2*2^(-52)
1.0000000000000006661338147750 1 + 3*2^(-52)
1.0000000000000008881784197001 1 + 4*2^(-52)
1.0000000000000011102230246251 1 + 5*2^(-52)
1.0000000000000013322676295501 1 + 6*2^(-52)
1.0000000000000015543122344752 1 + 7*2^(-52)
1.0000000000000017763568394002 1 + 8*2^(-52)
1.0000000000000019984014443252 1 + 9*2^(-52)
1.0000000000000022204460492503 1 + 10*2^(-52)
1.0000000000000024424906541753 1 + 11*2^(-52)
1.0000000000000026645352591003 1 + 12*2^(-52)
1.0000000000000028865798640254 1 + 13*2^(-52)
1.0000000000000031086244689504 1 + 14*2^(-52)
1.0000000000000033306690738754 1 + 15*2^(-52)
1.0000000000000035527136788005 1 + 16*2^(-52)
1.0000000000000037747582837255 1 + 17*2^(-52)
1.0000000000000039968028886505 1 + 18*2^(-52)
1.0000000000000042188474935755 1 + 19*2^(-52)
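
The collision can also be observed directly in R (a sketch; the exact
strings depend on the platform's printf):

  x1 <- 1 + 3*2^(-52)
  x2 <- 1 + 4*2^(-52)
  x1 == x2                               # FALSE, distinct doubles
  formatC(x1, digits = 16, width = -1)   # "1.000000000000001"
  formatC(x2, digits = 16, width = -1)   # the same 16-digit string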

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] documented/undocumented behavior of as.double(formatC(x, digits=17))

2007-05-25 Thread Petr Savicky
Some days ago, there was a discussion about the command
   formatC(exp(1),digits=100,width=-1)

Converting a double value to a string from which the double may be
reconstructed exactly may be useful. So, I did some experimentation
with it in my Linux installation of R-2.5.0.

I generated a vector x of a large number of random doubles (random sign,
random mantissa with 53 significant bits and binary exponent from -1074:1023)
and did the following
 y <- formatC(x,digits=17,width=-1)
 z <- as.double(y)
 all(x == z)  # TRUE
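
The exact generator of x is not shown here; one possible sketch (with
a narrower exponent range than -1074:1023) is

  set.seed(1)
  x <- runif(1e5, -1, 1) * 2^sample(-300:300, 1e5, replace = TRUE)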

On Wed, May 23, 2007 at 06:32:36PM +0100, Prof Brian Ripley wrote:
> However, the C99 standard does make clear that [sf]printf is not required
> (even as 'recommended practice') to be accurate to more than *_DIG places,
> which as ?as.character has pointed out is 15 on the machines R runs on.
> 
> It really is the case that writing out a number to > 15 significant digits
> and reading it back in again is not required to give exactly the same
> IEC60559 (sic) number, and furthermore there are R platforms for which
> this does not happen.

I think that this implies that preserving the exact double value in formatC
does not follow from the C99 standard. Is there some other (e.g. platform-specific)
documentation that implies this, or does it have to be used as an undocumented
feature? (Well, I do know that the R sources or the gcc sources are such
documentation, but I am trying to find something else.)

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Determinant function (PR#9715)

2007-06-04 Thread Petr Savicky
> The function ''det'' works improperly for a singular matrix and returns a
> non-zero value even if ''solve'' reports singularity. The matrix is very 
> simple
> as shown below.
>
> A <- diag(rep(c(64,8), c(8,8)))
> A[9:16,1] <- 8
> A[1,9:16] <- 8
> 
> det(A)
> #[1] -196608
> solve(A)
> #Error in solve.default(A) : system is computationally singular: reciprocal
> condition number = 2.31296e-18

The "det" function cannot work properly in the limited precision
of floating point numbers. May be, "det" could also do a test
of computational singularity and refuse to compute the value
similarly as "solve" does.

The eigenvalues of your matrix are
  > eigen(A)$values
   [1]  7.20e+01  6.40e+01  6.40e+01  6.40e+01  6.40e+01
   [6]  6.40e+01  6.40e+01  6.40e+01  8.00e+00  8.00e+00
  [11]  8.00e+00  8.00e+00  8.00e+00  8.00e+00  8.00e+00
  [16] -2.023228e-15
Due to rounding, the last value is not exactly zero. The determinant is
the product of the eigenvalues, so we get something large.
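
This can be checked directly (a sketch; the computed values are
approximate):

  ev <- eigen(A, only.values = TRUE)$values
  prod(ev)       # large, since the "zero" eigenvalue is only about -2e-15
  prod(ev[-16])  # the product of the 15 clearly nonzero eigenvalues is huge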

The problem may also be seen as follows:
  > det(A/8)
  [1] -6.98492e-10
This is close to zero as it should be, however, det(A) = det(A/8)*8^16,
and this is what we really get:
  > det(A/8)*8^16
  [1] -196608
  > det(A)
  [1] -196608

There are even matrices closer to a diagonal matrix than A with
a similar problem:
  > B <- 20*diag(16); B[1:2,1:2] <- c(25,35,35,49); det(B);
  [1] 44565.24

All the best, Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point values

2007-06-26 Thread Petr Savicky
I would like to reply to the following email from May in the thread
  [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point 
values

On Wed, May 23, 2007 at 06:32:36PM +0100, Prof Brian Ripley wrote:
> I think this is a bug in the MacOS X runtime.  I've checked the C99
> standard, and can see no limits on the precision that you should be able
> to specify to printf.
> 
> Not that some protection against such OSes would come amiss.
> 
> However, the C99 standard does make clear that [sf]printf is not required 
> (even as 'recommended practice') to be accurate to more than *_DIG places, 
> which as ?as.character has pointed out is 15 on the machines R runs on.

Let me use
  http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf
In Section 7.19.6, Recommended practice, paragraph 13, it is specified
that rounding to DECIMAL_DIG should be done correctly.

In Annex F.5 Binary-decimal conversion, it is explicitly stated that
conversion from the widest supported IEC 60559 format to decimal
with DECIMAL_DIG digits and back is the identity function. In footnote 305,
it is specified that if double (53 bit precision) is the widest format
supported, then DECIMAL_DIG >= 17.

R uses double precision as default, so I think DECIMAL_DIG = 17 is a
reasonable value. This is the minimum number of decimal digits, for which
the conversion
   double -> string -> double
is value preserving, if rounding is correct.

DBL_DIG = 15 is another constant, which is the maximum number of digits,
for which the conversion
   string -> double -> string
may be done value preserving.
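
Both round trips may be illustrated in R (a sketch; TRUE is expected
only with correctly rounded conversions):

  x <- exp(1)
  x == as.double(formatC(x, digits = 17, width = -1))  # double -> string -> double
  s <- "0.123456789012345"                             # 15 digits
  s == formatC(as.double(s), digits = 15, width = -1)  # string -> double -> string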

> It really is the case that writing out a number to > 15 significant digits 
> and reading it back in again is not required to give exactly the same 
> IEC60559 (sic) number, and furthermore there are R platforms for which 
> this does not happen.  What Mr Weinberg claimed is 'now impossible' never 
> was possible in general (and he seems to be berating the R developers for 
> not striving to do better than the C standard requires of OSes).  In fact, 
> I believe this to be impossible unless you have access to extended 
> precsion arithmetic, as otherwise printf/scanf have to use the same 
> arithmetic as the computations.
> 
> This is why R supports transfer of floating-point numbers via readBin and 
> friends, and uses a binary format itself for save() (by default).
> 
> I should also say that any algorithm that depends on that level of details 
> in the numbers will depend on the compiler used and optimization level and 
> so on.  Don't expect repeatability to that level even with binary data 
> unless you (are allowed to) never apply bugfixes to your system.

Repeatability can hardly be achieved among different R installations, due
to the reasons you explain. However, repeatability of a computation
within the same installation may be achieved and may sometimes be useful.
For example it may be useful for debugging, similarly as set.seed. Saving
the data in binary is a possible approach for this, however, decimal numbers
in a text may serve this purpose equally well, if they are precise enough.

I would like to ask a question concerning the function scientific
in src/main/format.c, which is called from formatReal in the same file.
Function scientific, besides other things, determines
whether a smaller number of significant digits than R_print.digits
is sufficient to get a decimal number with relative error
at most max(10^-R_print.digits, 2*DBL_EPSILON), where DBL_EPSILON = 2^-52.

For simplicity, let me consider only the case, when
R_print.digits = DBL_DIG (which is 15). Then the bound for the relative
error is 1e-15, since 1e-15 > 2*2^-52.

There are two error bounds in consideration:
(1) the absolute error of the rounded mantissa (significand) is at most 5e-15,
(2) the relative error of the rounded number is at most 1e-15.

It is natural to consider also (1), since R_print.digits = 15 and 5e-15 is
the maximum absolute error of the mantissa for correct rounding
to 15 significant digits.

If the mantissa is in [1,5], then (2) => (1).
Hence, for these mantissas, function scientific should suggest a smaller
number of digits than 15 only if the result is as accurate as rounding
to 15 digits.

On the other hand, if the mantissa is in (5,10), then (2) does not imply (1).
Hence, here we sometimes do not get the full precision 15-digit numbers.
Is this behaviour intentional?

This has, for example, the following consequence: if a 15-digit number
is read into R using read.table and then written back to a text file using
write.table, we need not get the same result.
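
A sketch of such a round trip (the file name is only illustrative):

  x <- 7.00000000000005   # 15 significant digits, mantissa in (5,10)
  write.table(data.frame(x), "roundtrip.txt", row.names = FALSE)
  y <- read.table("roundtrip.txt", header = TRUE)$x
  print(x, digits = 15); print(y, digits = 15)   # may differ on some platforms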

For testing, I use the following platforms:

  FLAGS means CFLAGS,FFLAGS,CXXFLAGS,FCFLAGS
  on Linuxes, R version is 2.5.1 RC (2007-06-23 r42041),
  on Windows, R version is binary distribution of R-2.5.1 (RC).

  [1] AMD Athlon(tm) 64 Processor 3500+, SUSE 10.1,
  gcc 4.1.0, FLAGS = -g -O2 -march=pentium4 -mfpmath=sse

  [2] AMD Athlon(tm) 64 Processor 3500+, SUSE 10.1,

[Rd] make check for 2.5.1 RC fails on mbcsToSbcs in graphics

2007-06-27 Thread Petr Savicky
configure and make run OK, but make check failed
for R version 2.5.1 RC (2007-06-26 r42068) on graphics with error:

  > ## The following two examples use latin1 characters: these may not
  > ## appear correctly (or be omitted entirely).
  > plot(1:10, 1:10, main = "text(...) examples\n~~~~~~~~~~~~~~",
  +  sub = "R is GNU ©, but not ® ...")
  Error in title(...) : conversion failure in 'mbcsToSbcs'
  Execution halted

The whole tests/Examples/graphics-Ex.Rout.fail is at
  http://www.cs.cas.cz/~savicky/R-devel/graphics-Ex.Rout.fail

The end of make check report is:

  make[5]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/src/library'
  running code in 'grDevices-Ex.R' ... OK
  collecting examples for package 'graphics' ...
  make[5]: Entering directory `/home/petr/R/DEVEL/R-rc-2007-06-26/src/library'
   >>> Building/Updating help pages for package 'graphics'
   Formats: text html latex example 
  make[5]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/src/library'
  running code in 'graphics-Ex.R' ...make[4]: *** [graphics-Ex.Rout] Error 1
  make[4]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests/Examples'
  make[3]: *** [test-Examples-Base] Error 2
  make[3]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests/Examples'
  make[2]: *** [test-Examples] Error 2
  make[2]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests'
  make[1]: *** [test-all-basics] Error 1
  make[1]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests'
  make: *** [check] Error 2

Otherwise, the installation works.

  R.Version

  platform   i686-pc-linux-gnu 
  arch   i686  
  os linux-gnu 
  system i686, linux-gnu   
  status RC
  major  2 
  minor  5.1   
  year   2007  
  month  06
  day26
  svn.rev42068 
  language   R 
  version.string R version 2.5.1 RC (2007-06-26 r42068)

System SUSE LINUX 10.0, gcc version 4.0.2 20050901 (prerelease) (SUSE Linux),
default options,

  $LANG = cs_CZ.UTF-8
  $LC_MESSAGES = POSIX
  $LC_TIME = POSIX
  no other LC_* set

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] inaccuracy in qbinom with partial argument matching

2007-06-29 Thread Petr Savicky
> > ## partial argument matching:
> > qbinom(p0 , s = 3,p= 0.25)  ## 1 ???
> > qbinom(p0-0.05, s = 3,p= 0.25)  ## 1 ???
> > qbinom(p0-0.06, s = 3,p= 0.25)  ## 0 o.K.
> >
> > Unfortunately I have no I idea how to fix this.
> 
> You use a call that specifies your intentions accurately.  This is not 
> 'partial argument matching': 'p' is an exact match to the first argument 
> of
> 
> > args(qbinom)
> function (p, size, prob, lower.tail = TRUE, log.p = FALSE)
> 
> and that is how argument matching in R is documented to work.
> 
> The 'inaccuracy' is in the diagnosis: please see the FAQ.
 
Let me add an explanation of why
  qbinom(p0 , s = 3,p= 0.25)
does not produce an error message about a missing "prob" argument:
since "p" is matched exactly and "s" partially matches "size", the
positional argument p0 is used for the third argument "prob", not for the first.
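
The same matching rules can be seen with a toy function (an
illustration only, not qbinom itself):

  f <- function(p, size, prob) c(p = p, size = size, prob = prob)
  f(0.9, s = 3, p = 0.25)
  #    p size prob
  # 0.25 3.00 0.90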

Although the behavior is logical, it may not be immediately clear.
I do not see this case explicitly in the FAQ or in R-intro.pdf, Section 10.3.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] make check for 2.5.1 RC fails on mbcsToSbcs in graphics

2007-06-30 Thread Petr Savicky
On Wed, Jun 27, 2007 at 04:20:39PM +0100, Prof Brian Ripley wrote:
> But you can't really avoid it if you want to test non-ASCII
> features: for example, you cannot test Latin-1 graphics in a C locale.
> 
> It was intended that the postscript device opened for testing graphics was 
> opened with encoding="ISOLatin1".  That should certainly have helped, at 
> least in a UTF-8 locale, but unfortunately there were a couple of typos in 
> tests/Examples/Makefile.  Fixing that file makes 'make check' work in 
> cs_CZ.utf8 at least on my machine.
> 
> It still remains that (for examples) you cannot successfully run make 
> check in a MBCS locale without iconv, nor will zh_CN.gbk nor zh_TW.big5 
> work.  I've added a note to R-admin to say that you may need to try a 
> suitable locale.

I applied the following patch obtained by comparing R-devel_2007-06-26 and 
R-devel_2007-06-27

--- D070626/R-devel/tests/Examples/Makefile.in  2007-05-10 17:50:54.0 +0200
+++ D070627/R-devel/tests/Examples/Makefile.in  2007-06-27 17:50:09.0 +0200
@@ -82,8 +82,8 @@
@if test -f $@; then mv $@ [EMAIL PROTECTED]; fi
@echo $(ECHO_N) "running code in '$<' ...$(ECHO_C)"
@echo 'tools:::.convert_examples("graphics-Ex.R", 
"graphics-Ex.R-locale", "latin1")' | $(R) --slave > /dev/null 2>&1 || exit 1
-   @sed -e 's/grDevices::postscript("graphics-Ex.ps")/grDevices::postscript("graphics-Ex.ps",encoding="ISOlatin1")/' graphics-Ex.R-locale \
- | $(R) < graphics-Ex.R-locale  > $@ 2>&1 || (mv $@ [EMAIL PROTECTED] && exit 1)
+   @sed -e 's/grDevices::postscript("graphics-Ex.ps")/grDevices::postscript("graphics-Ex.ps",encoding="ISOLatin1")/' graphics-Ex.R-locale \
+ | $(R) > $@ 2>&1 || (mv $@ [EMAIL PROTECTED] && exit 1)
@rm graphics-Ex.R-locale
@echo "$(ECHO_T) OK"
@if test -f [EMAIL PROTECTED]; then \

to my copy of R-2.5.1. Then, make check works with no error in cs_CZ.UTF-8 
locale
also on my machine. Thank you for the correction.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Shapiro Test P Value Incorrect? (PR#9768)

2007-07-06 Thread Petr Savicky
This is not a bug. The algorithm uses different approximations of the
p-value for n = 3 (exact value), 4 <= n <= 11 and n >= 12, as seen in
  src/library/stats/src/swilk.c
below line 202
 /*  Calculate significance level for W */

The W statistic monotonically decreases in the presented example.
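
With the p-values reported below, the jump occurs exactly at the
boundary between the two approximations:

  shapiro.test(1:11)$p.value   # 0.8698419, approximation for 4 <= n <= 11
  shapiro.test(1:12)$p.value   # 0.8757315, approximation for n >= 12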

Petr.

> Full_Name: Jason Polak
> Version: R version 2.5.0 (2007-04-23)
> OS: Xubuntu 7.04
> Submission from: (NULL) (137.122.144.35)
> 
> 
> Dear R group,
> 
> I have noticed a strange anomaly with the shapiro.test() function. 
> Unfortunately
> I do not know how to calculate the shapiro test P values manually so I don't
> know if this is an actual bug.
> 
> So, to produce the results, run the following code:
> 
> pvalues = 0;
> for (i in 1:17)
> {
>   j = 1:(i+3);
>   pvalues[i]=shapiro.test(j)$p;
> }
> 
> plot(pvalues);
> print(pvalues);
> 
> Now I just made the graph to illustrate that the p-values are strictly
> decreasing. To me this makes intuitive sense: we are using the Shapiro test to
> test normality of (1,2,3,4),(1,2,3,4,5), and so on. So the p-value should
> decrease.
> 
> These are the p-values:
>  [1] 0.9718776 0.9671740 0.9605557 0.9492892 0.9331653 0.9135602 0.8923668
>  [8] 0.8698419 0.8757315 0.8371814 0.7964400 0.7545289 0.7123167 0.6704457
> [15] 0.6294307 0.5896464 0.5513749
> 
> However, there is an increase in p-value when you go from (1,..,11) to
> (1,..,12). Is this just a quirk of the Shapiro test, or is there an error in 
> the
> calculation algorithm?
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] sweep sanity checking?

2007-07-12 Thread Petr Savicky
The suggestion sounds reasonable to me. Let me add that sweep is written
to work even if MARGIN includes more than one dimension. To handle these
cases correctly, the test may be replaced e.g. by
 if (check.margin && prod(dims[MARGIN])!=length(STATS)) {
   warning("length(STATS) != prod(dim(x)[MARGIN])")
 } else if (prod(dims[MARGIN]) %% length(STATS)!=0)
   warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")
or even by
 dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
 if (check.margin && any(dims[MARGIN]!=dimstat)) {
   warning("length(STATS) or dim(STAT) do not match dim(x)[MARGIN]")
 } else if (prod(dims[MARGIN]) %% length(STATS)!=0)
   warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")

Petr.

> Just an opinion from an R user: I think it's a sound idea.  I use my own 
> version of sweep with a stricter check: it stops if the vector is not 
> exactly the right length.
> 
> -- Tony Plate
> 
> Ben Bolker wrote:
> > Ben Bolker  zoo.ufl.edu> writes:
> >
> >   
> >>   What would R-core think of the following 'enhanced'
> >> sweep?  
> >> 
> >
> >  (now posted at
> > http://wiki.r-project.org/rwiki/doku.php?id=rdoc:base:sweep
> > )
> >
> > It always warns if dim(x)[MARGIN] is
> >

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] sweep sanity checking?

2007-07-12 Thread Petr Savicky
I am sorry for an incomplete proposal. The stricter check
 if (check.margin && any(dims[MARGIN]!=dimstat)) {
was meant to be
 if (check.margin && (length(dimstat)!=length(MARGIN) || 
any(dims[MARGIN]!=dimstat))) {

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] sweep sanity checking?

2007-07-13 Thread Petr Savicky
I would like to suggest a replacement for the current function
sweep based on the two previous suggestions posted at
  https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html
and
  http://wiki.r-project.org/rwiki/doku.php?id=rdoc:base:sweep
with some extensions.

My suggestion is to use one of the following two variants. They
differ in whether length(STATS) == 1 is allowed without a warning
in the stricter (default) check or not. I do not know which is better.

  sweep1 <- function (x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...)
  {
  FUN <- match.fun(FUN)
  dims <- dim(x)
  dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
  if (length(MARGIN) < length(dimstat)) {
  STATS <- drop(STATS)
  dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
  }
  if (check.margin && length(STATS) > 1 &&
  (length(dimstat)!=length(MARGIN) || any(dims[MARGIN]!=dimstat))) {
  warning("length(STATS) or dim(STAT) do not match dim(x)[MARGIN]")
  } else if (prod(dims[MARGIN]) %% length(STATS)!=0)
  warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")
  perm <- c(MARGIN, (1:length(dims))[-MARGIN])
  FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
  }

  sweep2 <- function (x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...)
  {
  FUN <- match.fun(FUN)
  dims <- dim(x)
  dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
  if (length(MARGIN) < length(dimstat)) {
  STATS <- drop(STATS)
  dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
  }
  if (check.margin &&
  (length(dimstat)!=length(MARGIN) || any(dims[MARGIN]!=dimstat))) {
  warning("length(STATS) or dim(STAT) do not match dim(x)[MARGIN]")
  } else if (prod(dims[MARGIN]) %% length(STATS)!=0)
  warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")
  perm <- c(MARGIN, (1:length(dims))[-MARGIN])
  FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
  }

The functions are tested on the following examples.

  a <- array(1:64,dim=c(4,4,4))
  M <- c(1,3);

  b  sweep1(a,M,b) sweep2(a,M,b) sweep1(a,M,b,c=F) sweep2(a,M,b,c=F)

  a[,2,] - - - - 
  a[1:2,2,]  warning   warning   - - 
  1  - warning   - -
  1:3warning   warning   warning   warning
  1:16   warning   warning   - -

  a <- matrix(1:8,nrow=4,ncol=2);
  M <- 1;

  b  sweep1(a,M,b) sweep2(a,M,b) sweep1(a,M,b,c=F) sweep2(a,M,b,c=F)

  1  - warning   - -
  1:2warning   warning   - -
  1:3warning   warning   warning   warning
  1:4- - - -
  matrix(1:4,nrow=1) # (A)
 - - - -
  matrix(1:4,ncol=1) # (B)
 - - - -

  a <- matrix(1:8,nrow=4,ncol=2);
  M <- 2;

  b  sweep1(a,M,b) sweep2(a,M,b) sweep1(a,M,b,c=F) sweep2(a,M,b,c=F)

  1  - warning   - -
  1:2- - - -
  1:3warning   warning   warning   warning
  1:4warning   warning   warning   warning
  matrix(1:2,ncol=1) # (A)
 - - - -
  matrix(1:2,nrow=1) # (B)
 - - - -

Note that cases marked (A) do not generate a warning, although they
should. This is the cost of using drop(STATS), which allows cases
marked (B) without a warning. In my opinion, cases (B) make sense.

Reproducing the tests is possible using the script
  http://www.cs.cas.cz/~savicky/R-devel/verify_sweep.R

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] sweep sanity checking?

2007-07-25 Thread Petr Savicky
I would like to suggest a patch against R-devel-2007-07-24, which
modifies function sweep by including a warning if dim(STATS) is not
consistent with dim(x)[MARGIN]. If check.margin=FALSE, the simple test whether
prod(dim(x)[MARGIN]) is a multiple of length(STATS) is performed.
If check.margin=TRUE, then a more restrictive test is used, but a limited
recycling is still allowed without warning. Besides generating a warning
in some situations, there is no other change in the behavior of sweep.
The patch is:

--- R-devel_2007-07-24/src/library/base/R/sweep.R   2007-01-04 17:51:53.0 +0100
+++ R-devel_2007-07-24-sweep/src/library/base/R/sweep.R 2007-07-24 10:56:18.0 +0200
@@ -1,7 +1,21 @@
-sweep <- function(x, MARGIN, STATS, FUN = "-", ...)
+sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=FALSE, ...)
 {
 FUN <- match.fun(FUN)
 dims <- dim(x)
+if (check.margin) {
+dimmargin <- dims[MARGIN]
+dimmargin <- dimmargin[dimmargin > 1]
+dimstats <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
+dimstats <- dimstats[dimstats > 1]
+if (length(dimstats) > length(dimmargin) ||
+any(dimstats != dimmargin[seq(along.with=dimstats)]))
+warning("length(STATS) or dim(STATS) do not match dim(x)[MARGIN]")
+} else if (prod(dims[MARGIN]) %% length(STATS) != 0) {
+if (length(MARGIN) == 1)
+warning("dim(x)[MARGIN] is not a multiple of length(STATS)")
+else
+warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")
+}
 perm <- c(MARGIN, (1:length(dims))[ - MARGIN])
 FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
 }

The patch uses the default check.margin=FALSE, since this is more backward
compatible. Changing the default to check.margin=TRUE would also be fine
with me and also with Ben Bolker, who told me this in a separate email.

Let me include more comments on the stricter test. If check.margin=TRUE,
then the patch tests whether (after deleting possible dimensions with
only one level) dim(STATS) is a prefix of dim(x)[MARGIN]. Hence, for example,
if dim(x)[MARGIN] = c(k1,k2), the cases
  length(STATS) = 1,
  dim(STATS) = k1,
  dim(STATS) = NULL and length(STATS) = k1,
  dim(STATS) = c(k1,k2)
are accepted without warning. On the other hand, if k1 != k2, then,
for example, dim(STATS)= k2, dim(STATS) = c(k2,k1) generate
a warning, although the simple divisibility condition
   length(STATS) divides prod(dim(x)[MARGIN])
is satisfied. The warning is generated, since in the last two cases,
recycling produces incorrect or at least suspicious result.

In the simplest case, when length(MARGIN)=1 and STATS is a vector,
the cases accepted by the stricter test without warning are exactly the
following two: length(STATS) = 1, length(STATS) = dim(x)[MARGIN].
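
A few examples against the patched sweep above, with
dim(x)[MARGIN] = c(2,3) (a sketch):

  x <- array(0, dim = c(2, 3, 4))
  sweep(x, 1:2, 1, check.margin = TRUE)                # no warning
  sweep(x, 1:2, 1:2, check.margin = TRUE)              # no warning: prefix k1
  sweep(x, 1:2, matrix(1, 2, 3), check.margin = TRUE)  # no warning: c(k1,k2)
  sweep(x, 1:2, 1:3, check.margin = TRUE)              # warning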

I tested the patch using the script
  http://www.cs.cas.cz/~savicky/R-devel/verify_sweep1.R
Ben Bolker also tested the patch in his environment.

I appreciate to know the opinion of R core developers on this patch.
Thank you in advance.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] sweep sanity checking?

2007-07-27 Thread Petr Savicky
When I was preparing the patch of sweep submitted on July 25, I was
unaware of the code by Heather Turner. She suggested a very elegant
solution for the case that STATS is a vector and we want to allow
meaningful recycling in full generality. I would like to suggest a
combined solution, which uses Heather Turner's algorithm if
check.margin=FALSE (default) and STATS is a vector, and my previous
algorithm if check.margin=TRUE or STATS is an array. The suggestion is

  # combined from the original code of sweep without warnings and from
  # https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html by Robin Hankin
  # https://stat.ethz.ch/pipermail/r-help/2005-June/074001.html by Heather 
Turner
  # https://stat.ethz.ch/pipermail/r-devel/2007-June/046217.html by Ben Bolker
  # with some further modifications by Petr Savicky
  sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=FALSE, ...)
  {
  FUN <- match.fun(FUN)
  dims <- dim(x)
  dimmargin <- dims[MARGIN]
  if (is.null(dim(STATS))) {
  dimstats <- length(STATS)
  } else {
  dimstats <- dim(STATS)
  check.margin <- TRUE
  }
  s <- length(STATS)
  if (s > prod(dimmargin)) {
  warning("length of STATS greater than the extent of dim(x)[MARGIN]")
  } else if (check.margin) {
  dimmargin <- dimmargin[dimmargin > 1]
  dimstats <- dimstats[dimstats > 1]
  if (length(dimstats) > length(dimmargin) ||
  any(dimstats != dimmargin[seq(along.with=dimstats)]))
  warning("length(STATS) or dim(STATS) do not match dim(x)[MARGIN]")
  } else {
  cumDim <- c(1, cumprod(dimmargin))
  upper <- min(cumDim[cumDim >= s])
  lower <- max(cumDim[cumDim <= s])
  if (upper %% s != 0 || s %% lower != 0)
  warning("STATS does not recycle exactly across MARGIN")
  }
  perm <- c(MARGIN, (1:length(dims))[ - MARGIN])
  FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
  }

Heather presented four examples testing her code:
  sweep(array(1:24, dim = c(4,3,2)), 1, 1:2)# no warning
  sweep(array(1:24, dim = c(4,3,2)), 1, 1:12)   # no warning
  sweep(array(1:24, dim = c(4,3,2)), 1, 1:24)   # no warning
  sweep(array(1:24, dim = c(4,3,2)), 1:2, 1:3)  # warning
The second and third examples are not really correct, since STATS extends
also to dimensions not included in MARGIN. The problem is more clearly visible
for example in
  sweep(array(1:24, dim = c(4,4,3,3,2,2)), c(1,3), 1:12)
where MARGIN clearly has to contain two dimensions explicitly.
So, I use the examples with a larger margin corresponding to STATS
as follows
  sweep(array(1:24, dim = c(4,3,2)), 1, 1:2)# no warning
  sweep(array(1:24, dim = c(4,3,2)), 1:2, 1:12) # no warning
  sweep(array(1:24, dim = c(4,3,2)), 1:3, 1:24) # no warning
  sweep(array(1:24, dim = c(4,3,2)), 1:2, 1:3)  # warning
The current proposal for sweep indeed gives no warning in the first
three examples and gives a warning in the last one.

I did not use the suggestion to call the option warn with default
warn = getOption("warn"). The reason is that there are two
different decisions:
 (1) whether to generate a warning
 (2) what to do with the warning, if it is generated.
The warn option influences (2): the warning may be suppressed,
printed after the return to the top level, printed immediately or
it may be converted to an error. I think that the option
controlling (2) should not be mixed with an option which
controls (1). If R had an option controlling to what extent
recycling is allowed, then it could be used here, but warn
has a different purpose.

I appreciate feedback.

Petr.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Optimization in R

2007-08-05 Thread Petr Savicky
I would like to add a remark and a question.

Remark.

There is a part of R, which allows the user to select
among several methods for the same task and also to add
his own C code: random number generation. However, the interface
for optimization is more complex. In my opinion, looking
for a unified interface for this is desirable, but it is
a research problem, not a suggestion for an immediate
code modification.

Question.

Is there a way to optimize a function written in C
using optim? This would be very useful if the optimization
needs a lot of iterations. This may be done by defining
an R function which does nothing more than call .C with
appropriate parameters, but this loses efficiency. A more
efficient solution could be adding a specified entry point
(or several, if derivatives are also available), similar to
the user defined random number generator. Then, a parameter
of optim could control whether the function to be optimized
is fn or the C entry point.
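
For concreteness, the wrapper approach looks as follows (a sketch;
"myfn" is a hypothetical C routine already loaded with dyn.load):

  fwrap <- function(par)
      .C("myfn", as.double(par), n = as.integer(length(par)),
         value = double(1))$value
  optim(c(0, 0), fwrap)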

Petr Savicky.

> I don't have an example of that but that does not make it less
> desirable.  If one wants to use method 1, 2 or 3 then one can
> use optim with a method= but if one wants to use methods 4
> or 5 then one must use an entirely different function.  Surely
> it would be better to be consistent from the user's viewpoint
> and allow all of them to work consistently through the same
> interface.
> 
> On 8/4/07, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> > On 04/08/2007 2:53 PM, Gabor Grothendieck wrote:
> > > The example of generic functions.
> >
> > Show me an example where we have a list of ways to do a calculation
> > passed as an argument (analogous to the method argument of optim), where
> > the user is allowed to add his own function to the list.
> >
> > Duncan Murdoch
> > >

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Optimization in R

2007-08-05 Thread Petr Savicky
I am sorry for omitting a citation in my previous post.
The complete message is as follows (my text unchanged). PS

I would like to add a remark and a question.

Remark.

There is a part of R, which allows the user to select
among several methods for the same task and also to add
his own C code: random number generation. However, the interface
for optimization is more complex. In my opinion, looking
for a unified interface for this is desirable, but it is
a research problem, not a suggestion for an immediate
code modification.

Question.

Is there a way to optimize a function written in C
using optim? This would be very useful if the optimization
needs a lot of iterations. This may be done by defining
an R function which does nothing more than call .C with
appropriate parameters, but this loses efficiency. A more
efficient solution could be adding a specified entry point
(or several, if derivatives are also available), similar to
the user defined random number generator. Then, a parameter
of optim could control whether the function to be optimized
is fn or the C entry point.

Petr Savicky.

On Sat, Aug 04, 2007 at 06:56:47PM -0400, Gabor Grothendieck wrote:
> I don't have an example of that but that does not make it less
> desirable.  If one wants to use method 1, 2 or 3 then one can
> use optim with a method= but if one wants to use methods 4
> or 5 then one must use an entirely different function.  Surely
> it would be better to be consistent from the user's viewpoint
> and allow all of them to work consistently through the same
> interface.
> 
> On 8/4/07, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> > On 04/08/2007 2:53 PM, Gabor Grothendieck wrote:
> > > The example of generic functions.
> >
> > Show me an example where we have a list of ways to do a calculation
> > passed as an argument (analogous to the method argument of optim), where
> > the user is allowed to add his own function to the list.
> >
> > Duncan Murdoch
> > >

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Optimization in R

2007-08-06 Thread Petr Savicky
Thank you for your response. This is a good idea. Although I write
my own packages, some of them using other parts of the R API, I had
missed the optimization routines. Thanks again.

Petr Savicky.

On Mon, Aug 06, 2007 at 07:16:11AM -0700, Thomas Lumley wrote:
> On Mon, 6 Aug 2007, Petr Savicky wrote:
> 
> >Question.
> >
> >Is there a way how to optimize a function written in C
> >using optim?
> 
> The algorithms used by optim are all accessible from C. The manual 
> "Writing R Extensions" has a section on "The R API", including the 
> optimization routines.
> 
>   -thomas
>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] sweep sanity checking?

2007-08-07 Thread Petr Savicky
Thanks to Martin Maechler for his comments, advice and for pointing
out the speed problem. Thanks also to Ben Bolker for tests of speed,
which confirm that for small arrays a slowdown by a factor of about
1.2 - 1.5 may occur. Now, I would like to present a new version of sweep,
which is simpler and has an option to avoid the test. This is expected
to be used in scripts, where the programmer is quite sure that the
usage is correct and speed is required. The new version differs from
the previous one in the following:

1. The option check.margin has a different meaning. It defaults to TRUE
   and it determines whether the test is performed or not.

2. Since check.margin has the meaning above, it cannot be used
   to select, which test should be performed. This depends on the
   type of STATS. The suggested sweep function contains two tests:
   - a vector test by Heather Turner, which is used, if STATS 
 has no dim attribute and, hence, is a vector (STATS should
 not be anything else than a vector or an array)
   - an array test used if STATS has dim attribute.
   The vector test allows some kinds of recycling, while the array test
   does not. Hence, in the most common case, where x is a matrix
   and STATS is a vector, if the user likes to be warned if the length
   of the vector is not exactly the right one, the following call is
   suggested: sweep(x,MARGIN,as.array(STATS)). Otherwise, a warning
   will be generated only if length(STATS) does not divide the specified
   dimension of x, which is nrow(x) (MARGIN=1) or ncol(x) (MARGIN=2).

3. If STATS is an array, then the test is more restrictive than in
   the previous version. It is now required that after deleting
   dimensions with one level, the remaining dimensions coincide.
   The previous version allowed additionally the cases, when dim(STATS)
   is a prefix of dim(x)[MARGIN], for example, if dim(STATS) = k1 and
   dim(x)[MARGIN] = c(k1,k2).

The code of the tests in the suggested sweep is based on the previous 
suggestions
 https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html by Robin Hankin
 https://stat.ethz.ch/pipermail/r-help/2005-June/074001.html by Heather Turner
 https://stat.ethz.ch/pipermail/r-devel/2007-June/046217.html by Ben Bolker
with some further modifications.

The modification of sweep.Rd was prepared by Ben Bolker and me.

I would like to encourage everybody who would like to express an opinion
on the patch to do so now. In my opinion, the suggestion of the
new code has stabilized in the sense that I will not modify it unless
there is negative feedback.

A patch against R-devel_2007-08-06 is attached. It contains tabs. If they
are corrupted by email transfer, use the link
  http://www.cs.cas.cz/~savicky/R-devel/patch-sweep
which is an identical copy.

Petr Savicky.


--- R-devel_2007-08-06/src/library/base/R/sweep.R   2007-07-27 17:51:13.0 +0200
+++ R-devel_2007-08-06-sweep/src/library/base/R/sweep.R 2007-08-07 10:30:12.383672960 +0200
@@ -14,10 +14,29 @@
 #  A copy of the GNU General Public License is available at
 #  http://www.r-project.org/Licenses/
 
-sweep <- function(x, MARGIN, STATS, FUN = "-", ...)
+sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...)
 {
 FUN <- match.fun(FUN)
 dims <- dim(x)
+   if (check.margin) {
+   dimmargin <- dims[MARGIN]
+   dimstats <- dim(STATS)
+   lstats <- length(STATS)
+   if (lstats > prod(dimmargin)) {
+   warning("length of STATS greater than the extent of 
dim(x)[MARGIN]")
+   } else if (is.null(dimstats)) { # STATS is a vector
+   cumDim <- c(1, cumprod(dimmargin))
+   upper <- min(cumDim[cumDim >= lstats])
+   lower <- max(cumDim[cumDim <= lstats])
+   if (upper %% lstats != 0 || lstats %% lower != 0)
+   warning("STATS does not recycle exactly across 
MARGIN")
+   } else {
+   dimmargin <- dimmargin[dimmargin > 1]
+   dimstats <- dimstats[dimstats > 1]
+   if (length(dimstats) != length(dimmargin) || 
any(dimstats != dimmargin))
+   warning("length(STATS) or dim(STATS) do not 
match dim(x)[MARGIN]")
+   }
+   }
 perm <- c(MARGIN, (1:length(dims))[ - MARGIN])
 FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
 }
--- R-devel_2007-08-06/src/library/base/man/sweep.Rd    2007-07-27 17:51:35.0 +0200
+++ R-devel_2007-08-06-sweep/src/library/base/man/sweep.Rd  2007-08-07 10:29:45.517757200 +0200
@@ -11,7 +11,7 @@
   statistic.
 }
 \usage{
-sweep(x, MARGIN, STATS, FUN="-", \dots)
-sweep(x, MARGIN, STATS, FUN="-", \dots)
+sweep(x, MARGIN, STATS, FUN="-", check.margin=TRUE, \dots)
Re: [Rd] [Fwd: behavior of L-BFGS-B with trivial function triggers bug in stats4::mle]

2007-08-14 Thread Petr Savicky
On Mon, Aug 13, 2007 at 05:49:51PM -0400, Ben Bolker wrote:
[snip]
> an undefined condition), but it leads to a bug in stats4::mle --
> a spurious error saying that a better fit
> has been found during profiling if one tries to profile
> a 1-parameter model that was originally fitted with "L-BFGS-B".
[snip]

Could you also include a script which reproduces the problem? Just
to see under which conditions the problem occurs and what it
looks like exactly.

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] paste() with NAs .. change worth persuing?

2007-08-23 Thread Petr Savicky
On Wed, Aug 22, 2007 at 08:53:39PM +0300, Jari Oksanen wrote:
> 
> On 22 Aug 2007, at 20:16, Duncan Murdoch wrote:
> > A fairly common use of paste is to put together reports for human
> > consumption.  Currently we have
> >
> >> p <- as.character(NA)
> >> paste("the value of p is", p)
> > [1] "the value of p is NA"
> >
> > which looks reasonable. Would this become
> >
> >> p <- as.character(NA)
> >> paste("the value of p is", p)
> > [1] NA
> >
> > under your proposal?  (In a quick search I was unable to find a real
> > example where this would happen, but it would worry me...)
> 
> At least stop() seems to include such a case:
> 
>   message <- paste(args, collapse = "")
> 
> and we may expect there are NAs sometimes in stop().

The examples show that changing the behavior of paste in general
may not be appropriate. On the other hand, if we concatenate
character vectors which are part of data, then is.na(paste(...,NA,...))
makes sense. Character vectors in data are usually represented
by factors. On the other hand, factors are not typical in cases
where paste is used to produce a readable message. Hence, it
could be possible to have is.na(u[i]) for those i for which
one of the vectors v1, ..., vn in
  u <- paste(v1, ..., vn)
is a factor and has NA at the i-th position.
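
A small sketch of the current behavior and the proposed change:

  f <- factor(c("a", NA))
  paste(f, "x")   # currently: "a x"  "NA x"
  # under the proposal, the second element would be NA,
  # since f is a factor with NA at the 2nd position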

Petr Savicky.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

