[Rd] opening graphics device with a given number
I am preparing a script which uses three graphical windows for different types of plots. Before each plot, the device can be chosen by dev.set(k) for k = 2, 3, 4. The function dev.set(k) does not create the required device if it is not open already. So, instead, I use dev.add(k), defined as

  dev.add <- function(k) {
      stopifnot(k >= 2 && k <= 63)
      while (all(dev.list() != k)) {
          dev.new()
      }
      dev.set(k)
  }

If device k is not open yet, then all the devices with numbers smaller than k which are not open already are created as well. Usually, I use a contiguous interval of device numbers, but sometimes it could be useful to open, say, device 4 even if device 3 is not needed. Is there a way to open a device with a given number without opening the devices with smaller numbers?

Petr Savicky.
Re: [Rd] opening graphics device with a given number
On Sat, Nov 13, 2010 at 05:09:56AM -0800, Henrik Bengtsson wrote:
> See devSet() in R.utils, e.g.
>
> > library("R.utils");
> > graphics.off();
> > devSet(6);
> > devSet("myFigure");  # Also possible
> > print(devList());
> myFigure Device 6
> 26

The function devSet() works fine and I will use it. Thank you very much for this information.

Petr Savicky.
Re: [Rd] Wait for user input with readline()
On Mon, Dec 06, 2010 at 08:25:39AM -0800, Alexandre wrote:
>
> Hi,
>
> I have a similar problem as the one of Nate. The point is that I want to
> design an interactive script that needs the value of two variables (x and y).
>
> So my script as designed for the moment is:
>
> x <- as.numeric(readline(prompt="What is the value of x? "))
> y <- as.numeric(readline(prompt="What is the value of y? "))
>
> x
> y
>
> But the problem is that if I run this script, values returned for x and y
> will be "NA" like you can see below:

How do you call your code? The function readline() does not wait for user input in a non-interactive session, for example under R CMD BATCH or Rscript. Another situation where readline() does not wait is when you copy a block of code and paste it into a running session, even if it is interactive. If readline() is not the last line of the pasted code, then the next line of the code is used instead of the user input.

Petr Savicky.
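If the script also has to run under Rscript started from a terminal, one possible workaround is to read from the standard input connection, which does wait for input in that situation. The following is only a sketch with illustrative prompt texts, not a tested recipe:

  read.value <- function(prompt) {
      if (interactive()) {
          # in an interactive session readline() behaves as expected
          ans <- readline(prompt = prompt)
      } else {
          # under Rscript, read one line from the standard input instead
          cat(prompt)
          ans <- readLines(con = "stdin", n = 1)
      }
      as.numeric(ans)
  }

  x <- read.value("What is the value of x? ")
  y <- read.value("What is the value of y? ")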
Re: [Rd] print(...,digits=2) behavior
On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:
> >>>>> Ben Bolker
> >>>>>     on Sat, 5 Feb 2011 15:58:09 -0500 writes:
>
> > A bug was recently posted to the R bug database (which
> > probably would better have been posted as a query here) as
> > to why this happens:
>
> >> print(7.921,digits=2)
> > [1] 8
> >> print(7.92,digits=2)
> > [1] 7.9
>
> > Two things I *haven't* done to help make sense of this
> > for myself are (1) writing out the binary representations
> > to see if something obvious pops out about why this would
> > be a breakpoint and (2) poking in the source code (I did a
> > little bit of this but gave up).
>
> > I know that confusion over rounding etc. is very common,
> > and my first reaction to this sort of question is always
> > that there must be some sort of confusion (either (1) in a
> > FAQ 7.31-ish sort of way that floating point values have
> > finite precision in the first place, or (2) a confusion
> > over the difference between the value and the
> > representation of the number, or (3) more subtly, about
> > the differences among various choices of rounding
> > conventions).
>
> > However, in this case I am a bit stumped: I don't see
> > that any of the standard confusions apply. I grepped the
> > R manuals for "rounding" and didn't find anything useful
> > ... I have a very strong prior that such a core part of R
> > must be correct, and that therefore I (and the original
> > bug reporter) must be misunderstanding something.
>
> > Thoughts/references?
>
> I had started to delve into this after you've posted the bug
> report. It is clearly a bug(let),
> caused by code that has been in R from its very
> beginning, at least in the first source code I've seen in 1994.
>
> The problem is not related to digits=2,
> but using such a low number of digits shows it more
> dramatically, e.g., also
>
> > print(5.9994, digits=4)
> [1] 5.999
> > print(5.9994001, digits=4)
> [1] 6

I think that this may be a consequence of the following analysis, which I did some time ago using the source code of the scientific() function in src/main/format.c. The conclusion was the following.

Printing to k digits looks for the smallest number of digits l <= k such that the relative error of the printed mantissa (significand) is at most 10^-k. If the mantissa begins with a digit less than 5, then this condition is stronger than having absolute error at most 5*10^-k. So, in this case, we cannot get lower accuracy than rounding to k digits. If the mantissa begins with 5 or more, then having relative error at most 10^-k is a weaker condition than having absolute error at most 5*10^-k. In this case, the chosen l may be smaller than k even if the printed number has larger error than the number rounded to k digits. This may be seen in the following example

  xx <- 8 + (6:10)*10^-7
  for (x in xx) print(x)
  [1] 8
  [1] 8
  [1] 8
  [1] 8.000001
  [1] 8.000001

where all the printed numbers are 8.000001 if rounded to 7 digits. The cases

  print(7.921, digits=2)
  [1] 8
  print(7.92, digits=2)
  [1] 7.9

seem to have the same explanation. The relative errors of the numbers rounded to a single digit are

  (8 - 7.921)/7.921
  [1] 0.009973488
  (8 - 7.92)/7.92
  [1] 0.01010101

In the first case, this is less than 10^-2 and so l = 1 is used. In the second case, the relative error for l = 1 is larger than 10^-2 and so l = 2 is chosen.
In the cases

  print(5.9994, digits=4)
  [1] 5.999
  print(5.9994001, digits=4)
  [1] 6

the relative errors of the one digit output are

  (6 - 5.9994)/5.9994
  [1] 0.00010001
  (6 - 5.9994001)/5.9994001
  [1] 9.999333e-05

Here, the one digit output is chosen if the relative error is less than 10^-4 and not otherwise.

Petr Savicky.
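The rule described above can be written down as a small R function. This is only an illustrative approximation of the algorithm in scientific(), not the C code itself, and the function name is made up for the example:

  # smallest number of significant digits l <= digits such that rounding
  # x to l digits has relative error below 10^-digits
  predictDigits <- function(x, digits = getOption("digits")) {
      for (l in 1:digits) {
          if (abs(signif(x, l) - x) < 10^-digits * abs(x)) return(l)
      }
      digits
  }

  predictDigits(7.921, digits=2)   # 1, so print() shows 8
  predictDigits(7.92,  digits=2)   # 2, so print() shows 7.9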
Re: [Rd] print(...,digits=2) behavior
On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:
> >>>>> Ben Bolker
> >>>>>     on Sat, 5 Feb 2011 15:58:09 -0500 writes:
>
> > A bug was recently posted to the R bug database (which
> > probably would better have been posted as a query here) as
> > to why this happens:
>
> >> print(7.921,digits=2)
> > [1] 8
> >> print(7.92,digits=2)
> > [1] 7.9
> [...]
> I had started to delve into this after you've posted the bug
> report. It is clearly a bug(let),
> caused by code that has been in R from its very
> beginning, at least in the first source code I've seen in 1994.
>
> The problem is not related to digits=2,
> but using such a low number of digits shows it more
> dramatically, e.g., also
>
> > print(5.9994, digits=4)
> [1] 5.999
> > print(5.9994001, digits=4)
> [1] 6
>
> Interestingly, the problem seems *not* to be present for
> digits = 1 (only).
>
> I haven't found time to mathematically "analyze" it for
> determining a correct solution though.
> Note that fixing this bug(let) will probably (very slightly)
> change a huge number of R outputs .. so there is a caveat,
> but nonetheless, we must approach it.
>
> The responsible code is the C function scientific()
> in src/main/format.c

I inspected the source of scientific() and formatReal() (2.13.0, revision 2011-02-08 r54284). Let me point out an example of the difference between the output of print() and rounding to "digits" significant digits, which is slightly different from the previous ones, since also the exponent changes. Namely,

  print(9.991, digits=3)
  [1] 10

while rounding to 3 digits yields 9.99. The reason is in scientific(), where the situation that rounding increases the exponent is tested using

  /* make sure that alpha is in [1,10) AFTER rounding */
  if (10.0 - alpha < eps*alpha) {
      alpha /= 10.0;
      kp += 1;
  }

Here, eps is determined in formatReal() as

  double eps = pow(10.0, -(double)R_print.digits);

so we have eps = 10^-digits. The above condition on alpha is equivalent to

  alpha > 10.0/(1 + 10^-digits)

For digits=3, this is

  alpha > 9.99000999000999

This bound may be verified as follows

  print(9.9900099900, digits=3)
  [1] 9.99
  print(9.9900099901, digits=3)
  [1] 10

The existing algorithm for choosing the number of digits is designed to predict the format suitable for all numbers in a vector before the actual call of sprintf() or snprintf(). For speed, this algorithm should use the standard double precision, so it is necessarily inaccurate for precision 15 and more, and there may be some rare such cases also for smaller precisions. For smaller precisions, say below 7, the algorithm can be made more precise. This would change the output in rare cases and mainly for printing single numbers. If a vector is printed, then the format is typically not determined by the problematic numbers.

Changing the default behavior may be undesirable for backward compatibility reasons. If this is the case, then a possible solution is to make the documentation more precise on this and include pointers to possible solutions. The functions sprintf(), formatC() and signif() may be used. In particular, if signif() is used to round the numbers before printing, then we get the correct output

  print(signif(7.921, digits=2))
  [1] 7.9
  print(signif(9.9900099901, digits=3))
  [1] 9.99

The current ?print.default contains

  digits: a non-null value for ‘digits’ specifies the minimum number of
          significant digits to be printed in values.

The word "minimum" here means that all numbers in the vector will have at least the chosen number of digits, but some may have more.
I suggest adding "See 'options' for more detail." The current ?options contains

  ‘digits’: controls the number of digits to print when printing numeric
            values. It is a suggestion only.

I suggest extending this to

  It is a suggestion only and the actual number of printed digits may be
  smaller, if the relative error of the output number is less than
  10^-digits. Use 'signif(, digits)' before printing to get the exact
  number of the printed significant digits.

I would appreciate the opinion of R developers on this.

Petr Savicky.
Re: [Rd] R 2.12.1 Windows 32bit and 64bit - are numerical differences expected?
On Thu, Feb 10, 2011 at 10:37:09PM +1100, Graham Williams wrote:
> Should one expect minor numerical differences between 64bit and 32bit R on
> Windows? Hunting around the lists I've not been able to find a definitive
> answer yet. Seems plausible using different precision arithmetic, but wanted
> to confirm from those who might know for sure.

One source of differences between platforms is different settings of the compiler. On Intel processors, the options may influence whether the registers use an 80 bit or a 64 bit representation of floating point numbers. In memory, it is always 64 bit. Testing whether there is a difference between registers and memory may be done for example using the code

  #include <stdio.h>
  #define n 3

  int main(int argc, char *argv[])
  {
      double x[n];
      int i;
      for (i=0; i
Re: [Rd] print(...,digits=2) behavior
On Mon, Feb 07, 2011 at 09:56:18AM +0100, Martin Maechler wrote:
[...]
> Namely, one important purpose of the test is to ensure that e.g.
>
> print(3.597, digits = 3)
>
> is printed as 3.6 and not 3.60
>
> Now I have had -- since 1997 at least -- an R version of
> scientific() for more easy experimentation.
> Here's an updated version of that:
[...]
>
> Now, I'm waiting for comments/suggestions ..

In a previous email, I discussed the case print(9.991, digits=3). This case is already handled in your R level experimental function scientific(). I noticed this only after sending the previous email.

I would like to suggest a modification of the algorithm. The purpose is to determine the minimum number of digits such that the printed number has the same value as the number rounded to exactly getOption("digits") digits, but a shorter representation.

Algorithm. First, the mantissa of the number rounded to "digits" digits is computed as an integer by the formula

  a <- floor(alpha*10^(digits - 1) + 0.5)

The obtained integer number a satisfies

  10^(digits-1) <= a <= 10^digits

The case a = 10^digits corresponds to the situation that we have to increase the exponent. This may be tested either before the remaining part of the code or as the last step as below. For example, if alpha = 3.597 and digits = 3, we get a = 360. The question whether (digits - 1) digits are sufficient for printing the number is equivalent to the condition that "a" is divisible by 10. Similarly, (digits - 2) digits are sufficient if and only if "a" is divisible by 100. This may be tested using the following code

  nsig <- digits
  for (j in 1:nsig) {
      a <- a / 10
      if (a == floor(a)) {
          nsig <- nsig - 1
      } else {
          break
      }
  }
  ## nsig == 0 if and only if we started with a == 10^digits
  if (nsig == 0) {
      nsig <- 1
      kpower <- kpower + 1
  }

This code uses double precision for a, since values up to 10^digits may occur and digits may be 15 or more. The algorithm is not exact for digits = 15, but works reasonably. I suggest using a different, slower and more accurate algorithm, if getOption("digits") >= 16.

Please find in the attachment two versions of the above algorithm. One is a modification of your R level function from scientific.R and the other is a patch against src/main/format.c, which I tested. I suggest considering the following test cases. The presented output is obtained using the patch.

  example1 <- c( 7.94, 7.950001, 8.04, 8.050001)
  for (x in example1) print(x, digits=2)
  [1] 7.9
  [1] 8
  [1] 8
  [1] 8.1

  example2 <- c( 3.599499, 3.599501, 3.600499, 3.600501)
  for (x in example2) print(x, digits=4)
  [1] 3.599
  [1] 3.6
  [1] 3.6
  [1] 3.601

  example3 <- c( 1.004999, 1.005001)
  for (x in example3) print(x, digits=3)
  [1] 1
  [1] 1.01

  example4 <- c( 9.994999, 9.995001)
  for (x in example4) print(x, digits=3)
  [1] 9.99
  [1] 10

I appreciate comments.

Petr Savicky.
###--- R function that does the same as 'scientific' in
###--- /usr/local/R-0.50/src/main/format.c
###--- ~~~||~~

scientific1 <- function(x, digits = getOption('digits'))
    ## eps = 10^-(digits)
{
    ##- /* for 1 number x , return
    ##-  *  sgn    = 1_{x < 0}  {0/1}
    ##-  *  kpower = Exponent of 10;
    ##-  *  nsig   = min(digits, #{significant digits of alpha})
    ##-  *
    ##-  * where |x| = alpha * 10^kpower  and  1 <= alpha < 10
    ##-  */
    eps <- 10 ^(-digits)
    if (x == 0) {
        kpower <- 0
        nsig <- 1
        sgn <- 0
    } else {
        if (x < 0) {
            sgn <- 1; r <- -x
        } else {
            sgn <- 0; r <- x
        }
        kpower <- floor(log10(r)); ##--> r = |x| ; 10^k <= r
        if (kpower <= -308) ## close to denormalization -- added for R 2.x.y
            alpha <- (r * 1e+30) / 10^(kpower+30)
        else
            alpha <- r / 10^kpower
        ## "a" integer, 10^(digits-1) <= a <= 10^digits
        a <- floor(alpha*10^(digits - 1) + 0.5)
        nsig <- digits
        for (j in 1:nsig) {
            a <- a / 10
            if (a == floor(a)) {
                nsig <- nsig - 1
            } else {
                break
            }
        }
        if (nsig == 0) {
            nsig <- 1
            kpower <- kpower + 1
        }
    }
    left <- kpower + 1
    c(sgn = sgn, kpower = kpower, nsig = nsig, left = left,
      right = nsig - left, sleft = sgn + max(1, left))
}

diff --minimal -U 5 -r R-devel/src/main/format.c R-devel-print/src/main/format.c
--- R-devel/src/main/format.c	2010-04-27 17:52:24.0 +0200
+++ R-devel-print/src/main/format.c	201
Re: [Rd] R command line and pipe using in Linux?
On Mon, Feb 14, 2011 at 05:40:29PM +, Hang PHAN wrote:
> Hi,
> I have a very large data file (GB) from which I only want to extract one
> column to draw a histogram. This would be done several times, so I would like
> to ask if there is any way to plot this using R from the linux command line,
> something like this
>
> cut -f1 xxx.txt | RplotHist

Hi Hang: Can you use something like the following?

  x <- as.numeric(system("cut -f1 xxx.txt", intern=TRUE))

According to ?system, long lines will be split; however, no limit on the number of lines of the output is formulated there.

Petr Savicky.
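Another possibility is to read the command output through a pipe connection; a small sketch, using the same file name as in the example above:

  # read the first column of xxx.txt through a pipe and plot it
  x <- scan(pipe("cut -f1 xxx.txt"))
  hist(x)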
[Rd] error in memCompress() in make check
Dear R developers: When i run "make check" in R-patched_2011-02-12 and R-devel_2011-02-22 on a specific machine, the test fails and the file tests/Examples/base-Ex.Rout.fail ends with > txt.xz <- memCompress(txt, "x") Error in memCompress(txt, "x") : internal error 5 in memCompress Execution halted The error may be reproduced using commands txt <- readLines(file.path(R.home("doc"), "COPYING")) txt.xz <- memCompress(txt, "x") in both the above installed versions. The machine is CentOS release 5.4 (Final) under VMware. I did not observe this error on other machines, some of which are also CentOS. For the development version, sessionInfo() is R version 2.13.0 Under development (unstable) (2011-02-22 r54523) Platform: i686-pc-linux-gnu (32-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=en_US.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base gcc --version gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46) Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] error in memCompress() in make check
On Tue, Feb 22, 2011 at 11:25:50AM +, Prof Brian Ripley wrote:
> So it seems that there is something wrong with the liblzma library
> used on that machine. Did it use the version supplied with R or an
> external library (which is the default if one is found)? My first
> step would be to force the internal version via --with-system-xz=no.

Thank you for your reply. I tried

  ./configure --with-x=no --with-system-xz=no

in a clean R-devel_2011-02-22 and the result of make check is the same. The commands

  txt <- readLines(file.path(R.home("doc"), "COPYING"))
  txt.xz <- memCompress(txt, "x")

do not produce an error if the compiled R runs in the same shell where "make check" was run. However, they produce the error if R is started in a new shell. The command

  find /usr -name "liblzma*"

has empty output.

Petr Savicky.
Re: [Rd] error in memCompress() in make check
On Tue, Feb 22, 2011 at 01:38:07PM +0100, Petr Savicky wrote:
...
> The commands
>
> txt <- readLines(file.path(R.home("doc"), "COPYING"))
> txt.xz <- memCompress(txt, "x")
>
> do not produce an error, if the compiled R runs in the same shell,
> where "make check" was run. However, they produce the error, if R is
> started in a new shell.

Although I did see the above two lines produce no error on my screen, the change of the behavior is not reproducible. I am sorry, I probably mixed up windows or something. According to a repeated test, the above two lines produce the error

  Error in memCompress(txt, "x") : internal error 5 in memCompress

on the machine described in the first email, even if --with-system-xz=no was used for configuration.

Petr Savicky.
Re: [Rd] Anomaly with unique and match
On Wed, Mar 09, 2011 at 08:48:10AM -0600, Terry Therneau wrote: > I stumbled onto this working on an update to coxph. The last 6 lines > below are the question, the rest create a test data set. > > tmt585% R > R version 2.12.2 (2011-02-25) > Copyright (C) 2011 The R Foundation for Statistical Computing > ISBN 3-900051-07-0 > Platform: x86_64-unknown-linux-gnu (64-bit) > > # Lines of code from survival/tests/singtest.R > > library(survival) > Loading required package: splines > > test1 <- data.frame(time= c(4, 3,1,1,2,2,3), > + status=c(1,NA,1,0,1,1,0), > + x= c(0, 2,1,1,1,0,0)) > > > > temp <- rep(0:3, rep(7,4)) > > > > stest <- data.frame(start = 10*temp, > + stop = 10*temp + test1$time, > + status = rep(test1$status,4), > + x = c(test1$x+ 1:7, rep(test1$x,3)), > + epoch = rep(1:4, rep(7,4))) > > > > fit1 <- coxph(Surv(start, stop, status) ~ x * factor(epoch), stest) > > ## New lines > > temp1 <- fit1$linear.predictor > > temp2 <- as.matrix(temp1) > > match(temp1, unique(temp1)) > [1] 1 2 3 4 4 5 6 7 7 7 6 6 6 8 8 8 6 6 6 9 9 9 6 6 > > match(temp2, unique(temp2)) > [1] 1 2 3 4 4 5 6 7 7 7 6 6 6 NA NA NA 6 6 6 8 8 8 > 6 6 > > --- > > I've solved it for my code by not calling match on a 1 column vector. > In general, however, should I be using some other paradym for this "map > to unique" operation? For example match(as.character(x), > unique(as.character(x)) ? Let me suggest an alternative, which is consistent with unique() on numeric vectors and uses a transformation of the column using rank(). For example, temp3 <- as.matrix(rank(temp1, ties.method="max")) match(temp3, unique(temp3)) [1] 1 2 3 4 4 5 6 7 7 7 6 6 6 8 8 8 6 6 6 9 9 9 6 6 Can this be used in your code? Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] unique.matrix issue [Was: Anomaly with unique and match]
On Wed, Mar 09, 2011 at 02:11:49PM -0500, Simon Urbanek wrote:
> match() is a red herring here -- it is really a very specific thing that has
> to do with the fact that you're running unique() on a matrix. Also it's much
> easier to reproduce:
>
> > x=c(1,1+0.2e-15)
> > x
> [1] 1 1
> > sprintf("%a",x)
> [1] "0x1p+0"               "0x1.0000000000001p+0"
> > unique(x)
> [1] 1 1
> > sprintf("%a",unique(x))
> [1] "0x1p+0"               "0x1.0000000000001p+0"
> > unique(matrix(x,2))
>      [,1]
> [1,]    1
>
> and this comes from the fact that unique.matrix uses string representation
> since it has to take into account all values of a row/column so it pastes all
> values into one string, but for the two numbers that is the same:
>
> as.character(x)
> [1] "1" "1"

I understand the use of match() in the original message by Terry Therneau as an example of a situation where the behavior of unique.matrix() becomes visible even without looking at the last bits of the numbers. Let me suggest to consider the following example.

  x <- 1 + c(1.1, 1.3, 1.7, 1.9)*1e-14
  a <- cbind(rep(x, each=2), 2)
  rownames(a) <- 1:nrow(a)

The correct set of rows may be obtained using

  unique(a - 1)
            [,1] [,2]
  1 1.110223e-14    1
  3 1.310063e-14    1
  5 1.709743e-14    1
  7 1.909584e-14    1

However, due to the use of paste(), which uses as.character(), in unique.matrix(), we also have

  unique(a)
    [,1] [,2]
  1    1    2
  5    1    2

Let me suggest to consider a transformation of the numeric columns by rank() before the use of paste(). For example

  unique.mat <- function(a) {
      temp <- apply(a, 2, rank, ties.method="max")
      temp <- apply(temp, 1, function(x) paste(x, collapse = "\r"))
      a[!duplicated(temp), , drop=FALSE]
  }

  unique.mat(a)
    [,1] [,2]
  1    1    2
  3    1    2
  5    1    2
  7    1    2

Petr Savicky.
Re: [Rd] unique.matrix issue [Was: Anomaly with unique and match]
On Thu, Mar 10, 2011 at 01:19:48AM -0800, Henrik Bengtsson wrote: > It should be possible to run unique()/duplicated() column by column > and incrementally update the set of unique/duplicated rows. This > would avoid any coercing. The benefit should be even greater for > data.frame():s. This is a good point. An implementation of this using sorting can be done as follows Sort the data frame using function order(). Determine the groups of consecutive equal rows in the sorted df. Map the first row of each group to the original order of the rows. Since sorting by the function order() is stable, we obtain the first in each group of equal rows also in the original order. The coercion approach uses hashing for string comparison, but the efficiency of hashing seems to be overweighted by the inefficiency of the coercion. So, we get the following comparison. a <- matrix(sample(c(1234, 5678), 12*1, replace=TRUE), ncol=12) df <- data.frame(a) do.unique.sort <- function(df) { i <- do.call(order, df) n <- nrow(df) u <- c(TRUE, rowSums(df[i[2:n], ] == df[i[1:(n-1)], ]) < ncol(df)) df[u[order(i)], ] } system.time(out1 <- do.unique.sort(df)) system.time(out2 <- unique(df)) identical(out1, out2) The result may be, for example user system elapsed 0.279 0.000 0.273 user system elapsed 0.514 0.000 0.468 [1] TRUE On another computer user system elapsed 0.058 0.000 0.058 user system elapsed 0.187 0.000 0.188 [1] TRUE On Thu, Mar 10, 2011 at 01:39:56PM -0600, Terry Therneau wrote: > Simon pointed out that the issue I observed was due to internal > behaviour of unique.matrix. > > I had looked carefully at the manual pages before posting the question > and this was not mentioned. Perhaps an addition could be made? According to the description of unique(), the user may expect that if b is obtained using b <- unique(a) then for every "i" there is "j", such that all(a[i, ] == b[j, ]) This is usually true, but not always, because among several numbers in "a" with the same as.character() only one remains in "b". If this is intended, then i support the suggestion to include a note in the documentation. Let me add an argument against using as.character() to determine, whether two numbers are close. The maximum relative difference between the numbers, which have the same 15 digit decimal representation, varies by a factor up to 10 in different ranges. Due to this, we have x <- 1 + c(1.1, 1.3, 1.7, 1.9)*1e-14 unique(as.character(x)) [1] "1.01" "1.02" unique(as.character(9*x)) [1] "9.1" "9.12" "9.15" "9.17" The relative differences between components of 9*x are the same as the relative differences in x, but if the mantissa begins with 9, then a smaller relative difference is sufficient to change 15-th digit. In terms of unique(), this implies nrow(unique(cbind(x))) [1] 2 nrow(unique(cbind(9*x))) [1] 4 Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] round, unique and factor
On Mon, Mar 21, 2011 at 10:46:41AM -0500, Terry Therneau wrote: > Survfit had a bug in some prior releases due to the use of both > unique(times) and table(times); I fixed it by rounding to 15 digits per > the manual page for as.character. Yes, I should ferret out all the > usages instead, but this was fast and it cured the user's problem. > The bug is back! A data set from a local colleage triggers it. > I can send the rda file to anyone who wishes. > > The current code has > digits <- floor((.Machine$double.digits) * > logb(.Machine$double.base,10)) #base 10 digits > Y[,1] <- signif(Y[,1], digits) > > which gives 15 digits; should I subtract one more? > > Should the documentation change? > > In the meantime I'm looking at the more permanent fix of turning time > into a factor, then back at the very end. Because it is a bigger change > the potential for breakage is higer, however. Can you describe the error in more detail? Is it related to consistency of converting a number to character and back? Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] testing presence of pdflatex in R CMD check
The nodes of our cluster do not have pdflatex installed. When running R CMD check there on a package with no errors in documentation, then R-2.13.0-alpha and R-2.12.2 report a possible error in Rd files, while R-2.11.1 did not. The platform is 64 bit CentOS. The output of R CMD check tree_1.0-28.tar.gz under the above three versions of R contains the following. R-2.11.1 stderr sh: pdflatex: command not found sh: latex: command not found stdout * checking for working pdflatex ... NO R-2.12.2 stderr sh: pdflatex: command not found Error in texi2dvi("Rd2.tex", pdf = (out_ext == "pdf"), quiet = FALSE, : unable to run 'pdflatex' on 'Rd2.tex' Error in running tools::texi2dvi stdout * checking PDF version of manual ... WARNING LaTeX errors when creating PDF version. This typically indicates Rd problems. * checking PDF version of manual without hyperrefs or index ... ERROR Re-running with no redirection of stdout/stderr. Hmm ... looks like a package You may want to clean up by 'rm -rf /tmp/RtmpgnWKRW/Rd2pdf3723367e' 2.13.0-alpha stderr Error in texi2dvi("Rd2.tex", pdf = (out_ext == "pdf"), quiet = FALSE, : pdflatex is not available Error in running tools::texi2dvi stdout * checking PDF version of manual ... WARNING LaTeX errors when creating PDF version. This typically indicates Rd problems. * checking PDF version of manual without hyperrefs or index ... ERROR Re-running with no redirection of stdout/stderr. Hmm ... looks like a package You may want to clean up by 'rm -rf /tmp/RtmpQ0WawT/Rd2pdf41481e1' Is it intentional not to test the presence of pdflatex during R CMD check? Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] duplicates() function
On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
> I need a function which is similar to duplicated(), but instead of
> returning TRUE/FALSE, returns indices of which element was duplicated.
> That is,
>
> > x <- c(9,7,9,3,7)
> > duplicated(x)
> [1] FALSE FALSE  TRUE FALSE  TRUE
>
> > duplicates(x)
> [1] NA NA  1 NA  2
>
> (so that I know that element 3 is a duplicate of element 1, and element
> 5 is a duplicate of element 2, whereas the others were not duplicated
> according to our definition.)
>
> Is there a simple way to write this function?

A possible strategy is to use sorting. In a sorted matrix or data frame, the elements which are duplicates of the same element form consecutive blocks. These blocks may be identified using !duplicated(), which determines the first elements of these blocks. Since sorting is stable, when we map these blocks back to the original order, the first element of each block is mapped to the first occurrence of the given row in the original order. An implementation may be done as follows.

  duplicates <- function(dat) {
      s <- do.call("order", as.data.frame(dat))
      non.dup <- !duplicated(dat[s, ])
      orig.ind <- s[non.dup]
      first.occ <- orig.ind[cumsum(non.dup)]
      first.occ[non.dup] <- NA
      first.occ[order(s)]
  }

  x <- cbind(1, c(9,7,9,3,7))
  duplicates(x)
  [1] NA NA  1 NA  2

The line

  orig.ind <- s[non.dup]

creates a vector whose length is the number of non-duplicated rows in the sorted "dat". Its components are indices of the corresponding first occurrences of these rows in the original order. For this, the stability of the order is needed. The lines

  first.occ <- orig.ind[cumsum(non.dup)]
  first.occ[non.dup] <- NA

expand orig.ind to a vector which satisfies: if the i-th row of the sorted "dat" is duplicated, then first.occ[i] is the index of the first row in the original "dat" which is equal to this row. So, the values in first.occ are those which are required for the output of duplicates(), but they are in the order of the sorted "dat". The last line

  first.occ[order(s)]

reorders the vector to the original order of the rows.

Petr Savicky.
Re: [Rd] duplicates() function
On Mon, Apr 11, 2011 at 02:05:11PM -0400, Duncan Murdoch wrote:
> On 08/04/2011 11:39 AM, Joshua Ulrich wrote:
> > On Fri, Apr 8, 2011 at 10:15 AM, Duncan Murdoch wrote:
> >> On 08/04/2011 11:08 AM, Joshua Ulrich wrote:
> >>>
> >>> How about:
> >>>
> >>> y <- rep(NA, length(x))
> >>> y[duplicated(x)] <- match(x[duplicated(x)], x)
> >>
> >> That's a nice solution for vectors.  Unfortunately for me, I have a matrix
> >> (which duplicated() handles by checking whole rows).  So a better example
> >> that I should have posted would be
> >>
> >> x <- cbind(1, c(9,7,9,3,7))
> >>
> >> and I'd still like the same output
> >>
> > For a matrix, could you apply the same strategy used in duplicated()?
> >
> > y <- rep(NA, NROW(x))
> > temp <- apply(x, 1, function(x) paste(x, collapse="\r"))
> > y[duplicated(temp)] <- match(temp[duplicated(temp)], temp)
>
> Since this thread hasn't ended, I will say that I think this solution is
> the best I've seen for my specific problem. I was actually surprised
> that duplicated() did the string concatenation trick, but since it does,
> it makes a lot of sense to do the same in duplicates().

Consistency with duplicated() is a good argument. Let me point out, although it goes beyond the original question, that sorting may be used to compute duplicated() in a way which is more efficient than the paste() approach according to the test below.

  duplicatedSort <- function(df) {
      n <- nrow(df)
      if (n == 1) {
          return(FALSE)
      } else {
          s <- do.call(order, as.data.frame(df))
          equal <- df[s[2:n], , drop=FALSE] == df[s[1:(n-1)], , drop=FALSE]
          dup <- c(FALSE, rowSums(equal) == ncol(df))
          return(dup[order(s)])
      }
  }

The following tests efficiency for a character matrix.

  m <- 1000
  n <- 4
  a <- matrix(as.character(sample(10, m*n, replace=TRUE)), nrow=m, ncol=n)
  system.time(out1 <- duplicatedSort(a))
  system.time(out2 <- duplicated(a))
  identical(out1, out2)
  table(out1)

I obtained, for example,

     user  system elapsed
    0.003   0.000   0.003
     user  system elapsed
    0.012   0.000   0.011
  [1] TRUE
  out1
  FALSE  TRUE
    942    58

For a numeric matrix, the ratio of the running times is larger in the same direction.

Petr Savicky.
Re: [Rd] Dangerous Bug with IF function of R
On Mon, Apr 18, 2011 at 09:12:41AM -0700, salmajj wrote:
> hi!
> there is a bug with the IF operator that is really dangerous!
> please try the code below and if someone could explain to me why (when q is
> equal to 0.8, 0.9 or 1) R does not print it?
>
> q=0
> for (j in 1:11){
>   if ((q==1)){
>     print(q)
>   }
>   q=q+0.1
> }
>
> so in this code q is incremented from 0 to 1.1, but R does not print
> the values 1, 0.8 and 0.9! Try changing it to (q==0.4): it is going to print
> 0.4, but if you put q==0.8 or 0.9 or 1 it doesn't work!
> please try it, it is amazing!

Incrementing a number by 0.1 produces numbers which are not exactly representable in binary, so this operation involves a rounding error. Try the following

  q=0
  for (j in 1:11){
    if ((q==1)){
      print(q)
    }
    q=round(q+0.1, digits=7)
  }

Petr Savicky.
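The accumulated rounding error can be seen directly by printing the value with more digits, or by comparing with all.equal(), which allows a small tolerance; for example:

  q <- 0
  for (j in 1:8) q <- q + 0.1
  q == 0.8                   # FALSE
  print(q, digits=17)        # the stored value differs slightly from 0.8
  isTRUE(all.equal(q, 0.8))  # TRUE, comparison with a tolerance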
Re: [Rd] matrix not positive definite (while it should be)
On Thu, May 05, 2011 at 02:31:59PM -0400, Arthur Charpentier wrote: > I do have some trouble with matrices. I want to build up a covariance matrix > with a hierarchical structure). For instance, in dimension n=10, I have two > subgroups (called REGION). > > NR=2; n=10 > CORRELATION=matrix(c(0.4,-0.25, > -0.25,0.3),NR,NR) > REGION=sample(1:NR,size=n,replace=TRUE) > R1=REGION%*%t(rep(1,n)) > R2=rep(1,n)%*%t(REGION) > SIGMA=matrix(NA,n,n) > > for(i in 1:NR){ > for(j in 1:NR){ > SIGMA[(R1==i)&(R2==j)]=CORRELATION[i,j] > }} > > If I run quickly some simulations, I build up the following matrix > > > CORRELATION > [,1] [,2] > [1,] 0.40 -0.25 > [2,] -0.25 0.30 > > REGION > [1] 2 2 1 1 2 1 2 1 1 2 > > SIGMA >[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] > [1,] 0.30 0.30 -0.25 -0.25 0.30 -0.25 0.30 -0.25 -0.25 0.30 > [2,] 0.30 0.30 -0.25 -0.25 0.30 -0.25 0.30 -0.25 -0.25 0.30 > [3,] -0.25 -0.25 0.40 0.40 -0.25 0.40 -0.25 0.40 0.40 -0.25 > [4,] -0.25 -0.25 0.40 0.40 -0.25 0.40 -0.25 0.40 0.40 -0.25 > [5,] 0.30 0.30 -0.25 -0.25 0.30 -0.25 0.30 -0.25 -0.25 0.30 > [6,] -0.25 -0.25 0.40 0.40 -0.25 0.40 -0.25 0.40 0.40 -0.25 > [7,] 0.30 0.30 -0.25 -0.25 0.30 -0.25 0.30 -0.25 -0.25 0.30 > [8,] -0.25 -0.25 0.40 0.40 -0.25 0.40 -0.25 0.40 0.40 -0.25 > [9,] -0.25 -0.25 0.40 0.40 -0.25 0.40 -0.25 0.40 0.40 -0.25 > [10,] 0.30 0.30 -0.25 -0.25 0.30 -0.25 0.30 -0.25 -0.25 0.30 Hi. If X is a random vector from the 2 dimensional normal distribution with the covariance matrix [,1] [,2] [1,] 0.40 -0.25 [2,] -0.25 0.30 then the vector X[REGION], which consists of replicated components of X, has the expanded covariance matrix n times n, which you ask for. Since the mean and the covariance matrix determine the distribution uniquely, this is also a description of the required distribution. The distribution is concentrated in a 2 dimensional subspace, since the covariance matrix has rank 2. Hope this helps. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
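The expanded covariance matrix above can also be obtained directly by indexing CORRELATION with REGION, and a draw with this covariance matrix can be obtained by replicating the components of a 2 dimensional normal vector, as described in the reply. A small sketch, using the objects CORRELATION, REGION and NR from the example and mvrnorm() from the MASS package as one possible way to draw X:

  # the n x n matrix SIGMA is just CORRELATION indexed by REGION
  SIGMA2 <- CORRELATION[REGION, REGION]

  # one draw with covariance matrix SIGMA2: sample X for the two regions
  # and replicate its components according to REGION
  library(MASS)
  X <- mvrnorm(1, mu = rep(0, NR), Sigma = CORRELATION)
  Y <- X[REGION]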
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote: > I've got another edition of my simulation replication framework. I'm > attaching 2 R files and pasting in the readme. > > I would especially like to know if I'm doing anything that breaks > .Random.seed or other things that R's parallel uses in the > environment. > > In case you don't want to wrestle with attachments, the same files are > online in our SVN > > http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/ > > > ## Paul E. Johnson CRMDA > ## Portable Parallel Seeds Project. > ## 2012-02-18 > > Portable Parallel Seeds Project > > This is how I'm going to recommend we work with random number seeds in > simulations. It enhances work that requires runs with random numbers, > whether runs are in a cluster computing environment or in a single > workstation. > > It is a solution for two separate problems. > > Problem 1. I scripted up 1000 R runs and need high quality, > unique, replicable random streams for each one. Each simulation > runs separately, but I need to be confident their streams are > not correlated or overlapping. For replication, I need to be able to > select any run, say 667, and restart it exactly as it was. > > Problem 2. I've written a Parallel MPI (Message Passing Interface) > routine that launches 1000 runs and I need to assure each has > a unique, replicatable, random stream. I need to be able to > select any run, say 667, and restart it exactly as it was. > > This project develops one approach to create replicable simulations. > It blends ideas about seed management from John M. Chambers > Software for Data Analysis (2008) with ideas from the snowFT > package by Hana Sevcikova and Tony R. Rossini. > > > Here's my proposal. > > 1. Run a preliminary program to generate an array of seeds > > run1: seed1.1 seed1.2 seed1.3 > run2: seed2.1 seed2.2 seed2.3 > run3: seed3.1 seed3.2 seed3.3 > ... ... ... > run1000 seed1000.1 seed1000.2 seed1000.3 > > This example provides 3 separate streams of random numbers within each > run. Because we will use the L'Ecuyer "many separate streams" > approach, we are confident that there is no correlation or overlap > between any of the runs. > > The projSeeds has to have one row per project, but it is not a huge > file. I created seeds for 2000 runs of a project that requires 2 seeds > per run. The saved size of the file 104443kb, which is very small. By > comparison, a 1400x1050 jpg image would usually be twice that size. > If you save 10,000 runs-worth of seeds, the size rises to 521,993kb, > still pretty small. > > Because the seeds are saved in a file, we are sure each > run can be replicated. We just have to teach each program > how to use the seeds. That is step two. Hi. Some of the random number generators allow as a seed a vector, not only a single number. This can simplify generating the seeds. There can be one seed for each of the 1000 runs and then, the rows of the seed matrix can be c(seed1, 1), c(seed1, 2), ... c(seed2, 1), c(seed2, 2), ... c(seed3, 1), c(seed3, 2), ... ... There could be even only one seed and the matrix can be generated as c(seed, 1, 1), c(seed, 1, 2), ... c(seed, 2, 1), c(seed, 2, 2), ... c(seed, 3, 1), c(seed, 3, 2), ... If the initialization using the vector c(seed, i, j) is done with a good quality hash function, the runs will be independent. What is your opinion on this? 
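For concreteness, with a seeding function that accepts a vector (for example setVectorSeed() from the rngSetSeed package discussed later in this thread), the scheme could look like the following sketch; the base seed value and the number of streams are only illustrative:

  library(rngSetSeed)        # assumed to be available
  seed <- 12345              # a single project-wide seed (illustrative value)
  run <- 667                 # the run to be replicated
  for (j in 1:3) {
      setVectorSeed(c(seed, run, j))   # initialize stream j of run 667
      # ... generate the random numbers for stream j of this run ...
  }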
An advantage of seeding with a vector is also that there can be significantly more initial states of the generator among which we select by the seed than 2^32, which is the maximum for a single integer seed. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Feb 17, 2012 at 09:33:33PM -0600, Paul Johnson wrote: [...] > The seed things I'm using are the 6 integer values from the L'Ecuyer. > If you run the example script, the verbose option causes some to print > out. The first 3 seeds in a saved project seeds file looks like: > > > projSeeds[[1]] > [[1]] > [1] 407 376488316 1939487821 1433925148 -1040698333 579503880 > [7] -624878918 > > [[2]] > [1] 407 -1107332181 854177397 1773099324 1774170776 -266687360 > [7] 816955059 > > [[3]] > [1] 407 936506900 -1924631332 -1380363206 2109234517 1585239833 > [7] -1559304513 > > The 407 in the first position is an integer R uses to note the type of > stream for which the seed is intended, in this case R'Lecuyer. The paralel streams uses MRG32k3a generator by P.L'Ecuyer, whose original implementation stores its internal state as double precision floating point numbers. The vector .Random.seed consists of integers. Do you know, whether a modified implementation of MRG32k3a is used, which works with integers internally, or the conversion between double and integer is done whenever the state is stored to .Random.seed? > I don't have any formal proof that a "good quality hash function" > would truly create seeds from which independent streams will be drawn. One of the suggested ways how to generate, say, 2^16 cryptographically secure random numbers, is to use a counter producing a sequence 1, 2, 3, ..., 2^16 and transform these values by a hash function. An example is Fortuna generator http://en.wikipedia.org/wiki/Fortuna_(PRNG) which is also available on CRAN as "randaes". The length of the sequence is limited, since the sequence contains no collisions. If the sequence is too long, this could allow to distinguish it from truly random numbers. After generating 2^16 numbers, the seed is recomputed and another 2^16 numbers are generated. Such a generator is a very good one. It is not used for simulations, since it is slower (say by a factor of 5) than the linear generators like Mersenne Twister, WELL or MRG family of generators. However, if it is used only for initialization, then the speed is less important. > There is, however, the proof in the L'Ecuyer paper that one can take > the long stream and divide it into sections. That's the approach I'm > taking here. Its the same approach the a parallel package in R > follows, and parallel frameworks like snow. According to http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf the sectioning of the stream to substreams is done by jump ahead algorithm. The new seed is far enough in the sequence from the previous seed, so it is guaranteed that the sequence generated from the previous seed is shorter than the jump and the subsequences do not overlap. However, the new seed is computable as a function of the previous seed. If we use this strategy to produce a matrix of seeds s_{1,1}, s_{1,2}, ... s_{2,1}, s_{2,2}, ... s_{3,1}, s_{2,2}, ... then each s_{i,j} may be computed from s_{1,1} and i, j. In this case, it is sufficient to store s_{1,1}. If we know for each run the indices i,j, then we can compute s_{i,j} by an efficient algorithm. > The different thing in my approach is that I'm saving one row of seeds > per simulation "run". So each run can be replicated exactly. > > I hope. Saving .Random.seed should be a safe strategy. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
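The jump ahead scheme discussed above is what the parallel package exposes for the "L'Ecuyer-CMRG" generator: nextRNGStream() computes the seed of the next stream from the previous one, and nextRNGSubStream() does the same for substreams, so only the first seed needs to be stored. A sketch, with an arbitrary illustrative base seed:

  library(parallel)
  RNGkind("L'Ecuyer-CMRG")
  set.seed(123)                            # illustrative base seed
  s.run1 <- .Random.seed                   # seed of run 1
  s.run2 <- nextRNGStream(s.run1)          # seed of run 2, a long jump ahead
  s.run2.sub2 <- nextRNGSubStream(s.run2)  # second substream within run 2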
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
> I've got another edition of my simulation replication framework. I'm
> attaching 2 R files and pasting in the readme.
>
> I would especially like to know if I'm doing anything that breaks
> .Random.seed or other things that R's parallel uses in the
> environment.
>
> In case you don't want to wrestle with attachments, the same files are
> online in our SVN
>
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/

Hi.

In the description of your project in the file

  http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/README

you argue as follows

  Question: Why is this better than the simple old approach of setting
  the seeds within each run with a formula like

      set.seed(2345 + 10 * run)

  Answer: That does allow replication, but it does not assure that each
  run uses non-overlapping random number streams. It offers absolutely
  no assurance whatsoever that the runs are actually non-redundant.

The following demonstrates that the function set.seed() for the default generator indeed allows to have correlated streams.

  step <- function(x) {
      x[x < 0] <- x[x < 0] + 2^32
      x <- (69069 * x + 1) %% 2^32
      x[x > 2^31] <- x[x > 2^31] - 2^32
      x
  }

  n <- 1000
  seed1 <- 124370417 # any seed
  seed2 <- step(seed1)
  set.seed(seed1)
  x <- runif(n)
  set.seed(seed2)
  y <- runif(n)
  rbind(seed1, seed2)
  table(x[-1] == y[-n])

The output is

             [,1]
  seed1 124370417
  seed2 205739774

  FALSE  TRUE
      5   994

This means that if the streams x, y are generated from the two seeds above, then y is almost exactly equal to x shifted by 1.

What is the current state of your project?

Petr Savicky.
Re: [Rd] OpenMP and random number generation
Hi. > Now that R has OpenMP facilities, I'm trying to use it for my own package but > I'm still wondering if it is safe to use random number generation within a > OpenMP block. I looked at the R writing extension document both on the > OpenMP and Random number generation but didn't find any information about > that. Package CORElearn (i am a coauthor) uses random number generation under OpenMP in C++. This requires to have a separate copy of the generator with its own memory for each thread. Do you want to use it in C or C++? Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] portable parallel seeds project: request for critiques
On Wed, Feb 22, 2012 at 12:17:25PM -0600, Paul Johnson wrote: [...] > In order for this to be easy for users, I need to put the init streams > and set current stream functions into a package, and then streamline > the process of creating the seed array. My opinion is that CRAN is > now overflowed with too many "one function" packages, I don't want to > create another one just for these two little functions, and I may roll > it into my general purpose regression package called "rockchalk". I am also preparing a solution to the problem. One is based on AES used for initialization of the R base Mersenne-Twister generator, so it only replaces set.seed() function. Another solution is based on "rlecuyer" package. I suggest to discuss the possible solutions off-list before submitting to CRAN. > One technical issue that has been raised to me is that R parallel's > implementation of the L'Ecuyer generator is based on integer valued > variables, whereas the original L'Ecuyer code uses double real > variables. But I'm trusting the R Core on this, if they code the > generator in a way that is good enough for R itself, it is good enough > for me. (And if that's wrong, we'll all find out together :) ). I do not know about any L'Ecuyer's generator in R base. You probably mean the authors of the extension packages with these generators. > Josef Leydold (the rstream package author) has argued that R's > implementation runs more slowly than it ought to. We had some > correspondence and I tracked a few threads in forums. It appears the > approach suggested there is roadblocked by some characteristics deep > down in R and the way random streams are managed. Packages have only > a limited, tenuous method to replace R's generators with their own > generators. In order to connect a user defined generator to R, there are two obligatory entry points "user_unif_rand" and "user_unif_init". The first allows to call the generator from runif() and the similar functions. The second connects the generator to set.seed() function. If there is only one extension package with a generator loaded to an R session, then these entry points are good enough. If the package provides several generators, like "randtoolbox", it is possible to change between them easily using functions provided by the package for this purpose. I think that having several packages with generators simultaneously can be good for their development, but this is not needed for their use in applications. There are also two other entry points "user_unif_nseed" and "user_unif_seedloc", which allow to support the variable ".Random.seed". A problem with this is that R converts the internal state of the generator to ".Random.seed" by reading a specific memory location, but does not alert the package about this event. So, if the state requires a transformation to integer before storing to ".Random.seed", it is not possible to do this only when needed. In the package "rngwell19937", i included some code that tries to determine, whether the user changed ".Random.seed" or not. The reason is that most of the state is integer and is stored to ".Random.seed", but the state contains also a function pointer, which is not stored. It can be recomputed from ".Random.seed" and this recomputing is done, if the package detects a change of ".Random.seed". This is not a nice solution. So in "randtoolbox" we decided not to support ".Random.seed". 
I understand that in the area of parallel computing, the work with ".Random.seed" is a good paradigm, but if the generator provides other tools for converting the state to an R object and put it back to the active state, then ".Random.seed" is not strictly necessary. > Parallel Random Number Generation in C++ and R Using RngStream > Andrew Karl · Randy Eubank · Dennis Young > http://math.la.asu.edu/~eubank/webpage/rngStreamPaper.pdf Thank you very much for this link. All the best, Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Mar 02, 2012 at 10:36:14AM +0100, Karl Forner wrote: [...] > Hello, > I would be also in favor for using multiple seeds based on (seed, > task_number) for convenience (i.e. avoiding storing the seeds) > and with the possibility of having a dynamic number of tasks, but I am mot > sure it is theoretically correct. > But I can refer you to this article: > http://www.agner.org/random/ran-instructions.pdf , section 6.1 > where the author states: > > For example, if we make 100 streams of 10^10 random numbers each from an > > SFMT > > generator with cycle length ρ = 2^11213, we have a probability of overlap > > p ≈ 10^3362. > > > > What do you think ? I am very concerned by the correctness of this approach > so would appreciate any advice on that matter. Hi. If correctness is crucial, then the literature on cryptography provides the best answers. At the current state of knowledge, it is not possible to prove that a generator is undistinguishable from truly random numbers in a mathematical sense. The problem is that if we can prove this for any generator, then the proof also implies P \not= NP. This is an open problem, so proving that any generator is correct in the sense of undistinguishability is also open. The best, what we can have, is a generator, which cannot be distinguished from truly random numbers using the known methods, although experts on cryptography tried to find such a method intensively. An example of such a generator is Fortuna generator using AES, which is available at CRAN as "randaes". http://en.wikipedia.org/wiki/Fortuna_PRNG The same can be said about initializations. Initialization is a random number generator, whose output is used as the initial state of some other generator. There is no proof that a particular initialization cannot be distinguished from truly random numbers in a mathematical sense for the same reason as above. A possible strategy is to use a cryptographically strong hash function for the initialization. This means to transform the seed to the initial state of the generator using a function, for which we have a good guarantee that it produces output, which is computationally hard to distinguish from truly random numbers. For this purpose, i suggest to use the package rngSetSeed provided currently at http://www.cs.cas.cz/~savicky/randomNumbers/ It is based on AES and Fortuna similarly as "randaes", but these components are used only for the initialization of Mersenne-Twister. When the generator is initialized, then it runs on its usual speed. In the notation of http://www.agner.org/random/ran-instructions.pdf using rngSetSeed for initialization of Mersenne-Twister is Method 4 in Section 6.1. I appreciate comments. Petr Savicky. P.S. I included some more comments on the relationship of provably good random number generators and P ?= NP question to the end of the page http://www.cs.cas.cz/~savicky/randomNumbers/ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Mar 02, 2012 at 10:36:14AM +0100, Karl Forner wrote: [...] > Hello, > I would be also in favor for using multiple seeds based on (seed, > task_number) for convenience (i.e. avoiding storing the seeds) > and with the possibility of having a dynamic number of tasks, but I am mot > sure it is theoretically correct. > But I can refer you to this article: > http://www.agner.org/random/ran-instructions.pdf , section 6.1 > where the author states: > > For example, if we make 100 streams of 10^10 random numbers each from an > > SFMT > > generator with cycle length ρ = 2^11213, we have a probability of overlap > > p ≈ 10^3362. > > > > What do you think ? I am very concerned by the correctness of this approach > so would appreciate any advice on that matter. Hi. First, let us consider a theoretical scenario, where the starting points are chosen independently and from the uniform distribution. Then the above should be OK. SFMT supports several different periods, one of them is 2^11213-1. If we choose 100 random starting points in the cycle of this length and consider intervals of length 10^10 each, then two of them overlap with probability (2 * 10^10 - 1) / (2^11213 - 1) which is approx 10^-3365. An upper bound on the probability that any of the 100 overlap is choose(100, 2) * (2 * 10^10 - 1) / (2^11213 - 1) which is still negligible and approx 10^-3362 as you say except of a typo in the sign. The crucial assumption is that we choose the starting points from the uniform distribution over the period. Every point on the period cycle is represented by a different internal state of the generator. So, uniform distribution on the period is equivalent to the uniform distribution on admissible states. An admissible state is represented by 11213 bits, which are not all 0. The probability of all 0 is extremely small. So, in order to generate one such state, it is sufficient to fill the state array with uniform i.i.d. bits. This is suitable only as a theoretical model, since this can guarantee reproducibility only if we store the whole state. In order to guarantee reproducibility with simpler means, we cannot fill the initial state with uniform i.i.d. random bits. So, we do not have a uniform distribution over the period, but we can have different approximations to it. A cryptographically strong hash function is such an approximation in the sense of computational indistinguishability. It does not and also cannot approximate the uniform distribution in the statistical sense, since the set of possible outputs is much smaller than the period. However, if someone gives us two initial states, one from the hash function and the other created from uniform i.i.d. bits, then there is no known computational method to distinguish these two in moderate CPU time. So, you can safely use the output of the hash function for simulations. In fact, the requirements of simulations are weaker. For example, MD5 cannot be used for cryptography any more, since there are known algorithms to break it. However, if you use it for a simulation, then the simulation will be biased only if it contains an algorithm, which breaks MD5. The probability that this happens just by chance is small. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
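The bound above can be checked in R on a log scale, since 2^11213 overflows double precision; a small verification of the quoted figure:

  # log10 of choose(100, 2) * (2*10^10 - 1) / (2^11213 - 1)
  log10.p <- lchoose(100, 2)/log(10) + log10(2e10 - 1) - 11213*log10(2)
  log10.p
  # approximately -3361.5, i.e. an overlap probability of the order 10^-3362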
Re: [Rd] portable parallel seeds project: request for critiques
On Fri, Mar 02, 2012 at 01:36:34PM +0100, Karl Forner wrote:
> Thanks for your quick reply.
>
> About the rngSetSeed package: is it usable at c/c++ level ?

Not directly. The rngSetSeed package is meant to provide an R-level alternative to set.seed() for Mersenne-Twister with a better guarantee that different seeds provide unrelated streams. In particular, with rngSetSeed, it should be safe to reseed quite often, for example in each iteration of a loop like

  for (i in 1:n) {
      setVectorSeed(c(base.seed, i))
      # some code
  }

Otherwise everything remains the same as with set.seed(). In order to maintain several streams, one has to maintain copies of .Random.seed and use them when required.

> Hmm I had not paid attention to the last paragraph:
>
> > The seeding procedure used in the present software uses a separate
> > random number generator of a different design in order to avoid any
> > interference. An extra feature is the RandomInitByArray function
> > which makes it possible to initialize the random number generator
> > with multiple seeds. We can make sure that the streams have different
> > starting points by using the thread id as one of the seeds.
>
> So it means that I am already using this solution! (in RcppRandomSFTM,
> see other post) and that I should be reasonably safe.

If RcppRandomSFTM uses the initialization for SFMT provided by the authors of SFMT, then you should be reasonably safe. I do not know the SFMT initializations exactly, but for the original Mersenne-Twister, the initializations from 2002, both the single number seed and the vector seed, avoid known problems with the previous initializations. It should be pointed out, however, that the initialization by an array from 2002 is more careful than the single number initialization. So, it is better to use the array initialization even for a single number seed.

Petr Savicky.
Re: [Rd] suggestion for extending ?as.factor
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote: > PD> I think that the real issue is that we actually do want almost-equal > PD> numbers to be folded together. > > yes, this now (revision 48469) will happen by default, using signif(x, 15) > where '15' is the default for the new optional argument 'digitsLabels' On some platforms, the function factor() in the current R 2.10.0 (2009-05-06 r48478) may produce duplicated levels. The examples are in general platform dependent. The following one produces duplicated (in fact triplicated) levels on both Intel default arithmetic and on Intel with SSE. x <- 9.7738826945424 + c(-1, 0, 1) * 1e-14 x <- signif(x, 15) factor(x) # [1] 9.7738826945424 9.7738826945424 9.7738826945424 # Levels: 9.7738826945424 9.7738826945424 9.7738826945424 # Warning message: # In `levels<-`(`*tmp*`, value = c("9.7738826945424", "9.7738826945424", : # duplicated levels will not be allowed in factors anymore The reason is that the three numbers remain different in signif(x, 15), but are mapped to the same string in as.character(x). length(unique(x)) # [1] 3 length(unique(as.character(x))) # 1 Further examples may be found using x <- as.character(9 + runif(5000)) x <- as.numeric(x[nchar(x)==15]) # select numbers with 14 digits x <- signif(cbind(x - 1e-14, x, x + 1e-14), 15) y <- array(as.character(x), dim=dim(x)) x <- x[which(y[,1] == y[,3]),] factor(x[1,]) Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Wed, May 06, 2009 at 10:41:58AM +0200, Martin Maechler wrote: > PD> I think that the real issue is that we actually do want almost-equal > PD> numbers to be folded together. > > yes, this now (revision 48469) will happen by default, using signif(x, 15) > where '15' is the default for the new optional argument 'digitsLabels' > {better argument name? (but must nost start with 'label')} Let me analyze the current behavior of factor(x) for numeric x with missing(levels) and missing(labels). In this situation, levels are computed as sort(unique(x)) from possibly transformed x. Then, labels are constructed by a conversion of the levels to strings. I understand the current (R 2.10.0, 2009-05-07 r48492) behavior as follows. If keepUnique is FALSE (the default), then - values x are transformed by signif(x, digitsLabels) - labels are computed using as.character(levels) - digitsLabels defaults to 15, but may be set to any integer value If keepUnique is TRUE, then - values x are preserved - labels are computed using sprintf("%.*g", digitsLabels, levels) - digitsLabels defaults to 17, but may be set to any integer value There are several situations, when this approach produces duplicated levels. Besides the one described in my previous email, there are also others factor(c(0.3, 0.1+0.2), keepUnique=TRUE, digitsLabels=15) factor(1 + 0:5 * 1e-16, digitsLabels=17) I would like to suggest a modification. It eliminates most of the cases, where we get duplicated levels. It would eliminate all such cases, if the function signif() works as expected. Unfortunately, if signif() works as it does in the current versions of R, we still get duplicated levels. The suggested modification is as follows. If keepUnique is FALSE (the default), then - values x are transformed by signif(x, digitsLabels) - labels are computed using sprintf("%.*g", digitsLabels, levels) - digitsLabels defaults to 15, but may be set to any integer value If keepUnique is TRUE, then - values x are preserved - labels are computed using sprintf("%.*g", 17, levels) - digitsLabels is ignored Arguments for the modification are the following. 1. If keepUnique is FALSE, then computing labels using as.character() leads to duplicated labels as demonstrated in my previous email. So, i suggest to use sprintf("%.*g", digitsLabels, levels) instead of as.character(). 2. If keepUnique is TRUE and we allow digitsLabels less than 17, then we get duplicated labels. So, i suggest to force digitsLabels=17, if keepUnique=TRUE. If signif(,digitsLabels) works as expected, than the above approach should not produce duplicated labels. Unfortunately, this is not the case. There are numbers, which remain different in signif(x, 16), but are mapped to the same string in sprintf("%.*g", 16, x). Examples of this kind may be found using the script for (i in 1:50) { x <- 10^runif(1, 38, 50) y <- x * (1 + 0:500 * 1e-16) y <- unique(signif(y, 16)) z <- unique(sprintf("%.16g", y)) stopifnot(length(y) == length(z)) } This script is tested on Intel default arithmetic and on Intel with SSE. Perhaps, digitsLabels = 16 could be forbidden, if keepUnique is FALSE. Unfortunately, a similar problem occurs even for digitsLabels = 15, although for much larger numbers. for (i in 1:200) { x <- 10^runif(1, 250, 300) y <- x * (1 + 0:500 * 1e-16) y <- unique(signif(y, 15)) z <- unique(sprintf("%.15g", y)) stopifnot(length(y) == length(z)) } This script finds collisions, if SSE is enabled, on two Intel computers, where i did the test. Without SSE, it finds collisions only on one of them. 
Maybe it also depends on the compiler, which is different. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Fri, May 08, 2009 at 03:18:01PM +0200, Martin Maechler wrote: > As long as we don't want to allow factor() to fail --rarely -- > I think (and that actually has been a recurring daunting thought > for quite a few days) that we probably need an > extra step of checking for duplicate levels, and if we find > some, recode "everything". This will blow up the body of the > factor() function even more. > > What alternatives do you (all R-devel readers!) see? The command f <- match(x, levels) in factor() uses the original values and not their string representations. I think that the main reason to do so is that we lose the ordering if the conversion to character is done before the levels are sorted. Let me suggest to consider the following modification, where match() is done on the strings, not on the original values. levels <- unique(as.character(sort(unique(x)))) x <- as.character(x) f <- match(x, levels) Since unique() preserves the order, we will get the levels correctly ordered. Due to using unique() twice, we will not have duplicated levels. Is it correct? Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] suggestion for extending ?as.factor
On Fri, May 08, 2009 at 05:14:48PM +0200, Petr Savicky wrote: > Let me suggest to consider the following modification, where match() is done > on the strings, not on the original values. > levels <- unique(as.character(sort(unique(x)))) > x <- as.character(x) > f <- match(x, levels) An alternative solution is ind <- order(x) x <- as.character(x) # or any other conversion to character levels <- unique(x[ind]) # get unique levels ordered by the original values f <- match(x, levels) The advantage of this over the suggestion from my previous email is that the string conversion is applied only once. The conversion need not be only as.character(). There may be other choices specified by a parameter. I have strong objections against the existing implementation of as.character(), but still i think that as.character() should be the default for factor() for the sake of consistency of the R language. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
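Wrapped into a small function, the alternative above might look as follows; this is only a sketch (the function name is made up, and `convert` stands for as.character() or any other numeric-to-string conversion):

factor.via.strings <- function(x, convert = as.character) {
  ind <- order(x)
  cx <- convert(x)
  levels <- unique(cx[ind])   # unique labels, ordered by the original values
  structure(match(cx, levels), levels = levels, class = "factor")
}
f <- factor.via.strings(c(0.3, 0.1 + 0.2, 1, 0.5))
levels(f)
# [1] "0.3" "0.5" "1"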
Re: [Rd] suggestion for extending ?as.factor
On Fri, May 08, 2009 at 06:48:40PM +0200, Martin Maechler wrote: > >>>>> "PS" == Petr Savicky > >>>>> on Fri, 8 May 2009 18:10:56 +0200 writes: [...] > PS> ... I have > PS> strong objections against the existing implementation of > as.character(), > > {(because it is not *accurate* enough, right ?)} The problem is not exactly low accuracy. The problem is unpredictable accuracy. If the accuracy is systematically 15 or 14 digits, it would be fine and suitable for most purposes. However the accuracy ranges between 14 and 20 digits and may be different on different platforms. For example, on my old Xeon comupter, the same numbers may be converted to strings representing different values: with SSE without SSE "8459184.47742229" "8459184.4774223" "84307700406756081664" "8.4307700406756e+19" "9262815.27852281" "9262815.2785228" "2.1006742758024e+19" "21006742758023974912" "7.07078598983389e+25" "7.0707859898339e+25" "8.0808066145628e+28" "8.08080661456281e+28" "9180932974.85929" "9180932974.8593" "72.4923408890729" "72.492340889073" Sometimes there are differences in trailing zeros. with SSE without SSE "1.97765325859480e+25" "1.9776532585948e+25" "21762633836.0360" "21762633836.036" "2018960238339.80" "2018960238339.8" "239567.78053486" "239567.780534860" "2571116684765.50" "2571116684765.5" "3989945.2102949" "3989945.21029490" "1.1259245205867e+23" "1.12592452058670e+23" "3.2867033904477e+29" "3.28670339044770e+29" "2.8271117654895e+29" "2.82711176548950e+29" "26854166.6173020" "26854166.617302" "4.85247217360750" "4.8524721736075" "345123.247838540" "345123.24783854" For random numbers in the sample generated as 10^runif(10, 0, 30), from which i selected the first 20 examples above, the probability of different results was almost 0.01 (978 differences among 10 numbers). I think that the platform dependence even limits the advantage of backward compatibility. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
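The varying number of significant digits produced by as.character() can also be inspected directly on a single platform; a rough sketch (the counting helper is ad hoc and only approximate for numbers whose mantissa ends in zero):

set.seed(1)
x <- 10^runif(10^4, 0, 30)
s <- as.character(x)
ndigits <- function(s) {
  m <- sub("e.*$", "", s)        # drop the exponent part, if any
  m <- gsub("[^0-9]", "", m)     # keep the digits of the mantissa
  m <- sub("^0+", "", m)         # leading zeros are not significant
  nchar(sub("0+$", "", m))       # neither are trailing zeros, roughly
}
table(ndigits(s))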
Re: [Rd] suggestion for extending ?as.factor
On Sat, May 09, 2009 at 10:55:17PM +0200, Martin Maechler wrote: [...] > If'd revert to such a solution, > we'd have to get back to Peter's point about the issue that > he'd think table(.) should be more tolerant than as.character() > about "almost equality". > For compatibility reasons, we could also return back to the > reasoning that useR should use {something like} > table(signif(x, 14)) > instead of > table(x) > for numeric x in "typical" cases. In the released versions 2.8.1 and 2.9.0, function factor() satisfies identical(as.character(factor(x)), as.character(x))(*) for all numeric x. This follows from the code (levels are computed by as.character() from unmodified input values) and may be verified even for the problematic cases, for example x <- (0.3 + 2e-16 * c(-2,-1,1,2)) factor(x) # [1] 0.300 0.3 0.3 0.300 # Levels: 0.300 0.3 0.3 0.300 as.character(x) # [1] "0.300" "0.3" "0.3" # [4] "0.300" identical(as.character(factor(x)), as.character(x)) # [1] TRUE In my opinion, it is reasonable to require that (*) be preserved also in future versions of R. Function as.character(x) has disadvantages. Besides of the platform dependence, it also does not always perform rounding needed to eliminate FP errors. Usually, as.character(x) rounds to at most 15 digits, so, we get, for example as.character(0.1 + 0.2) # [1] "0.3" as required. However, there are also exceptions, for example as.character(1e19 + 1e5) # [1] "10100352" Here, the number is printed exactly, so the resulting string contains the FP error caused by the fact that 1e19 + 1e5 has more than 53 significant digits in binary representation, namely 59. binary representation of 1e19 + 1e5 is 100010101100011100100011010010001000100111101010 binary representation of 10100352 is 100010101100011100100011010010001000100110001000 However, as.character(x) seems to do enough rounding for most purposes, otherwise it would not be suitable as the basic numeric to character conversion. If table() needs factor() with a different conversion than as.character(x), it may be done explicitly as discussed by Martin above. So, i suggest to use as.character() as the default conversion in factor(), so that identical(as.character(factor(x)), as.character(x)) is satisfied for the default usage of factor(). Of course, i appreciate, if factor() has parameters, which allow better control of the underlying conversion, as it is done in the current development versions. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
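The identity (*) and the exact printing of 1e19 + 1e5 can both be checked directly; a small sketch (the printed value assumes standard IEEE 754 double precision arithmetic):

set.seed(1)
x <- c(0.1 + 0.2, 1e19 + 1e5, runif(1000) * 10^sample(-5:20, 1000, replace = TRUE))
stopifnot(identical(as.character(factor(x)), as.character(x)))
# doubles near 1e19 are spaced 2^11 = 2048 apart, so 1e19 + 1e5 is rounded
# to the nearest multiple of 2048, which %.0f then prints exactly
sprintf("%.0f", 1e19 + 1e5)
# [1] "10000000000000100352"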
Re: [Rd] suggestion for extending ?as.factor
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote: [...] > The version I have committed a few hours ago is indeed a much > re-simplified version, using as.character(.) explicitly > and consequently no longer providing the extra optional > arguments that we have had for a couple of days. > > Keeping such a basic function factor() as simple as possible > seems a good strategy to me. OK. I understand the argument of simplicity. So, factor(x) is just a compressed encoding of as.character(x), where each value is stored only once. This sounds good to me. Let me go back to the original purpose of this thread: suggestion for extending ?as.factor I think that somewhere in the help page, we could have something like Using factor() to a numeric vector should be done with caution. The information in x is preserved to the extent to which it is preserved in as.character(x). If this leads to too many different levels due to minor differences among the input numbers, it is suggested to use something like factor(signif(x, digits)) or factor(round(x, digits)), where the number of decimal digits appropriate for a given application should be used. Let me point out that the following sentence from Warning is not exactly correct as it is in svn at the moment. So, i suggest to add the word "approximately" to the place marked with square brackets and add one more sentence of explanation marked also by square brackets. To transform a factor \code{f} to [approximately] its original numeric values, \code{as.numeric(levels(f))[f]} is recommended and slightly more efficient than \code{as.numeric(as.character(f))}. [Note that the original values may be extracted only to the precision used in as.character(x), which is typically 15 decimal digits.] Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
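The precision statement in the proposed note can be illustrated with a short round trip; a small sketch:

x <- c(1/3, 2/3, sqrt(2))
f <- factor(x)
back <- as.numeric(levels(f))[f]
max(abs(back - x) / abs(x))
# of the order of 1e-15; the values come back only to the roughly
# 15 significant digits kept by as.character(x)
levels(factor(signif(x, 3)))
# [1] "0.333" "0.667" "1.41"    # intentional coarser grouping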
Re: [Rd] suggestion for extending ?as.factor
On Mon, May 11, 2009 at 05:06:38PM +0200, Martin Maechler wrote: > The version I have committed a few hours ago is indeed a much > re-simplified version, using as.character(.) explicitly The current development version (2009-05-11 r48528) contains in ?factor the following description of the 'levels' parameter: levels: an optional vector of the values that 'x' might have taken. The default is the unique set of values taken by 'character(x)', sorted into increasing order _of 'x'_. Note that this set can be smaller than 'sort(unique(x))'. I think that 'character(x)' should be 'as.character(x)'. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] simple add error (PR#13699)
On Wed, May 13, 2009 at 02:35:12PM +0200, gos...@igmm.cnrs.fr wrote: > I cannot explain why R seems to have problems adding two big numbers. > > sprintf("%f",10^4+10^19) gives "10010240.00" > instead of "1001.00" > > problems seems to arrive when i'm trying to add a big and a small number... There are already two correct answers to your problem. If you are interested in more detail, it is as follows. The number 10^4+10^19 is in binary 100010101100011100100011010010001000101001110001 It has 60 significant binary digits. Machine representation (double precision) rounds numbers to 53 significant digits, so in binary representation, it becomes 10001010110001110010001101001000100010101000 which is 10010240 in decimal. So, the problem is not specific to R, but to floating point numbers in general. Floating point numbers with 53 digits are used for efficiency. If you need more accurate arithmetic on the cost of a remarkable slow down, use a computer algebra system. Some of them are even accessible from R, see http://wiki.r-project.org/rwiki/doku.php?id=misc:r_accuracy Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
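The same rounding can be reproduced from within R; a small sketch (the printed values assume IEEE 754 double precision):

# consecutive doubles near 10^19 are 2^11 = 2048 apart
2^(ceiling(log2(1e19)) - 53)
# [1] 2048
# 10^19 + 10^4 is therefore rounded to the nearest multiple of 2048
sprintf("%.0f", 1e19 + 1e4)
# [1] "10000000000000010240"
(1e19 + 1e4) - 1e19
# [1] 10240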
Re: [Rd] Spearman's rank correlation test (PR#13574)
Bug report "Spearman's rank correlation test (PR#13574)" was moved to trashcan with empty Notes field. I would like to learn, what was wrong with this bug report. Can i ask the developers to add a note to it? Thank you in advance. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Function factor() in the current development version (2009-05-22) guarantees that levels are different character strings. However, they may represent the same decimal number. The following example is derived from a posting by Stavros Macrakis in thread "Match .3 in a sequence" in March nums <- 0.3 + 2e-16 * c(-2,-1,1,2) f <- factor(nums) levels(f) # [1] "0.300" "0.3" The levels differ in trailing zeros, but represent the same decimal number. Besides that this is not really meaningful, it may cause a problem, when using as.numeric(levels(f)). In the above case, as.numeric() works fine and maps the two levels to the same number. However, there are cases, where the difference in trailing zeros implies different values in as.numeric(levels(f)) and these values may even form a decreasing sequence although levels were constructed from an increasing sequence of numbers. Examples are platform dependent, but may be found by the following code. Tested on Intel under Linux (both with and without SSE) and also under Windows with an older version of R. for (i in 1:10) { x <- 10^(floor(runif(1, 61, 63)) + runif(1)/2) x <- as.numeric(sprintf("%.14g", x)) eps <- 2^(floor(log2(x)) - 52) k <- round(x * c(5e-16, 1e-15) / eps) if (x > 1e62) { k <- rev( - k) } y <- x + k[1]:k[2] * eps ind <- which(diff(as.numeric(as.character(y))) < 0) for (j in ind) { u1 <- y[c(j, j+1)] u2 <- factor(u1) print(levels(u2)) print(diff(as.numeric(levels(u2 aux <- readline("next") } } An example of the output is [1] "1.2296427920313e+61" "1.22964279203130e+61" [1] -1.427248e+45 next [1] "1.82328862326830e+62" "1.8232886232683e+62" [1] -2.283596e+46 next The negative number in diff(as.numeric(levels(u2))) demonstrates cases, when as.numeric(levels(u2)) is decreasing. We can also see that the reason is that the two strings in levels(u2) differ in the trailing zeros. I did quite intensive search for such examples for all possible exponents (not only 61 and 62 and a week of CPU on three processors) and all the obtained examples were caused by a difference in trailing zeros. So, i believe that removing trailing zeros from the output of as.character(x) solves the problem with the reversed order in as.numeric(levels(factor(x))) entirely. A patch against R-devel_2009-05-22, which eliminates trailing zeros from as.character(x), but makes no other changes to as.character(x), is in an attachment. Using the patch, we obtain a better result also in the following. nums <- 0.3 + 2e-16 * c(-2,-1,1,2) factor(nums) # [1] 0.3 0.3 0.3 0.3 # Levels: 0.3 Petr. --- R-devel/src/main/coerce.c 2009-04-17 17:53:35.0 +0200 +++ R-devel-elim-trailing/src/main/coerce.c 2009-05-23 08:39:03.914774176 +0200 @@ -294,12 +294,33 @@ else return mkChar(EncodeInteger(x, w)); } +const char *elim_trailing(const char *s, char cdec) +{ +const char *p; +char *replace; +for (p = s; *p; p++) { +if (*p == cdec) { +replace = (char *) p++; +while ('0' <= *p & *p <= '9') { +if (*(p++) != '0') { +replace = (char *) p; +} +} +while (*(replace++) = *(p++)) { +; +} +break; +} +} +return s; +} + SEXP attribute_hidden StringFromReal(double x, int *warn) { int w, d, e; formatReal(&x, 1, &w, &d, &e, 0); if (ISNA(x)) return NA_STRING; -else return mkChar(EncodeReal(x, w, d, e, OutDec)); +else return mkChar(elim_trailing(EncodeReal(x, w, d, e, OutDec), OutDec)); } SEXP attribute_hidden StringFromComplex(Rcomplex x, int *warn) __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] qbinom (PR#13711)
On Wed, May 20, 2009 at 11:10:11PM +0200, wolfgang.re...@gmail.com wrote: ... > Strange behavior of qbinom: > > > qbinom(0.01, 5016279, 1e-07) > [1] 0 > > qbinom(0.01, 5016279, 2e-07) > [1] 16 > > qbinom(0.01, 5016279, 3e-07) > [1] 16 > > qbinom(0.01, 5016279, 4e-07) > [1] 16 > > qbinom(0.01, 5016279, 5e-07) > [1] 0 > There is a bug in function do_search() in file src/nmath/qbinom.c. This function contains a cycle for(;;) { if(y == 0 || (*z = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p) return y; y = fmax2(0, y - incr); } When this cycle stops, *z contains pbinom(y - incr, ...), but is used as if it is pbinom(y, ...) for the resulting y. In the example qbinom(p=0.01, size=5016279, prob=4e-07), we get at some step y=15 as a result of a search left with incr=50, so we have *z=pbinom(-35, ...)=0. Hence, y=15 is treated as too low and is increased to 16. Since 16 is detected to be sufficient, the search stops with y=16, which is wrong. A possible correction is to replace the code above by double newz; for(;;) { if(y == 0 || (newz = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p) return y; y = fmax2(0, y - incr); *z = newz; } Patch against R-devel_2009-05-22 is in an attachment. Verification may be done using the following examples. b1 <- qbinom(p=0.01, size=5016279, prob=(1:20)*1e-07) b2 <- qbinom(p=0.01, size=5006279, prob=(1:20)*1e-07) b3 <- qbinom(p=0.01, size=5000279, prob=(1:20)*1e-07) p1 <- qpois(p=0.01, lambda=5016279*(1:20)*1e-07) p2 <- qpois(p=0.01, lambda=5006279*(1:20)*1e-07) p3 <- qpois(p=0.01, lambda=5000279*(1:20)*1e-07) cbind(b1, b2, b3, p1, p2, p3) Wrong b1 b2 b3 p1 p2 p3 [1,] 0 0 0 0 0 0 [2,] 16 6 50 0 0 0 [3,] 16 6 50 0 0 0 [4,] 16 6 50 0 0 0 [5,] 0 0 0 0 0 0 [6,] 0 0 0 0 0 0 [7,] 0 0 0 0 0 0 [8,] 0 0 0 0 0 0 [9,] 0 0 0 0 0 0 [10,] 1 1 1 1 1 1 [11,] 1 1 1 1 1 1 [12,] 1 1 1 1 1 1 [13,] 1 1 1 1 1 1 [14,] 2 2 2 2 2 2 [15,] 2 2 2 2 2 2 [16,] 2 2 2 2 2 2 [17,] 19 9 53 3 3 3 [18,] 3 3 3 3 3 3 [19,] 3 3 3 3 3 3 [20,] 3 3 3 3 3 3 Correct b1 b2 b3 p1 p2 p3 [1,] 0 0 0 0 0 0 [2,] 0 0 0 0 0 0 [3,] 0 0 0 0 0 0 [4,] 0 0 0 0 0 0 [5,] 0 0 0 0 0 0 [6,] 0 0 0 0 0 0 [7,] 0 0 0 0 0 0 [8,] 0 0 0 0 0 0 [9,] 0 0 0 0 0 0 [10,] 1 1 1 1 1 1 [11,] 1 1 1 1 1 1 [12,] 1 1 1 1 1 1 [13,] 1 1 1 1 1 1 [14,] 2 2 2 2 2 2 [15,] 2 2 2 2 2 2 [16,] 2 2 2 2 2 2 [17,] 3 3 3 3 3 3 [18,] 3 3 3 3 3 3 [19,] 3 3 3 3 3 3 [20,] 3 3 3 3 3 3 Petr. --- R-devel/src/nmath/qbinom.c 2007-07-25 17:54:27.0 +0200 +++ R-devel-qbinom/src/nmath/qbinom.c 2009-05-23 17:22:43.538566976 +0200 @@ -36,6 +36,7 @@ static double do_search(double y, double *z, double p, double n, double pr, double incr) { +double newz; if(*z >= p) { /* search to the left */ #ifdef DEBUG_qbinom @@ -43,9 +44,10 @@ #endif for(;;) { if(y == 0 || - (*z = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p) + (newz = pbinom(y - incr, n, pr, /*l._t.*/TRUE, /*log_p*/FALSE)) < p) return y; y = fmax2(0, y - incr); + *z = newz; } } else { /* search to the right */ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
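Independently of the tables above, the defining property of a discrete quantile function gives a quick consistency check; a small sketch (the helper name is made up):

check.qbinom <- function(p, size, prob) {
  q <- qbinom(p, size, prob)
  # q should be the smallest integer with pbinom(q, size, prob) >= p
  all(pbinom(q, size, prob) >= p & (q == 0 | pbinom(q - 1, size, prob) < p))
}
check.qbinom(p = 0.01, size = 5016279, prob = (1:20) * 1e-07)
# FALSE with the bug (the spurious values 16 and 19 above), TRUE after the fix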
[Rd] inconsistency in ?factor
In the almost current development version (2009-05-22 r48594) and also in https://svn.r-project.org/R/trunk/src/library/base/man/factor.Rd ?factor contains (compare the formulations marked by ^^) \section{Warning}{ The interpretation of a factor depends on both the codes and the \code{"levels"} attribute. Be careful only to compare factors with the same set of levels (in the same order). ^ \section{Comparison operators and group generic methods}{ ... Only \code{==} and \code{!=} can be used for factors: a factor can only be compared to another factor with an identical set of levels (not necessarily in the same ordering) or to a character vector. In the development version 2009-05-22 r48594, the latter formulation "not necessarily in the same ordering" is correct. f1 <- factor(c("a", "b", "c", "c", "b", "a", "a"), levels=c("a", "b", "c")) f2 <- factor(c("a", "b", "c", "c", "b", "a", "c"), levels=c("c", "b", "a")) f1 == f2 # [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE The first formulation "Be careful to compare ... levels in the same order" may be just a warning against a potential problem if the levels have different order, however this is not clear. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] inconsistency in ?factor
On Mon, May 25, 2009 at 03:58:06PM +0800, Berwin A Turlach wrote: > Well, the first statement is a remark on comparison in general while > the second statement is specific to "comparison operators and generic > methods". There are other ways of comparing objects; note: > > R> f1 <- factor(c("a", "b", "c", "c", "b", "a"), levels=c("a", "b", "c")) > R> f2 <- factor(c("a", "b", "c", "c", "b", "a"), levels=c("c", "b", "a")) > R> f1==f2 > [1] TRUE TRUE TRUE TRUE TRUE TRUE > R> identical(f1,f2) > [1] FALSE > R> all.equal(f1,f2) > [1] "Attributes: < Component 2: 2 string mismatches >" I see. We have to distinguish comparison of the objects and their components. Let me propose the following formulation Two factors may be identical only if they have the same set of levels (in the same order). instead of Be careful only to compare factors with the same set of levels (in the same order). Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
On Wed, May 27, 2009 at 10:51:38PM +0200, Martin Maechler wrote: > I have very slightly modified the changes (to get rid of -Wall > warnings) and also exported the function as Rf_dropTrailing0(), > and tested the result with 'make check-all' . Thank you very much for considering the patch. -Wall indeed requires to add parentheses warning: suggest parentheses around comparison in operand of & warning: suggest parentheses around assignment used as truth value If there are also other changes, i would like to ask you to make your modification available, mainly due to a possible further discussion. Let me also suggest a modification of my original proposal. It contains a cycle while (*(replace++) = *(p++)) { ; } If the number has no trailing zeros, but contains an exponent, this cycle shifts the exponent by 0 positions, which means that it copies each of its characters to itself. This may be eliminated as follows if (replace != p) { while (*(replace++) = *(p++)) { ; } } Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
On Fri, May 29, 2009 at 03:53:02PM +0200, Martin Maechler wrote: > my version of *using* the function was > > 1 SEXP attribute_hidden StringFromReal(double x, int *warn) > 2 { > 3 int w, d, e; > 4 formatReal(&x, 1, &w, &d, &e, 0); > 5 if (ISNA(x)) return NA_STRING; > 6 else return mkChar(dropTrailing0(EncodeReal(x, w, d, e, OutDec), OutDec)); > 7 } > > where you need to consider that mkChar() expects a 'const char*' > and EncodeReal(.) returns one, and I am pretty sure this was the > main reason why Petr had used the two 'const char*' in (the > now-named) dropTrailing0() definition. Yes, the goal was to accept the output of EncodeReal() with exactly the same type, which EncodeReal() produces. A question is, whether the output type of EncodeReal() could be changed to (char *). Then, changing the output string could be done without casting const to non-const. This solution may be in conflict with the structure of the rest of R code, so i cannot evaluate, whether this is possible. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
On Sat, May 30, 2009 at 07:32:52PM +0200, Martin Maechler wrote: > > "vQ" == Wacek Kusnierczyk > > on Sat, 30 May 2009 11:16:43 +0200 writes: [...] > vQ> one simple way to improve the code is as follows; instead of > (simplified) > > vQ> const char* dropTrailing(const char* s, ...) { > vQ> const char *p = s; > vQ> char *replace; > vQ> ... > vQ> replace = (char*) p; > vQ> ... > vQ> return s; } > > vQ> ...mkChar(dropTrailing(EncodeReal(...), ...) ... > > vQ> you can have something like > > vQ> const char* dropTrailing(char* s, ...) { > vQ> char *p = s, *replace; > vQ> ... > vQ> replace = p; > vQ> ... > vQ> return s; } > > vQ> ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ... > > vQ> where it is clear, from DT's signature, that it may (as it > purposefully > vQ> does, in fact) modify the content of s. that is, you drop the > vQ> promise-not-to-modify contract in DT, and move the need for > vQ> deconstifying ER's return out of DT, making it more explicit. [...] > vQ> (3) modify petr's solution along the lines above, i.e., have the input > vQ> in the signature non-const and deconst-cast the output from ER outside > vQ> of the call to DT. > > that's what I have adopted, as I'm sure you've noticed when you > saw the code above. I appreciate the current version, which contains static const char* dropTrailing0(char *s, char cdec) ... mkChar(dropTrailing0((char *)EncodeReal(x, w, d, e, OutDec), ... Here, is better visible that the cast (char *) is used than if it was hidden inside dropTrailing0(). Also, it makes dropTrailing0() more consistent. I would like to recall the already discussed modification if (replace != p) while((*(replace++) = *(p++))) ; which saves a few instructions in the more frequent case that there are no trailing zeros. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] read.csv
On Sun, Jun 14, 2009 at 09:21:24PM +0100, Ted Harding wrote: > On 14-Jun-09 18:56:01, Gabor Grothendieck wrote: > > If read.csv's colClasses= argument is NOT used then read.csv accepts > > double quoted numerics: > > > > 1: > read.csv(stdin()) > > 0: A,B > > 1: "1",1 > > 2: "2",2 > > 3: > > A B > > 1 1 1 > > 2 2 2 > > > > However, if colClasses is used then it seems that it does not: > > > >> read.csv(stdin(), colClasses = "numeric") > > 0: A,B > > 1: "1",1 > > 2: "2",2 > > 3: > > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, > > na.strings, : > > scan() expected 'a real', got '"1"' > > > > Is this really intended? I would have expected that a csv file > > in which each field is surrounded with double quotes is acceptable > > in both cases. This may be documented as is yet seems undesirable > > from both a consistency viewpoint and the viewpoint that it should > > be possible to double quote fields in a csv file. > > Well, the default for colClasses is NA, for which ?read.csv says: > [...] > Possible values are 'NA' (when 'type.convert' is used), > [...] > and then ?type.convert says: > This is principally a helper function for 'read.table'. Given a > character vector, it attempts to convert it to logical, integer, > numeric or complex, and failing that converts it to factor unless > 'as.is = TRUE'. The first type that can accept all the non-missing > values is chosen. > > It would seem that type 'logical' won't accept integer (naively one > might expect 1 --> TRUE, but see experiment below), so the first > acceptable type for "1" is integer, and that is what happens. > So it is indeed documented (in the R[ecursive] sense of "documented" :)) > > However, presumably when colClasses is used then type.convert() is > not called, in which case R sees itself being asked to assign a > character entity to a destination which it has been told shall be > integer, and therefore, since the default for as.is is > as.is = !stringsAsFactors > but for this ?read.csv says that stringsAsFactors "is overridden > bu [sic] 'as.is' and 'colClasses', both of which allow finer > control.", so that wouldn't come to the rescue either. > > Experiment: > X <-logical(10) > class(X) > # [1] "logical" > X[1]<-1 > X > # [1] 1 0 0 0 0 0 0 0 0 0 > class(X) > # [1] "numeric" > so R has converted X from class 'logical' to class 'numeric' > on being asked to assign a number to a logical; but in this > case its hands were not tied by colClasses. > > Or am I missing something?!! In my opinion, you explain, how it happens that there is a difference in the behavior between read.csv(stdin(), colClasses = "numeric") and read.csv(stdin()) but not, why it is so. The algorithm "use the smallest type, which accepts all non-missing values" may well be applied to the input values either literally or after removing the quotes. Is there a reason, why read.csv(stdin(), colClasses = "numeric") removes quotes from the input values and read.csv(stdin()) does not? Using double-quote characters is a part of the definition of CSV file, see, for example http://en.wikipedia.org/wiki/Comma_separated_values where one may find Fields may always be enclosed within double-quote characters, whether necessary or not. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
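Until the behaviour is changed, one workaround is to read the affected columns as character, so that scan() strips the quotes, and to convert afterwards; a small sketch (the example data are made up):

txt <- 'A,B\n"1",1\n"2",2'
read.csv(textConnection(txt))                    # works, type.convert() guesses
## read.csv(textConnection(txt), colClasses = "numeric")   # fails as described above
d <- read.csv(textConnection(txt), colClasses = "character")
d[] <- lapply(d, as.numeric)
str(d)
# 'data.frame': 2 obs. of 2 variables, both numeric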
Re: [Rd] read.csv
On Sun, Jun 14, 2009 at 02:56:01PM -0400, Gabor Grothendieck wrote: > If read.csv's colClasses= argument is NOT used then read.csv accepts > double quoted numerics: > > 1: > read.csv(stdin()) > 0: A,B > 1: "1",1 > 2: "2",2 > 3: > A B > 1 1 1 > 2 2 2 > > However, if colClasses is used then it seems that it does not: > > > read.csv(stdin(), colClasses = "numeric") > 0: A,B > 1: "1",1 > 2: "2",2 > 3: > Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : > scan() expected 'a real', got '"1"' > > Is this really intended? I would have expected that a csv file in which > each field is surrounded with double quotes is acceptable in both > cases. This may be documented as is yet seems undesirable from > both a consistency viewpoint and the viewpoint that it should be > possible to double quote fields in a csv file. The problem is not specific to read.csv(). The same difference appears for read.table(). read.table(stdin()) "1" 1 2 "2" # V1 V2 # 1 1 1 # 2 2 2 but read.table(stdin(), colClasses = "numeric") "1" 1 2 "2" Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, : scan() expected 'a real', got '"1"' The error occurs in the call of scan() at line 152 in src/library/utils/R/readtable.R, which is data <- scan(file = file, what = what, sep = sep, quote = quote, ... (This is the third call of scan() in the source code of read.table()) In this call, scan() gets the types of columns in "what" argument. If the type is specified, scan() performs the conversion itself and fails, if a numeric field is quoted. If the type is not specified, the output of scan() is of type character, but with quotes eliminated, if there are some in the input file. Columns with unknown type are then converted using type.convert(), which receives the data already without quotes. The call of type.convert() is contained in a cycle for (i in (1L:cols)[do]) { data[[i]] <- if (is.na(colClasses[i])) type.convert(data[[i]], as.is = as.is[i], dec = dec, na.strings = character(0L)) ## as na.strings have already been converted to else if (colClasses[i] == "factor") as.factor(data[[i]]) else if (colClasses[i] == "Date") as.Date(data[[i]]) else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]]) else methods::as(data[[i]], colClasses[i]) } which contains also lines, which could perform conversion for columns with a specified type, but these lines are not used, since the vector "do" is defined as do <- keep & !known where "known" determines for which columns the type is known. It is possible to modify the code so that scan() is called with all types unspecified and leave the conversion to the lines else if (colClasses[i] == "factor") as.factor(data[[i]]) else if (colClasses[i] == "Date") as.Date(data[[i]]) else if (colClasses[i] == "POSIXct") as.POSIXct(data[[i]]) else methods::as(data[[i]], colClasses[i]) above. 
Since this solution is already prepared in the code, the patch is very simple --- R-devel/src/library/utils/R/readtable.R 2009-05-18 17:53:08.0 +0200 +++ R-devel-readtable/src/library/utils/R/readtable.R 2009-06-25 10:20:06.0 +0200 @@ -143,9 +143,6 @@ names(what) <- col.names colClasses[colClasses %in% c("real", "double")] <- "numeric" -known <- colClasses %in% -c("logical", "integer", "numeric", "complex", "character") -what[known] <- sapply(colClasses[known], do.call, list(0)) what[colClasses %in% "NULL"] <- list(NULL) keep <- !sapply(what, is.null) @@ -189,7 +186,7 @@ stop(gettextf("'as.is' has the wrong length %d != cols = %d", length(as.is), cols), domain = NA) -do <- keep & !known # & !as.is +do <- keep & !as.is if(rlabp) do[1L] <- FALSE # don't convert "row.names" for (i in (1L:cols)[do]) { data[[i]] <- (Also in attachment) I did a test as follows d1 <- read.table(stdin()) "1" TRUE 3.5 2 NA "0.1" NA FALSE 0.1 3 "TRUE" NA sapply(d1, typeof) #V1V2V3 # "integer" "logical" "double" is.na(d1) # V1V2V3 # [1,] FALSE FALSE FALSE # [2,] FALSE TRUE FALSE # [3,] TRUE FALSE FALSE # [4,] FALSE FALSE TRUE d2 <- read.table(stdin(), colClasses=c("integer", "logical", "double")) "1" TRUE 3.5 2 NA "0.1" NA FALSE 0.1 3 "TRUE" NA sapply(d2, typeof) #V1V2V3 # "integer" "logical" "double" is.na(d2) # V1V2V3 # [1,] FALSE FALSE FALSE # [2,] FALSE TRUE FALSE # [3,] TRUE FALSE FALSE # [4,] FALSE FALSE TRUE I think, there was a reason to let scan() to perform the type conversion, for example, it may be more efficient. So, if correct,
Re: [Rd] read.csv
I am sorry for not including the attachment mentioned in my previous email. Attached now. Petr. --- R-devel/src/library/utils/R/readtable.R 2009-05-18 17:53:08.0 +0200 +++ R-devel-readtable/src/library/utils/R/readtable.R 2009-06-25 10:20:06.0 +0200 @@ -143,9 +143,6 @@ names(what) <- col.names colClasses[colClasses %in% c("real", "double")] <- "numeric" -known <- colClasses %in% -c("logical", "integer", "numeric", "complex", "character") -what[known] <- sapply(colClasses[known], do.call, list(0)) what[colClasses %in% "NULL"] <- list(NULL) keep <- !sapply(what, is.null) @@ -189,7 +186,7 @@ stop(gettextf("'as.is' has the wrong length %d != cols = %d", length(as.is), cols), domain = NA) -do <- keep & !known # & !as.is +do <- keep & !as.is if(rlabp) do[1L] <- FALSE # don't convert "row.names" for (i in (1L:cols)[do]) { data[[i]] <- __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] factor() calls sort.list unnecessarily?
On Fri, Jul 03, 2009 at 07:57:49AM -0700, Martin Morgan wrote: [...] > sort.list is always called but used only to determine the order of > levels, so unnecessary when levels are provided. I think, this is correct. Replacing ind <- sort.list(x) by if (missing(levels)) ind <- sort.list(x) makes factor() more efficient, when levels parameter is not missing and since variable ind is not needed in this case, i think, the modification in the above form (without unique()) is correct. Parameter levels is not used between the two tests of missing(levels), so we get the same result in both cases as needed. > In addition, order of > levels is for unique values of x only. Perhaps these issues are > addressed in the patch below? It does require unique() on the original > argument x, rather than only on as.character(x) Computing the order of levels is sufficient for unique levels, however, if x is numeric, then the operation x <- as.character(x) performs rounding, due to which different, but very close numbers are mapped to the same value. So, the length of unique(x) may change. A possible solution could be to keep unique(x) for the original values and refer to it, when constructing the levels. For example, as follows. y <- unique(x) ind <- sort.list(y) y <- as.character(y) levels <- unique(y[ind]) x <- as.character(x) f <- match(x, levels) This is more efficient, if the length of unique(x) is significantly smaller then the length of x. On the other hand, if their lengths are similar, then computing as.character() on both x and y incures some slow down. > At the least, perhaps > sort.list can be called only when levels are not provided? I support this part of the patch without unique(). Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
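A rough timing sketch of the effect (the numbers will of course vary with the machine and R version; the example data are made up):

x <- sample(letters, 10^6, replace = TRUE)
system.time(for (i in 1:20) factor(x))                    # levels found via sorting
system.time(for (i in 1:20) factor(x, levels = letters))  # sorting can be skipped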
Re: [Rd] incorrect result (41/10-1/10)%%1 (PR#13863)
> I get an incorrect result for > > (41/10-1/10)%%1 > > [1] 1 Note that due to rounding errors, 41/10-1/10 is formatC(41/10-1/10, digits=20) # [1] "3.9999999999999995559" Besides FAQ 7.31, related information may be found also at http://wiki.r-project.org/rwiki/doku.php?id=misc:r_accuracy Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
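If a clean remainder is needed, rounding before the modulo operation helps; a small sketch:

(41/10 - 1/10) %% 1
# [1] 1          (actually about 0.9999999999999996, which prints as 1)
round(41/10 - 1/10, 10) %% 1
# [1] 0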
Re: [Rd] Accuracy (PR#13867)
On Tue, Aug 04, 2009 at 04:25:09PM +0200, lueth...@student.ethz.ch wrote: > Hi > > I created the following vectors: > > p_1=c(0.2,0.2,0.2,0.2,0.1,0.25,0.4,0.1,0.25,0.4,0.1,0.25,0.4,0.1,0.25,0.4,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8) > p_2=c(0,0,0,0,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.4,0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1) > > As these are probabilities, I calculated the remainder as > > p_3=1-p_1-p_2 > > There the values which ought to be 0.1 were lateron not recognised by > p_3==0.1, > but only if I used p_3 <= 0.1. > > The full calculation is actually bigger but the core issue remains: I used > values input by hand, built a difference and it was wrong. > > I know that exactly the value 0.1 is one that can not be represented using > binary rep. Maybe that's it, maybe not. Yes, this is the problem. In this case, one can obtain a correct result using round() p1 <- c(0.2,0.2,0.2,0.2,0.1,0.25,0.4,0.1,0.25,0.4,0.1,0.25,0.4,0.1, 0.25,0.4,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8,0.2,0.5,0.8) p2 <- c(0,0,0,0,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.4, 0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1,0.4,0.25,0.1) p3 <- 1 - p1 - p2 round(p3, 2) == 0.1 # [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE # [13] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE # [25] TRUE FALSE FALSE TRUE Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical(0, -0)
On Sat, Aug 08, 2009 at 10:39:04AM -0400, Prof. John C Nash wrote: > I'll save space and not include previous messages. > > My 2 cents: At the very least the documentation needs a fix. If it is > easy to do, then Ted Harding's suggestion of a switch (default OFF) to > check for sign difference would be sensible. > > I would urge inclusion in the documentation of the +0, -0 example(s) if > there is NOT a way in R to distinguish these. It is possible to distinguish 0 and -0 in R, since 1/0 == Inf and 1/(-0) == -Inf. I do not know, whether there are also other such situations. In particular (0)^(-1) == (-0)^(-1) # [1] TRUE log(0) == log(-0) # [1] TRUE > There are occasions where > it is useful to be able to detect things like this (and NaN and Inf and > -Inf etc.). They are usually not of interest to users, but sometimes are > needed for developers to check edge effects. For those cases it may be > time to consider a package FPIEEE754 or some similar name to allow > testing and possibly setting of flags for some of the fancier features. > Likely used by just a few of us in extreme situations. I think that distinguishing 0 and -0 may be useful even for nonexpert users for debugging purposes. Mainly, because x == y does not imply that x and y behave equally as demonstrated above or by x <- 0 y <- - 0 x == y # [1] TRUE 1/x == 1/y # [1] FALSE I would like to recall the suggestion On Sat, Aug 08, 2009 at 03:04:07PM +0200, Martin Maechler wrote: > Maybe we should introduce a function that's basically > isTRUE(all.equal(..., tol=0)) {but faster}, or > do you want a 3rd argument to identical, say 'method' > with default c("oneNaN", "use.==", "strict") > > oneNaN: my proposal of using memcmp() on doubles as its used for >other types already (and hence distinguishing +0 and -0; > otherwise keeping the feature that there's just one NaN > which differs from 'NA' (and there's just one 'NA'). > > use.==: the previous R behaviour, using '==' on doubles > (and the "oneNaN" behavior) > > strict: be even stricter than oneNaN: Use memcmp() > unconditionally for doubles. This would be the fastest > version of all three. In my opinion, for debugging purposes, the option identical(x,y,method="strict"), which implies that x and y behave equally, could be useful, if it is available in R base, At the R interactive level, negative zero as the value of -0 could possibly be avoided. However, negative zero may also occur in numerical calculations, since it may be obtained as x * 0, where x is negative. So, i think, negative zero cannot be eliminated from consideration as something too infrequent. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
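For debugging, the 1/x trick can be wrapped into a tiny helper; a sketch (the function name is made up):

is.negative.zero <- function(x) x == 0 & 1/x == -Inf
is.negative.zero(c(0, -0, 1 * (-0), 0 * (-3), 1, -1, NA))
# [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE    NA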
Re: [Rd] identical(0, -0)
On Mon, Aug 10, 2009 at 05:47:57AM -0400, Duncan Murdoch wrote: > I wouldn't mind a "strict" option. It would compare bit patterns, so > would distinguish +0 from -0, and different NaN values. I think that a logical option "strict" in the above meaning could be useful for debugging. The default may be FALSE. On Mon, Aug 10, 2009 at 10:20:39AM -0400, Duncan Murdoch wrote: > +0 and -0 are exactly equal, which is what identical is documented to be > testing. They are not indistinguishable, and not identical in the > English meaning of the word, but they are identical in the sense of what > the identical() function is documented to test. > > The cases where you want to distinguish between them are rare. They > should not be distinguished in the default identical() test, any more > than different values of NaN should be distinguished (and identical() is > explicitly documented *not* to distinguish those). [...] The question, whether 0 and -0 are equal or not, is not clear, since they have different reciprocals. However, i agree that distinguishing the signs of zero is rarely useful. From this point of view, the default FALSE seems to be acceptable. For completeness, let me also add an argument that it would not be too harmful, if the default is TRUE. I think that it is quite rare to have two larger numerical structures, which match up to the last bits in all numbers, but have a different sign of some zero. Matching all bits almost requires that the two structures are obtained using the same expressions for all components. Then, also the signs of zeros will match. However, i may be wrong. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical(0, -0)
On Tue, Aug 11, 2009 at 10:04:20AM +0200, Martin Maechler wrote: > > "DM" == Duncan Murdoch > > on Mon, 10 Aug 2009 11:51:53 -0400 writes: > > DM> For people who want to play with these, here are some functions that > let > DM> you get or set the "payload" value in a NaN. NaN and NA, Inf and > -Inf > DM> are stored quite similarly; these functions don't distinguish which > of > DM> those you're working with. Regular finite values give NA for the > DM> payload value, and elements of x are unchanged if you try to set > their > DM> payload to NA. > > DM> By the way, this also shows that R *can* distinguish different NaN > DM> values, but you need some byte-level manipulations. > > yes; very nice code, indeed! > > I propose a version of the showBytes() utility should be added > either as an example e.g. in writeBin() or even an exported > function in package 'utils' > > [.] > > > Example: > > >> x <- c(NA, NaN, 0, 1, Inf) > >> NaNpayload(x) > > [1] 0.5 -0.5 NA NA 0.0 > > Interestingly, on 64-bit, I get a slightly different answer above, > (when all the following code gives exactly the same results, > and of course, that was your main point !), namely > 4.338752e-13 instead of 0.5 for 'NA', > see below. > > .. and your nice tools also let me detect an even simpler way > to get *two* versions of NA, and NaN, each : > Conclusion: Both NaN and NA (well NA_real_) have a sign, too ! > > NaNpayload(NA_real_) > ##[1] 4.338752e-13 > NaNpayload(-NA_real_) > ##[1] -4.338752e-13 ## !! different > > str(NApm <- c(1[2], -1[2])) > t(sapply(NApm, showBytes)) > ## [1,] a2 07 00 00 00 00 f0* 7f > ## [2,] a2 07 00 00 00 00 f0* ff > > ## or summarizing things : > > ## Or, "in summary" -- Duncan's original example slightly extended: > x <- c(NaN, -NaN, NA, -NA_real_, 0, 0.1, Inf, -Inf) > x > names(x) <- format(x) > sapply(x, showBytes) > ## NaN NaN NA NA 0.0 0.1 Inf -Inf > ## [1,] 00 00 a2 a2 00 9a 00 00 > ## [2,] 00 00 07 07 00 99 00 00 > ## [3,] 00 00 00 00 00 99 00 00 > ## [4,] 00 00 00 00 00 99 00 00 > ## [5,] 00 00 00 00 00 99 00 00 > ## [6,] 00 00 00 00 00 99 00 00 > ## [7,] f8 f8 f8* f8* 00 b9 f0 f0 > ## [8,] ff 7f 7f ff 00 3f 7f ff > > ## (*) NOTE: the 'f0*' or 'f8*' above are > ## --- 'f8' on 32-bit, 'f0' on 64-bit > > > > >> NaNpayload(x) <- -0.4 > >> x > > [1] NaN NaN NaN NaN NaN > >> y <- x > >> NaNpayload(y) <- 0.6 > >> y > > [1] NaN NaN NaN NaN NaN > >> NaNpayload(x) > > [1] -0.4 -0.4 -0.4 -0.4 -0.4 > >> NaNpayload(y) > > [1] 0.6 0.6 0.6 0.6 0.6 > >> identical(x, y) > > [1] TRUE > The above examples convince me that the default behavior of identical() should not be based on bit patterns, since the differences between different NaN's or even different NA's are irrelevant except if we use the bit manipulations explicitly. Let me suggest the following short description in ?identical The safe and reliable way to test two objects for being equal in structure, types of components and their values. It returns 'TRUE' in this case, 'FALSE' in every other case. and replacing the paragraph 'identical' sees 'NaN' as different from 'NA_real_', but all 'NaN's are equal (and all 'NA' of the same type are equal). in ?identical by Comparison of objects of numeric type uses '==' for comparison of their components. This means that the values of the components rather than their machine representation is compared. In particular, '0' and '-0' are considered equal, all 'NA's of the same type are equal and all 'NaN's are equal, although their bit patterns may differ in some cases. 'NA' and 'NaN' are always different. 
Note also that 1/0 and 1/(-0) are different. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical(0, -0)
Let me add the following to the discussion of identical(0, -0). I would like to suggest to replace the paragraph 'identical' sees 'NaN' as different from 'NA_real_', but all 'NaN's are equal (and all 'NA' of the same type are equal). in ?identical by the following text, which is a correction of my previous suggestion for the same paragraph Components of numerical objects are compared as follows. For non-missing values, "==" is used. In particular, '0' and '-0' are considered equal. All 'NA's of the same type are equal and all 'NaN's are equal, although their bit patterns may differ in some cases. 'NA' and 'NaN' are always different. Note also that 1/0 and 1/(-0) are different. The suggestion for the default of identical(0, -0) is TRUE, because the negative zero is much less important than NA and NaN and, possibly, distinguishing 0 and -0 could even be deprecated. Moreover, the argument of efficiency of memcmp cannot be used here, since there are different variants of NaN and NA, which should not be distinguished by default. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical(0, -0)
On Wed, Aug 12, 2009 at 04:02:28PM +0200, Martin Maechler wrote: > >>>>> "PS" == Petr Savicky > >>>>> on Wed, 12 Aug 2009 13:50:46 +0200 writes: > > PS> Let me add the following to the discussion of identical(0, -0). > PS> I would like to suggest to replace the paragraph > > PS> 'identical' sees 'NaN' as different from 'NA_real_', but all > PS> 'NaN's are equal (and all 'NA' of the same type are equal). > > PS> in ?identical by the following text, which is a correction of my > previous > PS> suggestion for the same paragraph > > > Components of numerical objects are compared as follows. For non-missing > > values, "==" is used. In particular, '0' and '-0' are considered equal. > > All 'NA's of the same type are equal and all 'NaN's are equal, although > > their bit patterns may differ in some cases. 'NA' and 'NaN' are always > > different. > > Note also that 1/0 and 1/(-0) are different. > > the 'numerical' would have to be qualified ('double', 'complex' > via double), as indeed, memcmp() is used on integers > > The last sentence is not necessary and probably even confusing: > Of course, -Inf and Inf are different. I agree. > PS> The suggestion for the default of identical(0, -0) is TRUE, because > the > PS> negative zero is much less important than NA na NaN and, possibly, > PS> distinguishing 0 and -0 could even be deprecated. > > What should that mean?? R *is* using the international floating > point standards, and 0 and -0 exist there and they *are* > different! I am sorry for being too short. In my opinion, distinguishing 0 and -0 is not useful enough to make the default behavior of identical() different from the behavior of == in this case. > If R would start --- with a performance penalty, btw ! --- > to explicitly map all internal '-0' into '+0' we would > explicitly move away from the international FP standards... > no way! Yes, i agree. I did not meant this. > PS> Moreover, the argument > PS> of efficiency of memcmp cannot be used here, since there are different > PS> variants of NaN and NA, which should not be distinguished by default. > > your argument is only partly true... as memcmp() can still be > used instead of '==' *after* the NA-treatments {my current > patch does so}, OK. In this case, memcmp() could still be faster than ==, although this is beyond my knowledge. > and even more as I have been proposing an option "strict" which > would only use memcmp() {and hence also distinguish different > NA, NaN's}. I understand the previous messages in this thread as that there is an agreement that such an option would be very useful and would lead to faster comparison. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] user supplied random number generators
On Mon, Aug 17, 2009 at 12:25:57PM -0700, Ross Boylan wrote: > > Seeding by a vector is also available in the initialization of Mersenne > > Twister > > from 2002. See mt19937ar.c (ar for array) at > > http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html > > Unfortunately, seeding by a vector is not available in R base. R uses > > Mersenne Twister, but with an initialization by a single number. > I think that one could write to .Random.seed directly to set a vector > for many of the generators. ?.Random.seed does not recommend this and > notes various limits and hazards of this strategy. Initialization of a generator by writing the initial state into .Random.seed is possible, but requires to know the exact form of the admissible states of the generator. An error leads to a corrupted sequence. So, this is indeed not suggested. By seeding by a vector, i mean something different. The user provides a vector of arbitrary length and this vector is used as an input to a hash function, which produces a correct initial state. The advantage over seeding by a single number is that the set of possible initial states is much larger already if we use a seed of length 2 or 3. Seeding by a vector is available, for example, in Mersenne Twister from 2002 using the C function init_by_array() in mt19937ar.c. In the package rngwell19937, the R function set.vector.seed() may be used for the same purpose. I will reply to the rest of your message in a private email (as was also the email you were replying to). Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
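With vector seeding, a per-task stream can be derived from a base seed and the task number alone, so nothing else has to be stored. A small sketch using set.vector.seed() from rngwell19937 as mentioned above (the surrounding helper and the seed value are made up):

library(rngwell19937)
run.task <- function(task.id, base.seed = 20090817) {
  set.vector.seed(c(base.seed, task.id))
  mean(runif(10^5))          # stands in for the real simulation
}
sapply(1:4, run.task)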
Re: [Rd] identical(0, -0)
On Sat, Aug 22, 2009 at 12:00:44AM +0200, Martin Maechler wrote: > I have taken up the issue now, > and after thinking, studying the source, trying to define a > 'method = ' argument, came to the conclusion that both > the implementation and documentation (and source code "self-explanation") > are easiest to program, maintain, and understand, > if I introduce explicit binary switches, > so I now propose the following R-level interface which keeps > the current behavior the default: > > >> Usage: > >> > >> identical(x, y, num.EQ = TRUE, one.NA = TRUE, attrib.asSet = TRUE) > >> > >> Arguments: > >> > >> x, y: any R objects. > >> > >> num.EQ: logical indicating if ('double' and 'complex' non-'NA') > >> numbers should be compared using '==', or by bitwise > >> comparison. The latter (non-default) differentiates between > >> '-0' and '+0'. > >> > >> one.NA: logical indicating if there is conceptually just one numeric > >> 'NA' and one 'NaN'; 'one.NA = FALSE' differentiates bit > >> patterns. > >> > >> attrib.asSet: logical indicating if 'attributes' of 'x' and 'y' should > >> be treated as _unordered_ tagged pairlists ("sets"); this > >> currently also applies to 'slot's of S4 objects. It may well > >> be too strict to set 'attrib.asSet = FALSE'. I appreciate having several binary switches. Besides the arguments above, this is also useful for an interactive use of identical(), for example, for debugging purposes. If there is a difference between objects, then the switches allow to get more information concerning what is the type of the difference. > I'm open for better names of arguments, but will not accept "_" > in the argument names {just my taste; no reason for argueing...}. I would slightly prefere one.NaN instead of one.NA. In IEEE 754 terminology, R's 'NA's are a subset of 'NaN's. So. NaN is a bit more general notion, although in R, the sets of 'NA's an 'NaN's are disjoint. Moreover, the name one.NaN specifies more clearly, that the issue is important only for numeric types and not, for example, for integer. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Unexpected behaviour of seq(from,to,by) (PR#14057)
> # Hi there. > # I'm not sure whether or not this is a bug. No, this is not a bug. See FAQ 7.31 at http://cran.r-project.org/doc/FAQ/R-FAQ.html and the comments below. > # But it surely is an unexpected behaviour. > > V <- seq(from=0,to=1,by=0.1) > > # Should generate a sequence with a step of 0.1 > > V==0 > # [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > # Ok! > > V==0.1 > # [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > # Ok! > > V==0.6 > # [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE > # None? > > V[7] > # [1] 0.6 > V[7]==0.6 > # [1] FALSE > # Rounding!? Yes. The sequence of operations that leads to V[7] produces a slightly different rounding error than the direct conversion of 0.6 to binary representation. So, we have formatC(V[7], digits=20) # [1] "0.60000000000000008882" formatC(0.6, digits=20) # [1] "0.59999999999999997780" See http://wiki.r-project.org/rwiki/doku.php?id=misc:r_accuracy for more examples and explanations. Petr Savicky. > V[8] > # [1] 0.7 > V[8]==0.7 > # [1] FALSE > # Rounding!? > > V[9] > # [1] 0.8 > V[9]==0.8 > # [1] TRUE > # Back to normal > > # The generated sequence is fine for all values except for 0.6 > # and 0.7, which appear to be somewhat off the expected value. > > # According to the R manual the sequence is generated in the form: > # from, from+by, ..., to > # which is not the case. > > # Either the function or the documentation lead to an unexpected > # behaviour of seq(from,to,by). > > # Thanks! > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
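A hedged sketch of the comparisons that do work on such a sequence, instead of an exact '==':

V <- seq(from=0, to=1, by=0.1)
abs(V - 0.6) < 1e-9            # TRUE only at position 7: tolerance-based test
isTRUE(all.equal(V[7], 0.6))   # TRUE: all.equal() uses a relative tolerance
V[7] == 0.6                    # FALSE, as discussed above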
Re: [Rd] cor.test(method = spearman, exact = TRUE) not exact (PR#14095)
On Mon, Nov 30, 2009 at 04:00:12AM +0100, dsim...@gmail.com wrote: > > a <- c(1:10) > > b <- c(1:10) > > cor.test(a, b, method = "spearman", alternative = "greater", exact = TRUE) > > Spearman's rank correlation rho > > data: a and b > S = 0, p-value < 2.2e-16 > alternative hypothesis: true rho is greater than 0 > sample estimates: > rho > 1 > > > 1 / factorial(10) > [1] 2.755732e-07 > > Since we have perfect rank correlation and only one permutation out of 10! > could > give this for N = 10, the p-value should be 1/10!. Reading the code in > prho.c, > it appears that the "exact" calculation uses the Edgeworth approximation for > N > > 9. This makes sense because, for similar examples with N <= 9, the results > are > as expected (1 / N!). The Edgeworth approximation is not exact, although the error is quite small. Computing the exact values has time complexity roughly 2^n, if Ryser's formula for the permanent is used. In cases where one needs the exact values for small n, it is possible to use the package pspearman at CRAN, which includes precomputed values up to n=22. See also the paper M.A. van de Wiel and A. Di Bucchianico, Fast computation of the exact null distribution of Spearman's rho and Page's L statistic for samples with and without ties, J. Stat. Plann. Inf. 92 (2001), pp. 133-145, which deals with this question. library(pspearman) a <- c(1:10) b <- c(1:10) out <- spearman.test(a, b, alternative = "greater", approximation="exact") out # Spearman's rank correlation rho # data: a and b # S = 0, p-value = 2.756e-07 # alternative hypothesis: true rho is greater than 0 # sample estimates: # rho # 1 out$p.value - 1/factorial(10) # [1] 0 PS. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R CMD check may not detect a code/documentation mismatch
For the package at http://www.cs.cas.cz/~savicky/R-devel/something_0.0.0.tar.gz which is a minor part of some other package only to demonstrate the problem, i get (under R version 2.11.0 Under development 2009-12-12 r50714 and also under R-2.9.2, openSUSE 11.1 (x86_64) and CentOS release 5.2) R CMD check something_0.0.0.tar.gz ... * checking Rd files ... OK * checking Rd metadata ... OK * checking Rd cross-references ... OK * checking for missing documentation entries ... OK * checking for code/documentation mismatches ... OK * checking Rd \usage sections ... OK * checking examples ... NONE * checking PDF version of manual ... OK although the package code contains testCoreNA <- function() and the documentation contains \usage{ testCoreClass(verbose=0) testCoreAttrEval(verbose=0) testCoreReg(verbose=0) testCoreNA(verbose=0) } There is a mismatch between code and documentation of testCoreNA(). Is the problem caused by having four entries in \usage{} section? Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] R CMD check may not detect a code/documentation mismatch
On Mon, Dec 14, 2009 at 09:24:12AM +0100, Kurt Hornik wrote: > >>>>> Peter Dalgaard writes: [...] > > Hmm, looks more like a thinko in this code inside codoc(): > > > functions_in_code <- Filter(function(f) { > > f <- get(f, envir = code_env) > > is.function(f) && (length(formals(f)) > 0L) > > }, objects_in_code) > > > which, further down the line, causes functions with no formal arguments > > to be skipped when compared to the usage section. > > > Browse[2]> > > debug: ind <- (!functions %in% functions_to_be_ignored & functions %in% > > functions_in_code) > > Browse[2]> functions > > [1] "testCoreClass""testCoreAttrEval" "testCoreReg" > > "testCoreNA" > > Browse[2]> > > debug: bad_functions <- mapply(functions[ind], exprs[ind], FUN = > > function(x, > > y) check_codoc(x, as.pairlist(as.alist.call(y[-1L]))), SIMPLIFY = > > FALSE) > > Browse[2]> ind > > [1] TRUE TRUE TRUE FALSE > > > I.e. testCoreNA is never tested by check_codoc. There may of course be > > a rationale for this, but it escapes me... > > Well, I am sure I had good reasons when I wrote the code many years ago, > but of course I no longer recall what they were. > > Did you try the effect of removing the length(formals(f)) test? I understand this as the question, what is the influence of the following patch (indented by two spaces). diff --minimal -U 3 -r R-devel_2009-12-13/src/library/tools/R/QC.R R-devel_length_formals/src/library/tools/R/QC.R --- R-devel_2009-12-13/src/library/tools/R/QC.R 2009-10-27 02:59:10.0 +0100 +++ R-devel_length_formals/src/library/tools/R/QC.R 2009-12-14 18:45:55.0 +0100 @@ -398,7 +398,7 @@ functions_in_code <- Filter(function(f) { f <- get(f, envir = code_env) - is.function(f) && (length(formals(f)) > 0L) + is.function(f) }, objects_in_code) ## Sourcing all R code files in the package is a problem for base, Since i do not know this part of R code, i simply applied the patch to the current R-devel and run "make" and "make check" (both OK) and compared the output of R CMD check XML_2.6-0.tar.gz R CMD check randomForest_4.5-33.tar.gz R CMD check tree_1.0-27.tar.gz R CMD check survival_2.35-7.tar.gz with R being both original R-devel and the modified one. The results were identical in all four cases (empty diff) on two Linux machines (CentOS and openSUSE). On the other hand, for the package http://www.cs.cas.cz/~savicky/R-devel/something_0.0.0.tar.gz which demonstrates the problem, the modified R detected the code/doc mismatch. The diff of the two outputs of R CMD check was [savi...@uivtx test]$ diff original modified 35c35,42 < * checking for code/documentation mismatches ... OK --- > * checking for code/documentation mismatches ... WARNING > Codoc mismatches from documentation object 'testCore': > testCoreNA > Code: function() > Docs: function(verbose = 0) > Argument names in docs not in code: > verbose > 39a47,50 > WARNING: There was 1 warning, see > /home/savicky/tmp/test/something.Rcheck/00check.log > for details > Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
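The effect of the length(formals(f)) test can be reproduced directly at the R level; the following is only an illustration with a made-up environment, not the codoc() code itself:

code_env <- new.env()
assign("testCoreNA", function() NULL, envir = code_env)   # no formal arguments
Filter(function(f) {
  f <- get(f, envir = code_env)
  is.function(f) && (length(formals(f)) > 0L)
}, "testCoreNA")
# character(0): the zero-argument function is filtered out, so codoc() never compares it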
Re: [Rd] Accuracy (PR#14139)
On Mon, Dec 14, 2009 at 06:10:16PM +0100, ber...@lycos.com wrote: > > > pnorm(1.35,0,1) > [1] 0.911492 > > pnorm(1.36,0,1) > [1] 0.913085 > > > options(digits=4) > > > pnorm(1.35,0,1) > [1] 0.9115 > > pnorm(1.36,0,1) > [1] 0.913 rounding error? The technical explanation is as follows. If options(digits=k) is set, then the number of significant digits for printing a single number x is determined as min(k, d), where d is the minimum number of digits for which the relative error of the printed number is less than 10^-k. If we have x <- 0.913085 y <- 0.913 then the relative error of y as an approximation of x is abs(y - x)/x # [1] 9.3091e-05 Since this is less than 10^-4, the 3-digit precision is chosen for printing x. A safer way of rounding is to use the functions round() and signif(). For example, round(x, digits=4) # [1] 0.9131 I do not know the history of the R printing algorithm. It is designed primarily for printing vectors, where the rules are more complicated to achieve a good unified format for all numbers. Maybe someone else can say more about it. The above analysis may be obtained by inspecting the R source code. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
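The rule described above can be illustrated step by step; this is only a sketch of the described behaviour, not the actual C code in format.c:

x <- 0.913085; k <- 4
p <- sapply(1:k, function(l) signif(x, l))
data.frame(l = 1:k, printed = p, rel.err = abs(p - x)/x,
           below.tol = abs(p - x)/x < 10^-k)
# l = 3 is the smallest number of digits with rel.err below 10^-4,
# so print() shows 0.913 when options(digits=4) is set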
Re: [Rd] RFC: lchoose() vs lfactorial() etc
On Tue, Dec 15, 2009 at 09:49:28AM +0100, Martin Maechler wrote: > lgamma(x) and lfactorial(x) are defined to return > > ln|Gamma(x)| {= log(abs(gamma(x)))} or ln|Gamma(x+1)| respectively. > > Unfortunately, we haven't chosen the analogous definition for > lchoose(). > > So, currently > > > lchoose(1/2, 1:10) > [1] -0.6931472 -2.0794415 NaN -3.2425924 NaN -3.8869494 > [7] NaN -4.3357508 NaN -4.6805913 > Warning message: > In lchoose(n, k) : NaNs produced > > > > which (the NaN's) is not particularly useful. > (I have use case where I really have to workaround those NaNs.) > > I herebey propose to *amend* the definition of lchoose() such > that it behaves analogously to lgamma() and lfactorial(), > i.e., to return > > log(abs(choose(.,.)) > > Your comments are welcome. > Martin Maechler, ETH Zurich I do not have a strong opinion on whether lchoose() should be log(choose(.,.)) or log(abs(choose(.,.))), although i would slightly prefer the latter (with abs()). One of the reasons is that this would simplify the code of the function lchoose() in src/nmath/choose.c. It produces the NA by explicitly testing the sign, so this test could be simply removed. Let me also point out that the current implementation seems to contain a bug, which causes it to be neither log(choose(.,.)) nor log(abs(choose(.,.))). x <- choose(1/2, 1:10) y <- lchoose(1/2, 1:10) cbind(choose=x, lchoose=y, log.choose=log(abs(x))) choose lchoose log.choose # [1,] 0.500000000 -0.6931472 -0.6931472 # [2,] -0.125000000 -2.0794415 -2.0794415 # [3,] 0.062500000 NaN -2.7725887 # [4,] -0.039062500 -3.2425924 -3.2425924 # [5,] 0.027343750 NaN -3.5992673 # [6,] -0.020507812 -3.8869494 -3.8869494 # [7,] 0.016113281 NaN -4.1281114 # [8,] -0.013092041 -4.3357508 -4.3357508 # [9,] 0.010910034 NaN -4.5180723 # [10,] -0.009273529 -4.6805913 -4.6805913 So, besides x[1], we get NA in those cases when choose() is positive. The reason seems to be the test at line 89 of src/nmath/choose.c, which is if (fmod(floor(n-k+1), 2.) == 0) /* choose() < 0 */ return ML_NAN; but it should be negated. I tested it using if (fmod(floor(n-k+1), 2.) != 0) /* choose() < 0 */ return ML_NAN; when we get choose lchoose log.choose [1,] 0.500000000 -0.6931472 -0.6931472 [2,] -0.125000000 NaN -2.0794415 [3,] 0.062500000 -2.7725887 -2.7725887 [4,] -0.039062500 NaN -3.2425924 [5,] 0.027343750 -3.5992673 -3.5992673 [6,] -0.020507812 NaN -3.8869494 [7,] 0.016113281 -4.1281114 -4.1281114 [8,] -0.013092041 NaN -4.3357508 [9,] 0.010910034 -4.5180723 -4.5180723 [10,] -0.009273529 NaN -4.6805913 Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] RFC: lchoose() vs lfactorial() etc
only for integers, since choose(n, k) is a polynomial of degree k. So, for different values of k, we get a different functions on any interval of positive length. The division at the line r /= j; always produces an exact integer, since during the whole cycle, r is always either an integer binomial coefficient or its integer multiple. So, the division has an integer result and there is no rounding error unless the intermediate result does not fit into 53 bit precison. Using the above code, we get options(digits=15) n <- 5 + (-15:15)*1e-8 cbind(n, choose(n, 5)) #n # [1,] 4.9985 0.99657500042 # [2,] 4.9986 0.9968070 # [3,] 4.9987 0.99703166698 # [4,] 4.9988 0.9972627 # [5,] 4.9989 0.99748833356 # [6,] 4.9990 0.9977185 # [7,] 4.9991 0.99794500014 # [8,] 4.9992 0.9981745 # [9,] 4.9993 0.99840166677 # [10,] 4.9994 0.9986308 # [11,] 4.9995 0.9988589 # [12,] 4.9996 0.9990870 # [13,] 4.9997 0.9993152 # [14,] 4.9998 0.9995435 # [15,] 4. 0.9997717 # [16,] 5. 1.000 # [17,] 5.0001 1.0002284 # [18,] 5.0002 1.0004567 # [19,] 5.0003 1.0006851 # [20,] 5.0004 1.0009136 # [21,] 5.0005 1.00114166671 # [22,] 5.0006 1.0013706 # [23,] 5.0007 1.00159833342 # [24,] 5.0008 1.0018280 # [25,] 5.0009 1.00205500016 # [26,] 5.0010 1.0022853 # [27,] 5.0011 1.00251166690 # [28,] 5.0012 1.0027427 # [29,] 5.0013 1.00296833365 # [30,] 5.0014 1.00319666704 # [31,] 5.0015 1.00342500042 Within the chosen accuracy 1e-8, there is no visible jump. With 1e-15, we get options(digits=15) n <- 5 + (-15:15)*1e-15 cbind(n, choose(n, 5)) # n # [1,] 4.98 0.966 # [2,] 4.99 0.967 # [3,] 4.99 0.970 # [4,] 4.99 0.972 # [5,] 4.99 0.976 # [6,] 4.99 0.978 # [7,] 4.99 0.980 # [8,] 4.99 0.982 # [9,] 4.99 0.984 # [10,] 4.99 0.986 # [11,] 4.99 0.988 # [12,] 5.00 0.990 # [13,] 5.00 0.994 # [14,] 5.00 0.996 # [15,] 5.00 0.998 # [16,] 5.00 1.000 # [17,] 5.00 1.002 # [18,] 5.00 1.004 # [19,] 5.00 1.006 # [20,] 5.00 1.010 # [21,] 5.01 1.012 # [22,] 5.01 1.014 # [23,] 5.01 1.016 # [24,] 5.01 1.018 # [25,] 5.01 1.020 # [26,] 5.01 1.022 # [27,] 5.01 1.024 # [28,] 5.01 1.028 # [29,] 5.01 1.031 # [30,] 5.01 1.032 # [31,] 5.02 1.035 So, the jump in the exactly integer value n[16] == 5 is comparable to machine accuracy. I would be pleased to provide more tests, if some of the above solutions or their modifications can be accepted. Petr Savicky. --- R-devel_2009-12-16/src/nmath/choose.c 2008-03-17 17:52:35.0 +0100 +++ R-devel_less_conservative/src/nmath/choose.c2009-12-17 22:43:54.0 +0100 @@ -107,19 +107,20 @@ #endif if (fabs(k - k0) > 1e-7) MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, k); if (k < k_small_max) { int j; - if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */ + if(n-k < k && n >= 0 && (n == floor(n))) k = n-k; /* <- Symmetry */ if (k < 0) return 0.; if (k == 0) return 1.; /* else: k >= 1 */ r = n; - for(j = 2; j <= k; j++) - r *= (n-j+1)/j; - return R_IS_INT(n) ? floor(r + 0.5) : r; - /* might have got rounding errors */ + for(j = 2; j <= k; j++) { + r *= (n-j+1); + r /= j; + } + return r; } /* else: k >= k_small_max */ if (n < 0) { r = choose(-n+ k-1, k); if (ODD(k)) r = -r; __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
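The division step discussed above can be transcribed into R; this is only a sketch mirroring the patched C loop (R of course already provides choose()):

my_choose_small_k <- function(n, k) {
  if (k == 0) return(1)
  r <- n
  for (j in seq_len(k)[-1]) {      # j = 2, ..., k
    r <- r * (n - j + 1)           # equals j * choose(n, j), an integer for integer n
    r <- r / j                     # equals choose(n, j): exact while it fits into 53 bits
  }
  r
}
my_choose_small_k(10, 3)    # 120
my_choose_small_k(1/2, 3)   # 0.0625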
Re: [Rd] choose(n,k) when n is almost integer
the machine, which i have now available. I will perform the whole test later. Both patches and also the original implementation sometimes generate a warning of the following type > choose(19 - 2e-7, 30) [1] -3.328339e-16 Warning message: In choose(19 - 2e-07, 30) : full precision may not have been achieved in 'lgamma' These warnings are generated in functions called from choose() and the intention of this patch is not to change this. Most of these warnings are eliminated in the original implementation by treating numbers, which differ from an integer by less than 1e-7, as integers. However, there are some such cases even if the difference is slightly larger than 1e-7 like the above 2e-7. The cases, when the above warning is generated, are the same for both patches A and B in the tests, which i did. Patch B is significantly more accurate in cases, when the result is more than 1. In order to verify these two claims, i used the test script at http://www.cs.cas.cz/~savicky/R-devel/test_choose_0.R Its outputs with A and B are presented below. The list of cases with warning Cases with warning [,1] [,2] [1,] 6715 31 [2,] 152 33 [3,] 152 34 [4,] 6393 37 [5,] 9388 39 [6,] 6059 40 [7,] 8773 40 [8,] 5058 44 [9,] 5531 44 [10,] 6670 46 [11,] 652 47 [12,] 2172 49 [13,] 3371 49 [14,] 1913 50 is the same in all four tests. So, i only include the tables of errors, which are different. The table contains columns "bound" and "error". The error is the maximum error achieved in cases, where the compared values are at most the bound on the same row and more than the bounds on the previous rows. Patch A on Intel extended arithmetic minimum log2() of a wrong result for integer n : 53.95423 maximum error for real n : bounderror [1,] 0e+00 1.206026e-08 [2,] 1e-20 2.699943e-08 [3,] 1e-15 3.444711e-08 [4,] 1e-10 2.806709e-08 [5,] 1e-05 8.395916e-09 [6,] 1e+00 3.695634e-07 [7,] Inf 2.990987e-07 Patch A on SSE arithmetic minimum log2() of a wrong result for integer n : 49.94785 maximum error for real n : bounderror [1,] 0e+00 1.206026e-08 [2,] 1e-20 2.699943e-08 [3,] 1e-15 3.444709e-08 [4,] 1e-10 2.806707e-08 [5,] 1e-05 8.395951e-09 [6,] 1e+00 3.695634e-07 [7,] Inf 2.990987e-07 Patch B on Intel extended arithmetic minimum log2() of a wrong result for integer n : 53.95423 maximum error for real n : bounderror [1,] 0e+00 2.047988e-09 [2,] 1e-20 2.699943e-08 [3,] 1e-15 3.444711e-08 [4,] 1e-10 2.806709e-08 [5,] 1e-05 8.395916e-09 [6,] 1e+00 1.207856e-11 [7,] Inf 1.509903e-14 Patch B on SSE arithmetic minimum log2() of a wrong result for integer n : 49.94785 maximum error for real n : bounderror [1,] 0e+00 2.047988e-09 [2,] 1e-20 2.699943e-08 [3,] 1e-15 3.444709e-08 [4,] 1e-10 2.806707e-08 [5,] 1e-05 8.395951e-09 [6,] 1e+00 1.205369e-11 [7,] Inf 2.731149e-14 We can see that B has smaller error, if the output is more than 1. For integer n, we can see that the smallest output values, where we do not get the exact result is the same for A and B. There is a difference between Intel extended arithmetic and SSE for clear reasons. I suggest to use B and i appreciate comments. Petr Savicky. 
--- R-devel-orig-sse/src/nmath/choose.c 2009-12-17 17:52:39.0 +0100 +++ R-devel-work-copy-1-sse/src/nmath/choose.c 2009-12-22 09:06:32.0 +0100 @@ -109,6 +109,7 @@ MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, k); if (k < k_small_max) { int j; + if (R_IS_INT(n)) n = floor(n + 0.5); if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */ if (k < 0) return 0.; if (k == 0) return 1.; @@ -126,6 +127,7 @@ return r; } else if (R_IS_INT(n)) { + n = floor(n + 0.5); if(n < k) return 0.; if(n - k < k_small_max) return choose(n, n-k); /* <- Symmetry */ return floor(exp(lfastchoose(n, k)) + 0.5); --- R-devel-orig-sse/src/nmath/choose.c 2009-12-17 17:52:39.0 +0100 +++ R-devel-work-copy-2-sse/src/nmath/choose.c 2009-12-22 13:49:33.0 +0100 @@ -93,13 +93,14 @@ return lfastchoose(n, k); } +#define IS_INT(x)((x) == floor(x)) #define k_small_max 30 /* 30 is somewhat arbitrary: it is on the *safe* side: * both speed and precision are clearly improved for k < 30. */ double choose(double n, double k) { -double r, k0 = k; +double r, k0 = k, n1; k = floor(k + 0.5); #ifdef IEEE_754 /* NaNs propagated correctly */ @@ -109,14 +110,14 @@ MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, k); if (k < k_small_max) { int j; - if(n-k < k && n >= 0 &&am
Re: [Rd] choose(n,k) when n is almost integer
On Sat, Dec 19, 2009 at 10:02:49PM +0100, Martin Maechler wrote: [...] > Have you tried 'make check-devel' (or 'check-all') also with the > progressive change? The patch-B.txt from my previous email passed "make check-all" with the exception of the Tcl/Tk support, which is missing on my computer. The test was run both with Intel extended and SSE arithmetic. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] choose(n,k) when n is almost integer
In a previous email, i suggested two patches A and B to choose(n, k), which solve some of its problems, but keep some of the inaccuracies of the original implementation. I would like to suggest another patch, which i will call C (patch-C.txt in an attachment), which eliminates the warnings obtained sometimes from the original implementation and which is more accurate in all ranges of the output. For testing patch C a simpler script is sufficient, since we need not to take care of the warnings. Namely http://www.cs.cas.cz/~savicky/R-devel/test_choose_1.R which produces the output (the error is the relative error) > source("test_choose_1.R") k <= 0 max err = 0 k <= 10 max err = 1.332268e-15 k <= 20 max err = 2.442491e-15 k <= 30 max err = 3.774758e-15 k <= 40 max err = 2.553513e-14 k <= 50 max err = 2.88658e-14 k <= 60 max err = 3.197442e-14 k <= 70 max err = 4.396483e-14 k <= 80 max err = 4.685141e-14 k <= 90 max err = 4.907186e-14 k <= 100 max err = 5.084821e-14 k <= 110 max err = 5.373479e-14 k <= 120 max err = 5.551115e-14 k <= 130 max err = 7.41629e-14 k <= 140 max err = 9.592327e-14 k <= 150 max err = 9.636736e-14 k <= 160 max err = 9.725554e-14 k <= 170 max err = 9.947598e-14 k <= 180 max err = 1.04583e-13 k <= 190 max err = 1.088019e-13 k <= 200 max err = 1.090239e-13 minimum log2() of a wrong result for integer n : 53.32468 maximum error for real n : 1.090239e-13 Increasing accuracy of choose(n, k) for n almost an integer needed to use additional transformations of it to those already used in the code. I will work out a description of these transformations and send a link to it. Similarly as patches A and B, patch C also does not modify lchoose(). It should be pointed out that choose(n, k) for non-integer n is mainly needed if n is a rational number like 1/2, 1/3, 2/3, However, making choose(n, k) accurate for all inputs seems to be not too complex as the patch C and its test results show. I appreciate comments on patch C. Petr Savicky. --- R-devel-orig-intel/src/nmath/choose.c 2009-12-17 17:52:39.0 +0100 +++ R-devel-work-copy-3-intel/src/nmath/choose.c2009-12-23 21:59:40.0 +0100 @@ -93,13 +93,14 @@ return lfastchoose(n, k); } +#define IS_INT(x)((x) == floor(x)) #define k_small_max 30 /* 30 is somewhat arbitrary: it is on the *safe* side: * both speed and precision are clearly improved for k < 30. */ double choose(double n, double k) { -double r, k0 = k; +double r, k0 = k, l, aux; k = floor(k + 0.5); #ifdef IEEE_754 /* NaNs propagated correctly */ @@ -109,36 +110,37 @@ MATHLIB_WARNING2(_("'k' (%.2f) must be integer, rounded to %.0f"), k0, k); if (k < k_small_max) { int j; - if(n-k < k && n >= 0 && R_IS_INT(n)) k = n-k; /* <- Symmetry */ + if(n-k < k && n >= 0 && IS_INT(n)) k = n-k; /* <- Symmetry */ if (k < 0) return 0.; if (k == 0) return 1.; /* else: k >= 1 */ r = n; for(j = 2; j <= k; j++) - r *= (n-j+1)/j; - return R_IS_INT(n) ? floor(r + 0.5) : r; + r *= (n-(j-1))/j; + return IS_INT(n) ? 
floor(r + 0.5) : r; /* might have got rounding errors */ } /* else: k >= k_small_max */ if (n < 0) { - r = choose(-n+ k-1, k); - if (ODD(k)) r = -r; + r = n / k * choose(k - 1.0 - n, k - 1.0); + if (ODD(k - 1.0)) r = -r; return r; } -else if (R_IS_INT(n)) { +else if (IS_INT(n)) { if(n < k) return 0.; if(n - k < k_small_max) return choose(n, n-k); /* <- Symmetry */ return floor(exp(lfastchoose(n, k)) + 0.5); } /* else non-integer n >= 0 : */ -if (n < k-1) { - int s_choose; - r = lfastchoose2(n, k, /* -> */ &s_choose); - return s_choose * exp(r); +l = floor(n + 0.5); +if (l <= k-1) { + aux = lfastchoose(n, l) + lfastchoose(k - 1.0 - n, k - l - 1.0) - lfastchoose(k, l); + return exp(aux) * (n - l) / (k - l) * (ODD(k - l) ? 1.0 : - 1.0); } return exp(lfastchoose(n, k)); } #undef ODD +#undef IS_INT #undef R_IS_INT #undef k_small_max __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] seq.int broken (seq as well) (PR#14169)
On Mon, Dec 28, 2009 at 08:50:13PM +0100, oehl_l...@gmx.de wrote: [...] > # fine as expected from help page: > # "from+by, ..., up to the sequence value less than or equal to to" > # thus 1+10=11 is not in > > seq.int(1L, 10L, by=10L) > [1] 1 > > # of course 1+1e7 should also not be in > # but is: wrong > > seq.int(1L, 1e7L, by=1e7L) > [1] 1e+00 1e+07 > > # since we use seq.int AND are within integer range, rounding should not be an > issue In my opinion, this is a documented behavior. The Details section of the help page says Note that the computed final value can go just beyond 'to' to allow for rounding error, but (as from R 2.9.0) is truncated to 'to'. Since "by" is 1e7, overshooting 'to' by 1 is "just beyond 'to'". What can be a bit misleading is the following difference between the type of seq() and seq.int(), which is only partially documented. x <- seq.int(from=1L, to=10000000L, by=10000000L); typeof(x); x # [1] "double" # [1] 1e+00 1e+07 x <- seq(from=1L, to=10000000L, by=10000000L); typeof(x); x # [1] "integer" # [1] 1 10000000 The Value section of the help page says: Currently, 'seq.int' and the default method of 'seq' return a result of type '"integer"' (if it is representable as that and) if 'from' is (numerically equal to an) integer and, e.g., only 'to' is specified, or also if only 'length' or only 'along.with' is specified. *Note:* this may change in the future and programmers should not rely on it. This suggests that we should get "double" in both cases, since all three arguments "from", "to", and "by" are specified. I do not know whether having an "integer" result in seq() in the above case is intended or not. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
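The documented fuzz can be seen in a small example; this is only an illustration of the quoted Details paragraph:

s <- seq(from=0.1, to=0.7, by=0.2)
s             # 0.1 0.3 0.5 0.7
s[4] == 0.7   # TRUE: 0.1 + 3*0.2 overshoots 0.7 by a rounding error and is truncated back to 'to'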
[Rd] missing R-devel/po
When i unpack R-devel_2010-01-05.tar.bz2 and run ./configure on two Linux machines, i get the error message configure: creating ./config.status config.status: creating Makeconf config.status: creating Makefile config.status: creating doc/Makefile config.status: creating doc/html/Makefile config.status: creating doc/manual/Makefile config.status: creating etc/Makefile config.status: creating etc/Makeconf config.status: creating etc/Renviron config.status: creating etc/ldpaths config.status: creating m4/Makefile config.status: error: cannot find input file: po/Makefile.in.in For R-devel_2010-01-04.tar.bz2, ./configure runs fine. The reason could be that R-devel_2010-01-05.tar.bz2 does not contain the directory "R-devel/po". This directory is present in R-devel_2010-01-04.tar.bz2 and contains the file "R-devel/po/Makefile.in.in". I see the directory "R-devel/po" at https://svn.r-project.org/R/trunk/, so the problem could be just temporary. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] choose(n,k) when n is almost integer
I would like to add some more information concerning the patch C to the function choose() proposed in the email https://stat.ethz.ch/pipermail/r-devel/2009-December/056177.html The patch uses transformations of choose(n, k), which are described in http://www.cs.cas.cz/~savicky/R-devel/formulas_choose.pdf The accuracy of the modified function choose(n, k) may be verified on randomly generated examples using Rmpfr package http://cran.at.r-project.org/web/packages/Rmpfr/index.html and the script http://www.cs.cas.cz/~savicky/R-devel/test_choose_2.R The output, which i obtained, is > source("test_choose_2.R") k <= 9 max rel err = 9.41734e-16 k <= 19 max rel err = 2.084412e-15 k <= 29 max rel err = 3.170754e-15 k <= 39 max rel err = 4.99284e-14 k <= 49 max rel err = 5.927749e-14 k <= 59 max rel err = 6.526895e-14 k <= 69 max rel err = 6.526895e-14 k <= 79 max rel err = 8.783232e-14 k <= 89 max rel err = 1.051950e-13 k <= 99 max rel err = 1.051950e-13 k <= 109 max rel err = 1.072878e-13 k <= 119 max rel err = 1.072878e-13 k <= 129 max rel err = 1.179829e-13 k <= 139 max rel err = 1.247080e-13 k <= 149 max rel err = 1.247080e-13 k <= 159 max rel err = 1.255064e-13 k <= 169 max rel err = 1.255064e-13 k <= 179 max rel err = 1.267402e-13 k <= 189 max rel err = 1.311689e-13 k <= 199 max rel err = 1.573155e-13 k <= 209 max rel err = 1.844756e-13 Patch C also passes make check-all in the current development version 2.11.0 (2010-02-01). I appreciate comments. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] choose(n,k) when n is almost integer
On Tue, Feb 02, 2010 at 01:37:46PM +0100, Petr Savicky wrote: > I would like to add some more information concerning the patch C > to the function choose() proposed in the email > https://stat.ethz.ch/pipermail/r-devel/2009-December/056177.html > > The patch uses transformations of choose(n, k), which are described in > http://www.cs.cas.cz/~savicky/R-devel/formulas_choose.pdf > > The accuracy of the modified function choose(n, k) may be verified > on randomly generated examples using Rmpfr package > http://cran.at.r-project.org/web/packages/Rmpfr/index.html > and the script > http://www.cs.cas.cz/~savicky/R-devel/test_choose_2.R Let me add an explanation of the script test_choose_2.R. The script generates a vector of random real numbers n, which are from a continuous distribution, but concentrate near integer values. The original implementation of choose(m, k) considers m as an integer, if abs(m - round(m)) <= 1e-7. The vector n is generated so that the probability of abs(n[i] - round(n[i])) <= 1e-7 is approximately 0.1. The distribution of n[i] - round(n[i]) is symmetric around 0, so we get both n[i], which are close to an integer from below and from above. On the other hand, the probability of abs(n[i] - round(n[i])) >= 0.3 is approximately 0.1083404, so there are also numbers n[i], which are not close to an integer. The script calculates choose(n, k) for k in 0:209 (an ad hoc upper bound) 1. using the modified choose(n, k) from patch C 2. using the expression n(n-1)...(n-k+1)/(1 2 ... k) with Rmpfr numbers of precision 100 bits. The relative difference of these two results is computed and its maximum over all n[i] and k from 0 to a given bound is reported. The bounds on k are chosen to be the numbers, whose last digit is 9, since the algorithm in choose(n,k) is different for k <= 29 and k >= 30. An upper bound of the relative rounding error of a single operation with Rmpfr numbers of precision 100 bits is (1 + 2^-100). Hence, an upper bound on the total relative error of n(n-1)...(n-k+1)/(1 2 ... k) is (1 + 2^-100)^(2*209) \approx (1 + 2 * 209 * 2^-100) \approx 1 + 3.297e-28. This is a negligible error compared to the errors reported by test_choose_2.R, so the reported errors are the errors of the patched choose(n, k). The errors reported by test_choose_2.R with patched choose(n,k) are in a previous email. Running test_choose_2.R with unpatched R version 2.11.0 (2010-02-01 r51089) produces larger errors. > source("test_choose_2.R") k <= 9 max rel err = Inf k <= 19 max rel err = Inf > source("test_choose_2.R") k <= 9 max rel err = 0.111 k <= 19 max rel err = Inf > source("test_choose_2.R") k <= 9 max rel err = Inf k <= 19 max rel err = Inf > source("test_choose_2.R") k <= 9 max rel err = 8.383718e-08 k <= 19 max rel err = 1.226306e-07 k <= 29 max rel err = 1.469050e-07 Error: segfault from C stack overflow The Inf relative errors occur in cases, where choose(n, k) calculates 0, but the correct result is not 0. The stack overflow error is sometimes generated due to an infinite sequence of transformations choose(n, k) -> choose(n, n-k) -> choose(n, round(n-k)) which occur if k = 30 and n = 60 - eps. The reason for the transformation choose(n, k) -> choose(n, n-k) is that k >= k_small_max = 30 n is treated as an integer in R_IS_INT(n) n-k < k_small_max So, choose(n, n-k) is called. There, we determine that n-k is almost an integer and since n-k is the second argument of choose(n,k), it is explicitly rounded to an integer. Since n = 2*k - eps, we have round(n-k) = round(k - eps) = k. 
The result is that we again call choose(n, k) and this repeats until the stack overflow. For example > choose(60 - 1e-9, 30) Error: segfault from C stack overflow Besides patch C, which corrects this stack overflow, also the previous patches A, B from https://stat.ethz.ch/pipermail/r-devel/2009-December/056154.html correct this, but have lower accuracy. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Associative array?
On Thu, Mar 11, 2010 at 06:09:09PM -0600, Ben wrote: [...] > > I don't quite understand - characters are (after raw vectors) the > > most expressive data type, so I'm not quite sure why that would be a > > limitation .. You can cast anything (but raw vector with nulls) into > > to a character. > > It's no big problem, it's just that if the solution is to convert to > character type, then there are some implementation details to worry > about. For instance, I assume that as.character(x) is a reversible > 1-1 mapping if x is an integer (and not NA or NULL, etc). But > apparently that isn't exactly true for floats, and it would get more > complicated for other data types. Let me add a comment on the question of an invertible conversion of double values to character type. The function as.character() intentionally does not keep the exact value and performs some rounding. However, conversion of a double x to character type, which preserves the number exactly, may be obtained, for example, using formatC(x, digits=17, width=-1) Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
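A small sketch tying this to the associative array question: the exact 17-digit string can serve as a key that maps back to the original double (new.env(hash = TRUE) below is just one possible container):

h <- new.env(hash = TRUE)
x <- 1/3 + 1e-10
key <- formatC(x, digits = 17, width = -1)
assign(key, "some value", envir = h)
as.double(key) == x     # TRUE: 17 significant digits reproduce the double on common platforms
get(key, envir = h)     # "some value"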
Re: [Rd] .Call and .C arguments
On Mon, Mar 29, 2010 at 01:56:14PM +0200, Roger Bergande wrote: ... > The passed values are not the same in C. I 'm calling a function in C > with the argument as.double(1204.245) but in the debug mode in C the > value has changed to 1204.2449. I do not see this value printed in the examples below. So, i think you assume that 1204.245 was changed to 1204.2449 since rounding it to 2 digits produces 1204.24. The problem is indeed a rounding problem, but not when the number is passed to C. The number 1204.245 cannot be represented in double precision exactly. So, already at the R level, we actually have formatC(1204.245, digits=20) # [1] "1204.2449999999998909" See http://rwiki.sciviews.org/doku.php?id=misc:r_accuracy or FAQ 7.31 for more examples. Petr Savicky. > > Is there a way to pass the arguments differently? > > > > I'm using Windows and Visual Studio C++ 2005 Express Edition and > R-2.10.1. > > > > > > Please see the two simple examples to understand the issue. > > > > # C call from R > > .C("myroundC",as.double(1204.245)) > > > > > > // C Code > > > > void myroundC(double *Amount){ > > > > *Amount = Rf_fround(*Amount,2); > > > > } > > > > #Return value in R > > [[1]] > > [1] 1204.24 > > > > > > > > # C call from R > > .Call("myroundCall",as.double(1204.245)) > > > > // C Code > > SEXP myroundCall(SEXP a){ > > double *ap = REAL(a), *ansp; > > SEXP ans; > > PROTECT(ans = allocVector(REALSXP, 1)); > > ansp = REAL(ans); > > *ansp = Rf_fround(*ap,2); > > UNPROTECT(1); > > return(ans); > > } > > > > #Return value in R > > [1] 1204.24 > > > > # expected value 1204.25 > > > > > > Thanks a lot for your help. > > Best regards > > Roger Bergande > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] pure R code package for Windows
Dear R developers, I am using R under Linux, but I would like to share an extension package with some Windows users. The package contains only data and .R scripts. There is no src directory. So, I think I do not need a Windows machine with C compiler, "make", "sh" and "perl". If I am wrong, please tell me. I tried the following approaches (after verifying the package using R CMD check - 1 warning concerning the missing documentation for some of the R-functions.) 1. Installing the source package (with no C, C++ or F files) directly on Windows XP. Installation complains that "make" command is mising. OK, it is a source package. 2. Building binary package using R CMD build --binary --use-zip on Linux and try to install it under Windows XP. Installation complains that "make" command is missing. (Why, if it is a binary package?). 3. Build the package from source on Windows XP using R CMD build . Installation complains that "sh" is missing. (Why is it looking for "sh", if it is a properly working R installation under Windows?) 4. Install the package under Linux and zip the directory library/ and unzip it in the library directory on Windows machine. This works. The package behaves correctly. However, I do not think that this is a suggested method. Could you help me? Thank you in advance. Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] pure R code package for Windows
> You can also use the Windows binary build and check service at > http://129.217.207.166 after reading its disclaimer... Thank you very much. This worked including generating .chm file. The resulting package is installable under R GUI on Windows XP, which is sufficient for me. Let me point out that I did not succeed to install any package (including the standard ones like rpart) on Windows using R CMD INSTALL , although I have Rtools and perl installed, running and in the path. The error message for windows binary package rpart from CRAN is: Error in Rdinfo(RdFiles[i]) : missing/empty \name field in 'E:/R/rpart/man/rpart.Rd.gz Rd files must have a non-empty \name. > >4. Install the package under Linux and zip the directory > > library/ and unzip it in the library directory > > on Windows machine. This works. The package behaves > > correctly. However, I do not think that this is a suggested > > method. > > Well, it _is_ a suggested method: see README.packages. This also produced a package installable by the standard procedure under Windows GUI. Thank you for pointing this out. (Again, R CMD INSTALL does not work for the resulting package.) All the best, Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] grep with fixed=TRUE and ignore.case=TRUE
Dear R developers, I suggest to modify the behaviour of "grep" function with fixed=TRUE option. Currently, fixed=TRUE implies ignore.case=FALSE (overrides ignore.case=TRUE, if set by the user). I suggest to keep ignore.case as set by the user even if fixed=TRUE. Since the default of ignore.case is FALSE, this would not change the behaviour of grep, if the user does not set ignore.case explicitly. In my opinion, fixed=TRUE is most useful for suppressing meta-character expansion. On the other hand, for a simple word search, ignoring case is sometimes useful. If for some reason, it is better to keep the current behavior of grep, then I suggest to extend the documentation as follows: ORIGINAL: fixed: logical. If 'TRUE', 'pattern' is a string to be matched as is. Overrides all conflicting arguments. SUGGESTED: fixed: logical. If 'TRUE', 'pattern' is a string to be matched as is. Overrides all conflicting arguments including ignore.case. All the best, Petr Savicky. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
On Wed, May 09, 2007 at 06:41:23AM +0100, Prof Brian Ripley wrote: > I suggest you collaborate with the person who replied that he thought this > was a good idea to supply patches against the R-devel sources for > scrutiny. A possible solution is to use strncasecmp instead of strncmp in function fgrep_one in R-devel/src/main/character.c. Corresponding modification of character.c is at http://www.cs.cas.cz/~savicky/ignore_case/character.c and the diff file w.r.t. the original character.c (downloaded today) is at http://www.cs.cas.cz/~savicky/ignore_case/diff.txt This seems to work in my installation of R-devel: > x <- c("D.G cat", "d.g cat", "dog cat") > z <- "d.g" > grep(z, x, ignore.case = F, fixed = T) [1] 2 > grep(z, x, ignore.case = T, fixed = T) # this is the new behavior [1] 1 2 > grep(z, x, ignore.case = T, fixed = F) [1] 1 2 3 > Since fgrep_one is used many times in character.c, adding igcase_opt as an additional argument would imply extensive changes to the file. So, I introduced a new function fgrep_one_igcase called only once in the file. Another solution is possible. I do not understand the handling of multibyte chars well, so I did not test the function with real multibyte chars, although the code for this option is used. The ignore.case option is not meaningful in gsub. It could be meaningful in regexpr; however, this function does not allow an ignore.case option, so I made no changes to it. All the best, Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Problem with rnorm ?
> #include > #include > > int main(void){ > double x[3]; > for(int i=0;i<3;i++) x[i]=rnorm(0,1); > printf("%lf,%lf,%lf",x[0],x[1],x[2]); > return 0; > } > > output : -8.773321,-8.773321,-8.773321 You probably have to call GetRNGstate() before rnorm and PutRNGstate() after. At least for extension packages, this is necessary, see section 6.3 in R-exts.pdf. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] grep with fixed=TRUE and ignore.case=TRUE
> strncasecmp is not standard C (not even C99), but R does have a substitute > for it. Unfortunately strncasecmp is not usable with multibyte charsets: > Linux systems have wcsncasecmp but that is not portable. In these days of > widespread use of UTF-8 that is a blocking issue, I am afraid. What could help are the functions mbrtowc and towctrans and simple long integer comparison. Are the functions mbrtowc and towctrans available under Windows? mbrtowc seems to be available as Rmbrtowc in src/gnuwin32/extra.c. I did not find towctrans defined in R sources, but it is in gnuwin32/Rdll.hide and used in do_tolower. Does this mean that tolower is not usable with utf-8 under Windows? > In the case of grep I think all you need is > > grep(tolower(pattern), tolower(x), fixed = TRUE) > > and similarly for regexpr. Yes. this is correct, but it has disadvantages. It needs more space and, if value=TRUE, we would have to do something like x[grep(tolower(pattern), tolower(x), fixed = TRUE, value=FALSE)] This is hard to implement in src/library/base/R/grep.R, where the call to .Internal(grep(pattern,...)) is the last command and I think this should be preserved. > >Ignore case option is not meaningfull in gsub. > > sub("abc", "123", c("ABCD", "abcd"), ignore.case=TRUE) > > is different from 'ignore.case=FALSE', and I see the meaning as clear. > So what did you mean? (Unfortunately the tolower trick does not work for > [g]sub.) The meaning of ignore.case in [g]sub is problematic due to the following. sub("abc", "xyz", c("ABCD", "abcd"), ignore.case=TRUE) produces [1] "xyzD" "xyzd" but the user may in fact need the following [1] "XYZD" "xyzd" It is correct that "xyzD" "xyzd" is produced, but the user should be aware of the fact that several substitutions like x <- sub("abc", "xyz", c("ABCD", "abcd")) # ignore.case=FALSE sub("ABC", "XYZ", x) # ignore.case=FALSE may be more useful. I have another question concerning the speed of grep. I expected that fgrep_one function is slower than calling a library routine for regular expressions. In particular, if the pattern has a lot of long partial matches in the target string, I expected that it may be much slower. A short example is y <- "ab" x <- "aaab" grep(y,x) which requires 110 comparisons (10 comparisons for each of 11 possible beginnings of y in x). In general, the complexity in the worst case is O(m*n), where m,n are the lengths of y,x resp. I would expect that the library function for matching regular expressions needs time O(m+n) and so will be faster. However, the result obtained on a larger example is > x1 <- paste(c(rep("a", times = 1000), "b"), collapse = "") > x2 <- paste(c("b", rep("a", times = 1000)), collapse = "") > y <- paste(c(rep("a", times = 1), x2), collapse = "") > z <- rep(y, times = 100) > system.time(i <- grep(x1, z, fixed = T)) [1] 1.970 0.000 1.985 0.000 0.000 > system.time(i <- grep(x1, z, fixed = F)) # reg. expr. surprisingly slow (*) [1] 40.374 0.003 40.381 0.000 0.000 > system.time(i <- grep(x2, z, fixed = T)) [1] 0.113 0.000 0.113 0.000 0.000 > system.time(i <- grep(x2, z, fixed = F)) # reg. expr. faster than fgrep_one [1] 0.019 0.000 0.019 0.000 0.000 Do you have an explanation of these results, in particular (*)? Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point values
> > Well, okay, now what about dump, write.table, save, etc? > > save() uses the required precision. For exp(1) it stores > "2.718281828459045" and you will see that > > exp(1) == 2.718281828459045 is TRUE > save(...,ascii=TRUE) uses 16 digit precision, but this seems not to be sufficient. In R-2.5.0, I obtained: > otab <- data.frame(val=1+1/(1:1000)); > ntab <- otab; > save(otab,file="save.txt",ascii=TRUE); > rm(otab); > load("save.txt"); > cbind(table(otab[,1]-ntab[,1])) [,1] -4.44089209850063e-16 159 -2.22044604925031e-16 220 0 240 2.22044604925031e-16 213 4.44089209850063e-16 168 So, most of the numbers are changed. The changes are within the last two bits. The last bit represents the difference 2^(-52) = 2.220446e-16. Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
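A hedged illustration of the same point at the R level, using the data from the example above (17 significant digits are enough for a round trip, 16 generally are not):

x <- 1 + 1/(1:1000)
all(as.double(formatC(x, digits = 16, width = -1)) == x)   # FALSE: many values change in a 16-digit round trip
all(as.double(formatC(x, digits = 17, width = -1)) == x)   # TRUE with correctly rounded conversions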
Re: [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point values
This is an addition to my previous message. 16 digits are indeed not sufficient to represent a double value exactly. The next table is calculated in the bc calculator. It shows that if we restrict (or round) double values to 16 decimal digits then from 4 to 5 consecutive distinct double values get the same 16-digit decimal code. 1.234567890123456 a 16-digit number for comparison 1. 1 + 0*2^(-52) 1.0000000000000002220446049250 1 + 1*2^(-52) 1.0000000000000004440892098500 1 + 2*2^(-52) 1.0000000000000006661338147750 1 + 3*2^(-52) 1.0000000000000008881784197001 1 + 4*2^(-52) 1.0000000000000011102230246251 1 + 5*2^(-52) 1.0000000000000013322676295501 1 + 6*2^(-52) 1.0000000000000015543122344752 1 + 7*2^(-52) 1.0000000000000017763568394002 1 + 8*2^(-52) 1.0000000000000019984014443252 1 + 9*2^(-52) 1.0000000000000022204460492503 1 + 10*2^(-52) 1.0000000000000024424906541753 1 + 11*2^(-52) 1.0000000000000026645352591003 1 + 12*2^(-52) 1.0000000000000028865798640254 1 + 13*2^(-52) 1.0000000000000031086244689504 1 + 14*2^(-52) 1.0000000000000033306690738754 1 + 15*2^(-52) 1.0000000000000035527136788005 1 + 16*2^(-52) 1.0000000000000037747582837255 1 + 17*2^(-52) 1.0000000000000039968028886505 1 + 18*2^(-52) 1.0000000000000042188474935755 1 + 19*2^(-52) Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
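The same observation can be reproduced in R; a sketch using sprintf() to obtain 16 significant digits:

x <- 1 + (0:19) * 2^-52
length(unique(sprintf("%.15e", x)))   # 5: these twenty consecutive doubles give only 5 distinct 16-digit codes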
[Rd] documented/undocumented behavior of as.double(formatC(x, digits=17))
Some days ago, there was a discussion about the command formatC(exp(1),digits=100,width=-1) Converting a double value to a string, from which the double may be reconstructed exactly, may be useful. So, I did some experimentation with it in my linux installation of R-2.5.0. I generated a vector x of a large number of random doubles (random sign, random mantissa with 53 significant bits and binary exponent from -1074:1023) and did the following y <- formatC(x,digits=17,width=-1) z <- as.double(y) all(x == z) # TRUE On Wed, May 23, 2007 at 06:32:36PM +0100, Prof Brian Ripley wrote: > However, the C99 standard does make clear that [sf]printf is not required > (even as 'recommended practice') to be accurate to more than *_DIG places, > which as ?as.character has pointed out is 15 on the machines R runs on. > > It really is the case that writing out a number to > 15 significant digits > and reading it back in again is not required to give exactly the same > IEC60559 (sic) number, and furthermore there are R platforms for which > this does not happen. I think that this implies that preserving the exact double value in formatC does not follow from C99 standard. Is there some other (e.g. platform specific) documentation that implies this or it has to be used as an undocumented feature? (Well, I do know that the R sources or gcc sources are such a documentation, but I am trying to find some other). Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Determinant function (PR#9715)
> The function ''det'' works improperly for a singular matrix and returns a > non-zero value even if ''solve'' reports singularity. The matrix is very > simple > as shown below. > > A <- diag(rep(c(64,8), c(8,8))) > A[9:16,1] <- 8 > A[1,9:16] <- 8 > > det(A) > #[1] -196608 > solve(A) > #Error in solve.default(A) : system is computationally singular: reciprocal > condition number = 2.31296e-18 The "det" function cannot work properly in the limited precision of floating point numbers. May be, "det" could also do a test of computational singularity and refuse to compute the value similarly as "solve" does. The eigen-values of your matrix are > eigen(A)$values [1] 7.20e+01 6.40e+01 6.40e+01 6.40e+01 6.40e+01 [6] 6.40e+01 6.40e+01 6.40e+01 8.00e+00 8.00e+00 [11] 8.00e+00 8.00e+00 8.00e+00 8.00e+00 8.00e+00 [16] -2.023228e-15 The last value is not zero due to rounding. The determinant is the product of eigenvalues, so we get something large. The problem may also be seen as follows: > det(A/8) [1] -6.98492e-10 This is close to zero as it should be, however, det(A) = det(A/8)*8^16, and this is what we really get: > det(A/8)*8^16 [1] -196608 > det(A) [1] -196608 There are even matrices closer to a diagonal matrix than A with a similar problem: > B <- 20*diag(16); B[1:2,1:2] <- c(25,35,35,49); det(B); [1] 44565.24 All the best, Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
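The analysis above can be checked numerically; a hedged sketch (rcond() gives the same kind of diagnostic that solve() reports):

A <- diag(rep(c(64, 8), c(8, 8))); A[9:16, 1] <- 8; A[1, 9:16] <- 8
det(A / 8) * 8^16   # reproduces det(A): rounding noise scaled up by the factor 8^16
rcond(A)            # roughly of the order 1e-18, flagging the numerical singularity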
Re: [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point values
I would like to reply to the following email from May in the thread [Rd] R 2.5.0 refuses to print enough digits to recover exact floating point values On Wed, May 23, 2007 at 06:32:36PM +0100, Prof Brian Ripley wrote: > I think this is a bug in the MacOS X runtime. I've checked the C99 > standard, and can see no limits on the precision that you should be able > to specify to printf. > > Not that some protection against such OSes would come amiss. > > However, the C99 standard does make clear that [sf]printf is not required > (even as 'recommended practice') to be accurate to more than *_DIG places, > which as ?as.character has pointed out is 15 on the machines R runs on. Let me use http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf In Section 7.19.6, Recommended practice, paragraph 13, it is specified that rounding to DECIMAL_DIG should be done correctly. In Annex F.5 Binary-decimal conversion, it is explicitly stated that conversion from the widest supported IEC 60559 format to decimal with DECIMAL_DIG digits and back is the identity function. In footnote 305, it is specified that if double (53 bit precision) is the widest format supported, then DECIMAL_DIG >= 17. R uses double precision as default, so I think DECIMAL_DIG = 17 is a reasonable value. This is the minimum number of decimal digits, for which the conversion double -> string -> double is value preserving, if rounding is correct. DBL_DIG = 15 is another constant, which is the maximum number of digits, for which the conversion string -> double -> string may be done value preserving. > It really is the case that writing out a number to > 15 significant digits > and reading it back in again is not required to give exactly the same > IEC60559 (sic) number, and furthermore there are R platforms for which > this does not happen. What Mr Weinberg claimed is 'now impossible' never > was possible in general (and he seems to be berating the R developers for > not striving to do better than the C standard requires of OSes). In fact, > I believe this to be impossible unless you have access to extended > precsion arithmetic, as otherwise printf/scanf have to use the same > arithmetic as the computations. > > This is why R supports transfer of floating-point numbers via readBin and > friends, and uses a binary format itself for save() (by default). > > I should also say that any algorithm that depends on that level of details > in the numbers will depend on the compiler used and optimization level and > so on. Don't expect repeatability to that level even with binary data > unless you (are allowed to) never apply bugfixes to your system. Repeatability can hardly be achieved among different R installations, due to the reasons you explain. However, repeatability of a computation within the same installation may be achieved and may be sometimes useful. For example it may be useful for debugging, similarly as set.seed. Saving the data in binary is a possible approach for this, however, decimal numbers in a text may serve this purpose equally well, if they are precise enough. I would like to ask a question concerning the function scientific in src/main/format.c, which is called from formatReal in the same file. Function scientific, besides other things, determines, whether a smaller number of significant digits than R_print.digits is sufficient to get a decimal number with relative error at most max(10^-R_print.digits,2*DBL_EPSILON), where DBL_EPSILON=2^-52. 
For simplicity, let me consider only the case, when R_print.digits = DBL_DIG (which is 15). Then the bound for the relative error is 1e-15, since 1e-15 > 2*2^-52. There are two error bounds in consideration: (1) the absolute error of the rounded mantissa (significand) is at most 5e-15, (2) the relative error of the rounded number is at most 1e-15. It is natural to consider also (1), since R_print.digits = 15 and 5e-15 is the maximum absolute error of the mantissa for correct rounding to 15 significant digits. If the mantissa is in [1,5], then (2) => (1). Hence, for these mantissas, function scientific should suggest a smaller number of digits than 15 only if the result is as accurate as rounding to 15 digits. On the other hand, if the mantissa is in (5,10), then (2) does not imply (1). Hence, here we sometimes do not get the full precision 15 digit numbers. Is this behaviour the intention? This has, for example, the following consequence: if a 15-digit number is read into R using read.table and then written back to a text using write.table, we need not get the same result. For testing, I use the following platforms: FLAGS means CFLAGS,FFLAGS,CXXFLAGS,FCFLAGS on Linuxes, R version is 2.5.1 RC (2007-06-23 r42041), on Windows, R version is binary distribution of R-2.5.1 (RC). [1] AMD Athlon(tm) 64 Processor 3500+, SUSE 10.1, gcc 4.1.0, FLAGS = -g -O2 -march=pentium4 -mfpmath=sse [2] AMD Athlon(tm) 64 Processor 3500+, SUSE 10.1,
[Rd] make check for 2.5.1 RC fails on mbcsToSbcs in graphics
configure and make run OK, but make check failed for R version 2.5.1 RC (2007-06-26 r42068) on graphics with error: > ## The following two examples use latin1 characters: these may not > ## appear correctly (or be omitted entirely). > plot(1:10, 1:10, main = "text(...) examples\n~~", + sub = "R is GNU ©, but not ® ...") Error in title(...) : conversion failure in 'mbcsToSbcs' Execution halted The whole tests/Examples/graphics-Ex.Rout.fail is at http://www.cs.cas.cz/~savicky/R-devel/graphics-Ex.Rout.fail The end of make check report is: make[5]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/src/library' running code in 'grDevices-Ex.R' ... OK collecting examples for package 'graphics' ... make[5]: Entering directory `/home/petr/R/DEVEL/R-rc-2007-06-26/src/library' >>> Building/Updating help pages for package 'graphics' Formats: text html latex example make[5]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/src/library' running code in 'graphics-Ex.R' ...make[4]: *** [graphics-Ex.Rout] Error 1 make[4]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests/Examples' make[3]: *** [test-Examples-Base] Error 2 make[3]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests/Examples' make[2]: *** [test-Examples] Error 2 make[2]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests' make[1]: *** [test-all-basics] Error 1 make[1]: Leaving directory `/home/petr/R/DEVEL/R-rc-2007-06-26/tests' make: *** [check] Error 2 Otherwise, the installation works. R.Version platform i686-pc-linux-gnu arch i686 os linux-gnu system i686, linux-gnu status RC major 2 minor 5.1 year 2007 month 06 day 26 svn.rev 42068 language R version.string R version 2.5.1 RC (2007-06-26 r42068) System SUSE LINUX 10.0, gcc version 4.0.2 20050901 (prerelease) (SUSE Linux), default options, $LANG = cs_CZ.UTF-8 $LC_MESSAGES = POSIX $LC_TIME = POSIX no other LC_* set Petr. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] inaccuracy in qbinom with partial argument matching
> > ## partial argument matching:
> > qbinom(p0     , s = 3,p= 0.25) ## 1 ???
> > qbinom(p0-0.05, s = 3,p= 0.25) ## 1 ???
> > qbinom(p0-0.06, s = 3,p= 0.25) ## 0 o.K.
> >
> > Unfortunately I have no I idea how to fix this.
>
> You use a call that specifies your intentions accurately. This is not
> 'partial argument matching': 'p' is an exact match to the first argument
> of
>
> > args(qbinom)
> function (p, size, prob, lower.tail = TRUE, log.p = FALSE)
>
> and that is how argument matching in R is documented to work.
>
> The 'inaccuracy' is in the diagnosis: please see the FAQ.

Let me add an explanation of why qbinom(p0, s = 3, p = 0.25) does not produce an error message about a missing "prob" argument: since the "size" and "p" arguments are given, p0 is used for the third argument and not for the first. Although the behavior is logical, it may not be immediately clear. I do not see this case explicitly in the FAQ or in R-intro.pdf 10.3.

Petr.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
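The matching order can be seen with a small toy function; f below is only for illustration, but qbinom's arguments are matched the same way.

f <- function(p, size, prob) list(p = p, size = size, prob = prob)
f(0.9, s = 3, p = 0.25)
# $p    : 0.25   ('p' is an exact match to the first formal argument)
# $size : 3      ('s' is a partial match to 'size')
# $prob : 0.9    (the unnamed 0.9 fills the only remaining formal, 'prob')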
Re: [Rd] make check for 2.5.1 RC fails on mbcsToSbcs in graphics
On Wed, Jun 27, 2007 at 04:20:39PM +0100, Prof Brian Ripley wrote:
> But you can't really avoid it if you want to test non-ASCII
> features: for example, you cannot test Latin-1 graphics in a C locale.
>
> It was intended that the postscript device opened for testing graphics was
> opened with encoding="ISOLatin1". That should certainly have helped, at
> least in a UTF-8 locale, but unfortunately there were a couple of typos in
> tests/Examples/Makefile. Fixing that file makes 'make check' work in
> cs_CZ.utf8 at least on my machine.
>
> It still remains that (for examples) you cannot successfully run make
> check in a MBCS locale without iconv, nor will zh_CN.gbk nor zh_TW.big5
> work. I've added a note to R-admin to say that you may need to try a
> suitable locale.

I applied the following patch, obtained by comparing R-devel_2007-06-26 and R-devel_2007-06-27:

--- D070626/R-devel/tests/Examples/Makefile.in 2007-05-10 17:50:54.0 +0200
+++ D070627/R-devel/tests/Examples/Makefile.in 2007-06-27 17:50:09.0 +0200
@@ -82,8 +82,8 @@
 	@if test -f $@; then mv $@ [EMAIL PROTECTED]; fi
 	@echo $(ECHO_N) "running code in '$<' ...$(ECHO_C)"
 	@echo 'tools:::.convert_examples("graphics-Ex.R", "graphics-Ex.R-locale", "latin1")' | $(R) --slave > /dev/null 2>&1 || exit 1
-	@sed -e 's/grDevices::postscript("graphics-Ex.ps")/grDevices::postscript("graphics-Ex.ps",encoding="ISOlatin1")/' graphics-Ex.R-locale \
-	  | $(R) < graphics-Ex.R-locale > $@ 2>&1 || (mv $@ [EMAIL PROTECTED] && exit 1)
+	@sed -e 's/grDevices::postscript("graphics-Ex.ps")/grDevices::postscript("graphics-Ex.ps",encoding="ISOLatin1")/' graphics-Ex.R-locale \
+	  | $(R) > $@ 2>&1 || (mv $@ [EMAIL PROTECTED] && exit 1)
 	@rm graphics-Ex.R-locale
 	@echo "$(ECHO_T) OK"
 	@if test -f [EMAIL PROTECTED]; then \

to my copy of R-2.5.1. Then make check works with no error in the cs_CZ.UTF-8 locale also on my machine. Thank you for the correction.

Petr.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Shapiro Test P Value Incorrect? (PR#9768)
This is not a bug. The algorithm uses different approximations of the p-value for n = 3 (exact value), 4 <= n <= 11 and n >= 12, as can be seen in src/library/stats/src/swilk.c below line 202,

    /* Calculate significance level for W */

The W statistic monotonically decreases in the presented example.

Petr.

> Full_Name: Jason Polak
> Version: R version 2.5.0 (2007-04-23)
> OS: Xubuntu 7.04
> Submission from: (NULL) (137.122.144.35)
>
>
> Dear R group,
>
> I have noticed a strange anomaly with the shapiro.test() function.
> Unfortunately
> I do not know how to calculate the shapiro test P values manually so I don't
> know if this is an actual bug.
>
> So, to produce the results, run the following code:
>
> pvalues = 0;
> for (i in 1:17)
> {
> j = 1:(i+3);
> pvalues[i]=shapiro.test(j)$p;
> }
>
> plot(pvalues);
> print(pvalues);
>
> Now I just made the graph to illustrate that the p-values are strictly
> decreasing. To me this makes intuitive sense: we are using the Shapiro test to
> test normality of (1,2,3,4),(1,2,3,4,5), and so on. So the p-value should
> decrease.
>
> These are the p-values:
> [1] 0.9718776 0.9671740 0.9605557 0.9492892 0.9331653 0.9135602 0.8923668
> [8] 0.8698419 0.8757315 0.8371814 0.7964400 0.7545289 0.7123167 0.6704457
> [15] 0.6294307 0.5896464 0.5513749
>
> However, there is an increase in p-value when you go from (1,..,11) to
> (1,..,12). Is this just a quirk of the Shapiro test, or is there an error in
> the
> calculation algorithm?
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
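The switch between the two approximations can be seen directly with a quick check (the exact p-values depend on the R version):

shapiro.test(1:11)$p.value   # n = 11: last sample size using the 4 <= n <= 11 approximation
shapiro.test(1:12)$p.value   # n = 12: first sample size using the n >= 12 approximation
# the small upward step between these two values is the jump reported above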
Re: [Rd] sweep sanity checking?
The suggestion sounds reasonable to me. Let me add that sweep is written to work even if MARGIN includes more than one dimension. To handle these cases correctly, the test may be replaced e.g. by

if (check.margin && prod(dims[MARGIN]) != length(STATS)) {
    warning("length(STATS) != prod(dim(x)[MARGIN])")
} else if (prod(dims[MARGIN]) %% length(STATS) != 0)
    warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")

or even by

dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
if (check.margin && any(dims[MARGIN] != dimstat)) {
    warning("length(STATS) or dim(STAT) do not match dim(x)[MARGIN]")
} else if (prod(dims[MARGIN]) %% length(STATS) != 0)
    warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")

Petr.

> Just an opinion from an R user: I think it's a sound idea. I use my own
> version of sweep with a stricter check: it stops if the vector is not
> exactly the right length.
>
> -- Tony Plate
>
> Ben Bolker wrote:
> > Ben Bolker zoo.ufl.edu> writes:
> >
> >
> >> What would R-core think of the following 'enhanced'
> >> sweep?
> >>
> >
> > (now posted at
> > http://wiki.r-project.org/rwiki/doku.php?id=rdoc:base:sweep
> > )
> >
> > It always warns if dim(x)[MARGIN] is
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
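For context, the silent recycling these tests are meant to flag looks like this with the sweep of that time, which performed no check at all (a minimal illustration, not part of the proposal):

m <- matrix(1:6, nrow = 3)
sweep(m, 1, 1:2)   # length(STATS) = 2 neither matches nor divides nrow(m) = 3,
                   # yet the old sweep recycled STATS and returned a result silently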
Re: [Rd] sweep sanity checking?
I am sorry for an incomplete proposal. The stricter check

if (check.margin && any(dims[MARGIN] != dimstat)) {

was meant to be

if (check.margin && (length(dimstat) != length(MARGIN) || any(dims[MARGIN] != dimstat))) {

Petr.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] sweep sanity checking?
I would like to suggest a replacement for the current function sweep based on the two previous suggestions posted at https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html and http://wiki.r-project.org/rwiki/doku.php?id=rdoc:base:sweep with some extensions.

My suggestion is to use one of the following two variants. They differ in whether length(STATS) == 1 is allowed without a warning in the stricter (default) check or not. I do not know which is better.

sweep1 <- function (x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...)
{
    FUN <- match.fun(FUN)
    dims <- dim(x)
    dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
    if (length(MARGIN) < length(dimstat)) {
        STATS <- drop(STATS)
        dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
    }
    if (check.margin && length(STATS) > 1 &&
        (length(dimstat) != length(MARGIN) || any(dims[MARGIN] != dimstat))) {
        warning("length(STATS) or dim(STAT) do not match dim(x)[MARGIN]")
    } else if (prod(dims[MARGIN]) %% length(STATS) != 0)
        warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")
    perm <- c(MARGIN, (1:length(dims))[-MARGIN])
    FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
}

sweep2 <- function (x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...)
{
    FUN <- match.fun(FUN)
    dims <- dim(x)
    dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
    if (length(MARGIN) < length(dimstat)) {
        STATS <- drop(STATS)
        dimstat <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
    }
    if (check.margin &&
        (length(dimstat) != length(MARGIN) || any(dims[MARGIN] != dimstat))) {
        warning("length(STATS) or dim(STAT) do not match dim(x)[MARGIN]")
    } else if (prod(dims[MARGIN]) %% length(STATS) != 0)
        warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")
    perm <- c(MARGIN, (1:length(dims))[-MARGIN])
    FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
}

The functions are tested on the following examples ("-" means no warning, and "c=F" abbreviates check.margin=FALSE).

a <- array(1:64, dim=c(4,4,4)); M <- c(1,3)

b                          sweep1(a,M,b)  sweep2(a,M,b)  sweep1(a,M,b,c=F)  sweep2(a,M,b,c=F)
a[,2,]                     -              -              -                  -
a[1:2,2,]                  warning        warning        -                  -
1                          -              warning        -                  -
1:3                        warning        warning        warning            warning
1:16                       warning        warning        -                  -

a <- matrix(1:8, nrow=4, ncol=2); M <- 1

b                          sweep1(a,M,b)  sweep2(a,M,b)  sweep1(a,M,b,c=F)  sweep2(a,M,b,c=F)
1                          -              warning        -                  -
1:2                        warning        warning        -                  -
1:3                        warning        warning        warning            warning
1:4                        -              -              -                  -
matrix(1:4,nrow=1) # (A)   -              -              -                  -
matrix(1:4,ncol=1) # (B)   -              -              -                  -

a <- matrix(1:8, nrow=4, ncol=2); M <- 2

b                          sweep1(a,M,b)  sweep2(a,M,b)  sweep1(a,M,b,c=F)  sweep2(a,M,b,c=F)
1                          -              warning        -                  -
1:2                        -              -              -                  -
1:3                        warning        warning        warning            warning
1:4                        warning        warning        warning            warning
matrix(1:2,ncol=1) # (A)   -              -              -                  -
matrix(1:2,nrow=1) # (B)   -              -              -                  -

Note that the cases marked (A) do not generate a warning, although they should. This is the cost of using drop(STATS), which allows the cases marked (B) without a warning. In my opinion, cases (B) make sense.

Reproducing the tests is possible using the script http://www.cs.cas.cz/~savicky/R-devel/verify_sweep.R

Petr.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
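For instance, two rows of the first table can be reproduced directly (assuming sweep1 and sweep2 as defined above):

a <- array(1:64, dim = c(4, 4, 4))
sweep1(a, c(1, 3), 1)     # no warning: a scalar STATS is always accepted by sweep1
sweep2(a, c(1, 3), 1)     # warning: sweep2 requires even a scalar to match the margin dimensions
sweep1(a, c(1, 3), 1:3)   # warning under both checks: length 3 neither matches the margin
                          # dimensions nor divides prod(dim(a)[c(1, 3)]) = 16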
Re: [Rd] sweep sanity checking?
I would like to suggest a patch against R-devel-2007-07-24, which modifies the function sweep by including a warning if dim(STATS) is not consistent with dim(x)[MARGIN]. If check.margin=FALSE, a simple test of whether prod(dim(x)[MARGIN]) is a multiple of length(STATS) is performed. If check.margin=TRUE, then a more restrictive test is used, but a limited recycling is still allowed without warning. Besides generating a warning in some situations, there is no other change in the behavior of sweep. The patch is:

--- R-devel_2007-07-24/src/library/base/R/sweep.R 2007-01-04 17:51:53.0 +0100
+++ R-devel_2007-07-24-sweep/src/library/base/R/sweep.R 2007-07-24 10:56:18.0 +0200
@@ -1,7 +1,21 @@
-sweep <- function(x, MARGIN, STATS, FUN = "-", ...)
+sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=FALSE, ...)
 {
     FUN <- match.fun(FUN)
     dims <- dim(x)
+    if (check.margin) {
+        dimmargin <- dims[MARGIN]
+        dimmargin <- dimmargin[dimmargin > 1]
+        dimstats <- if (is.null(dim(STATS))) length(STATS) else dim(STATS)
+        dimstats <- dimstats[dimstats > 1]
+        if (length(dimstats) > length(dimmargin) ||
+            any(dimstats != dimmargin[seq(along.with=dimstats)]))
+            warning("length(STATS) or dim(STATS) do not match dim(x)[MARGIN]")
+    } else if (prod(dims[MARGIN]) %% length(STATS) != 0) {
+        if (length(MARGIN) == 1)
+            warning("dim(x)[MARGIN] is not a multiple of length(STATS)")
+        else
+            warning("prod(dim(x)[MARGIN]) is not a multiple of length(STATS)")
+    }
     perm <- c(MARGIN, (1:length(dims))[ - MARGIN])
     FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
 }

The patch uses the default check.margin=FALSE, since this is more backward compatible. Changing the default to check.margin=TRUE would also be fine with me and also with Ben Bolker, who told me this in a separate email.

Let me include more comments on the stricter test. If check.margin=TRUE, then the patch tests whether (after deleting possible dimensions with only one level) dim(STATS) is a prefix of dim(x)[MARGIN]. Hence, for example, if dim(x)[MARGIN] = c(k1,k2), the cases length(STATS) = 1; dim(STATS) = k1; dim(STATS) = NULL with length(STATS) = k1; and dim(STATS) = c(k1,k2) are accepted without warning. On the other hand, if k1 != k2, then, for example, dim(STATS) = k2 and dim(STATS) = c(k2,k1) generate a warning, although the simple divisibility condition (length(STATS) divides prod(dim(x)[MARGIN])) is satisfied. The warning is generated since, in the last two cases, recycling produces an incorrect or at least suspicious result.

In the simplest case, when length(MARGIN) = 1 and STATS is a vector, the cases accepted by the stricter test without warning are exactly the following two: length(STATS) = 1 and length(STATS) = dim(x)[MARGIN].

I tested the patch using the script http://www.cs.cas.cz/~savicky/R-devel/verify_sweep1.R and Ben Bolker also tested the patch in his environment. I would appreciate knowing the opinion of R core developers on this patch. Thank you in advance.

Petr.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
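To make the prefix rule concrete, the behaviour described above would look roughly as follows (hypothetical calls against the patched sweep, with k1 = 2 and k2 = 3; this is not the sweep shipped in R at the time):

x <- array(1:24, dim = c(2, 3, 4))
sweep(x, c(1, 2), array(1:2, dim = 2), check.margin = TRUE)   # accepted: dim(STATS) = k1 is a prefix of c(k1, k2)
sweep(x, c(1, 2), array(1:3, dim = 3), check.margin = TRUE)   # would warn: k2 alone is not a prefix of c(k1, k2)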
Re: [Rd] sweep sanity checking?
When I was preparing the patch of sweep submitted on July 25, I was unaware of the code by Heather Turner. She suggested a very elegant solution if STATS is a vector and we want to use meaningful recycling in full generality. I would like to suggest a combined solution, which uses Heather Turner's algorithm if check.margin=FALSE (the default) and STATS is a vector, and my previous algorithm if check.margin=TRUE or STATS is an array. The suggestion is

# combined from the original code of sweep without warnings and from
# https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html by Robin Hankin
# https://stat.ethz.ch/pipermail/r-help/2005-June/074001.html by Heather Turner
# https://stat.ethz.ch/pipermail/r-devel/2007-June/046217.html by Ben Bolker
# with some further modifications by Petr Savicky
sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=FALSE, ...)
{
    FUN <- match.fun(FUN)
    dims <- dim(x)
    dimmargin <- dims[MARGIN]
    if (is.null(dim(STATS))) {
        dimstats <- length(STATS)
    } else {
        dimstats <- dim(STATS)
        check.margin <- TRUE
    }
    s <- length(STATS)
    if (s > prod(dimmargin)) {
        warning("length of STATS greater than the extent of dim(x)[MARGIN]")
    } else if (check.margin) {
        dimmargin <- dimmargin[dimmargin > 1]
        dimstats <- dimstats[dimstats > 1]
        if (length(dimstats) > length(dimmargin) ||
            any(dimstats != dimmargin[seq(along.with=dimstats)]))
            warning("length(STATS) or dim(STATS) do not match dim(x)[MARGIN]")
    } else {
        cumDim <- c(1, cumprod(dimmargin))
        upper <- min(cumDim[cumDim >= s])
        lower <- max(cumDim[cumDim <= s])
        if (upper %% s != 0 || s %% lower != 0)
            warning("STATS does not recycle exactly across MARGIN")
    }
    perm <- c(MARGIN, (1:length(dims))[ - MARGIN])
    FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
}

Heather presented four examples testing her code:

sweep(array(1:24, dim = c(4,3,2)), 1, 1:2)     # no warning
sweep(array(1:24, dim = c(4,3,2)), 1, 1:12)    # no warning
sweep(array(1:24, dim = c(4,3,2)), 1, 1:24)    # no warning
sweep(array(1:24, dim = c(4,3,2)), 1:2, 1:3)   # warning

The second and third examples are not really correct, since STATS extends also to dimensions not included in MARGIN. The problem is better visible, for example, in

sweep(array(1:24, dim = c(4,4,3,3,2,2)), c(1,3), 1:12)

where MARGIN clearly has to contain two dimensions explicitly. So, I use the examples with a larger margin corresponding to STATS as follows:

sweep(array(1:24, dim = c(4,3,2)), 1, 1:2)     # no warning
sweep(array(1:24, dim = c(4,3,2)), 1:2, 1:12)  # no warning
sweep(array(1:24, dim = c(4,3,2)), 1:3, 1:24)  # no warning
sweep(array(1:24, dim = c(4,3,2)), 1:2, 1:3)   # warning

The current proposal for sweep indeed gives no warning in the first three examples and gives a warning in the last one.

I did not use the suggestion to call the option warn with default warn = getOption("warn"). The reason is that there are two different decisions: (1) whether to generate a warning, and (2) what to do with the warning if it is generated. The warn option influences (2): the warning may be suppressed, printed after the return to the top level, printed immediately, or it may be converted to an error. I think that the option controlling (2) should not be mixed with an option which controls (1). If R had an option controlling to what extent recycling is allowed, then this could be used, but warn has a different purpose.

I appreciate feedback.

Petr.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
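The vector test at the heart of this combined proposal can also be read in isolation (a standalone sketch; the helper name recycles_exactly is mine and is not part of the patch):

recycles_exactly <- function(dimmargin, s) {
    # STATS of length s recycles "exactly" across MARGIN when s sits on the
    # cumulative-product lattice of dim(x)[MARGIN] (assumes s <= prod(dimmargin))
    cumDim <- c(1, cumprod(dimmargin))
    upper <- min(cumDim[cumDim >= s])
    lower <- max(cumDim[cumDim <= s])
    upper %% s == 0 && s %% lower == 0
}
recycles_exactly(c(4, 3, 2), 2)    # TRUE:  2 divides 4
recycles_exactly(c(4, 3, 2), 12)   # TRUE:  12 = 4 * 3
recycles_exactly(c(4, 3, 2), 5)    # FALSE: would trigger the warning above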
Re: [Rd] Optimization in R
I would like to add a remark and a question.

Remark. There is a part of R that allows the user to select among several methods for the same task and also to add his own C code: random number generation. However, the interface for optimization is more complex. In my opinion, looking for a unified interface for this is desirable, but it is a research problem, not a suggestion for an immediate code modification.

Question. Is there a way to optimize a function written in C using optim? This would be very useful if the optimization needs a lot of iterations. This may be done by defining an R function which does nothing more than calling .C with appropriate parameters, but this loses efficiency. A more efficient solution could be adding a specified entry point (or several, if derivatives are also available), similar to the user-defined random number generator. Then, a parameter of optim could control whether the function to be optimized is fn or the C entry point.

Petr Savicky.

> I don't have an example of that but that does not make it less
> desirable. If one wants to use method 1, 2 or 3 then one can
> use optim with a method= but if one wants to use methods 4
> or 5 then one must use an entirely different function. Surely
> it would be better to be consistent from the user's viewpoint
> and allow all of them to work consistently through the same
> interface.
>
> On 8/4/07, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> > On 04/08/2007 2:53 PM, Gabor Grothendieck wrote:
> > > The example of generic functions.
> >
> > Show me an example where we have a list of ways to do a calculation
> > passed as an argument (analogous to the method argument of optim), where
> > the user is allowed to add his own function to the list.
> >
> > Duncan Murdoch
> >
>
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
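The wrapper pattern mentioned in the question looks roughly like this (a sketch only: "my_objective" is a hypothetical compiled routine registered by the user, and only the calling pattern itself is meant literally):

fn <- function(par) {
    # every evaluation crosses the R/C boundary once, which is the overhead discussed above
    .C("my_objective", n = as.integer(length(par)), par = as.double(par),
       value = double(1))$value
}
# optim(c(1, 1), fn)   # optim would then drive the compiled objective through the R wrapper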
Re: [Rd] Optimization in R
I am sorry for omitting a citation in my previous post. The complete message is as follows (my text unchanged).

PS

I would like to add a remark and a question.

Remark. There is a part of R that allows the user to select among several methods for the same task and also to add his own C code: random number generation. However, the interface for optimization is more complex. In my opinion, looking for a unified interface for this is desirable, but it is a research problem, not a suggestion for an immediate code modification.

Question. Is there a way to optimize a function written in C using optim? This would be very useful if the optimization needs a lot of iterations. This may be done by defining an R function which does nothing more than calling .C with appropriate parameters, but this loses efficiency. A more efficient solution could be adding a specified entry point (or several, if derivatives are also available), similar to the user-defined random number generator. Then, a parameter of optim could control whether the function to be optimized is fn or the C entry point.

Petr Savicky.

On Sat, Aug 04, 2007 at 06:56:47PM -0400, Gabor Grothendieck wrote:
> I don't have an example of that but that does not make it less
> desirable. If one wants to use method 1, 2 or 3 then one can
> use optim with a method= but if one wants to use methods 4
> or 5 then one must use an entirely different function. Surely
> it would be better to be consistent from the user's viewpoint
> and allow all of them to work consistently through the same
> interface.
>
> On 8/4/07, Duncan Murdoch <[EMAIL PROTECTED]> wrote:
> > On 04/08/2007 2:53 PM, Gabor Grothendieck wrote:
> > > The example of generic functions.
> >
> > Show me an example where we have a list of ways to do a calculation
> > passed as an argument (analogous to the method argument of optim), where
> > the user is allowed to add his own function to the list.
> >
> > Duncan Murdoch
> >
>
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Optimization in R
Thank you for your response. This is a good idea. Although I use my own packages, some of them using other parts of the R API, I had missed the optimization routines. Thanks again.

Petr Savicky.

On Mon, Aug 06, 2007 at 07:16:11AM -0700, Thomas Lumley wrote:
> On Mon, 6 Aug 2007, Petr Savicky wrote:
>
> >Question.
> >
> >Is there a way to optimize a function written in C
> >using optim?
>
> The algorithms used by optim are all accessible from C. The manual
> "Writing R Extensions" has a section on "The R API", including the
> optimization routines.
>
> -thomas
>
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] sweep sanity checking?
Thanks to Martin Maechler for his comments and advice and for pointing out the speed problem. Thanks also to Ben Bolker for tests of speed, which confirm that for small arrays a slowdown by a factor of about 1.2 - 1.5 may occur.

Now, I would like to present a new version of sweep, which is simpler and has an option to avoid the test. This is expected to be used in scripts where the programmer is quite sure that the usage is correct and speed is required. The new version differs from the previous one in the following:

1. The option check.margin has a different meaning. It defaults to TRUE and it determines whether the test is performed or not.

2. Since check.margin has the meaning above, it cannot be used to select which test should be performed. This depends on the type of STATS. The suggested sweep function contains two tests:
   - a vector test by Heather Turner, which is used if STATS has no dim attribute and, hence, is a vector (STATS should not be anything else than a vector or an array);
   - an array test, used if STATS has a dim attribute.
   The vector test allows some kinds of recycling, while the array test does not. Hence, in the most common case, where x is a matrix and STATS is a vector, if the user wants to be warned whenever the length of the vector is not exactly the right one, the following call is suggested: sweep(x, MARGIN, as.array(STATS)). Otherwise, a warning will be generated only if length(STATS) does not divide the specified dimension of x, which is nrow(x) (MARGIN=1) or ncol(x) (MARGIN=2). (A short illustration of this calling convention follows the patch below.)

3. If STATS is an array, then the test is more restrictive than in the previous version. It is now required that, after deleting dimensions with only one level, the remaining dimensions coincide. The previous version additionally allowed the cases when dim(STATS) is a prefix of dim(x)[MARGIN], for example, if dim(STATS) = k1 and dim(x)[MARGIN] = c(k1,k2).

The code of the tests in the suggested sweep is based on the previous suggestions
https://stat.ethz.ch/pipermail/r-help/2005-June/073989.html by Robin Hankin
https://stat.ethz.ch/pipermail/r-help/2005-June/074001.html by Heather Turner
https://stat.ethz.ch/pipermail/r-devel/2007-June/046217.html by Ben Bolker
with some further modifications. The modification of sweep.Rd was prepared by Ben Bolker and me.

I would like to encourage everybody who would like to express an opinion on the patch to do it now. In my opinion, the suggested code has stabilized in the sense that I will not modify it unless there is negative feedback.

A patch against R-devel_2007-08-06 is attached. It contains tabs. If they are corrupted by email transfer, use the link http://www.cs.cas.cz/~savicky/R-devel/patch-sweep which is an identical copy.

Petr Savicky.

--- R-devel_2007-08-06/src/library/base/R/sweep.R 2007-07-27 17:51:13.0 +0200
+++ R-devel_2007-08-06-sweep/src/library/base/R/sweep.R 2007-08-07 10:30:12.383672960 +0200
@@ -14,10 +14,29 @@
 # A copy of the GNU General Public License is available at
 # http://www.r-project.org/Licenses/
 
-sweep <- function(x, MARGIN, STATS, FUN = "-", ...)
+sweep <- function(x, MARGIN, STATS, FUN = "-", check.margin=TRUE, ...)
 {
     FUN <- match.fun(FUN)
     dims <- dim(x)
+    if (check.margin) {
+        dimmargin <- dims[MARGIN]
+        dimstats <- dim(STATS)
+        lstats <- length(STATS)
+        if (lstats > prod(dimmargin)) {
+            warning("length of STATS greater than the extent of dim(x)[MARGIN]")
+        } else if (is.null(dimstats)) { # STATS is a vector
+            cumDim <- c(1, cumprod(dimmargin))
+            upper <- min(cumDim[cumDim >= lstats])
+            lower <- max(cumDim[cumDim <= lstats])
+            if (upper %% lstats != 0 || lstats %% lower != 0)
+                warning("STATS does not recycle exactly across MARGIN")
+        } else {
+            dimmargin <- dimmargin[dimmargin > 1]
+            dimstats <- dimstats[dimstats > 1]
+            if (length(dimstats) != length(dimmargin) || any(dimstats != dimmargin))
+                warning("length(STATS) or dim(STATS) do not match dim(x)[MARGIN]")
+        }
+    }
     perm <- c(MARGIN, (1:length(dims))[ - MARGIN])
     FUN(x, aperm(array(STATS, dims[perm]), order(perm)), ...)
 }
--- R-devel_2007-08-06/src/library/base/man/sweep.Rd 2007-07-27 17:51:35.0 +0200
+++ R-devel_2007-08-06-sweep/src/library/base/man/sweep.Rd 2007-08-07 10:29:45.517757200 +0200
@@ -11,7 +11,7 @@
   statistic.
 }
 \usage{
-sweep(x, MARGIN, STATS, FUN="-", \dots)
+sweep(x, MARGIN, STATS
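The calling convention from point 2 can be illustrated as follows (hypothetical calls against the patched sweep proposed above; the sweep in R at that time accepted all three calls silently):

x <- matrix(1:12, nrow = 4)
sweep(x, 1, 1:2)             # vector test: no warning, since 2 divides nrow(x) = 4
sweep(x, 1, 1:3)             # vector test: would warn, 3 does not recycle exactly across 4
sweep(x, 1, as.array(1:2))   # array test requested via as.array: would warn, since 2 != nrow(x)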
Re: [Rd] [Fwd: behavior of L-BFGS-B with trivial function triggers bug in stats4::mle]
On Mon, Aug 13, 2007 at 05:49:51PM -0400, Ben Bolker wrote:
[snip]
> an undefined condition), but it leads to a bug in stats4::mle --
> a spurious error saying that a better fit
> has been found during profiling if one tries to profile
> a 1-parameter model that was originally fitted with "L-BFGS-B".
[snip]

Could you also include a script that reproduces the problem? Just to see under which conditions the problem occurs and what it looks like exactly.

Petr Savicky.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] paste() with NAs .. change worth persuing?
On Wed, Aug 22, 2007 at 08:53:39PM +0300, Jari Oksanen wrote:
>
> On 22 Aug 2007, at 20:16, Duncan Murdoch wrote:
> > A fairly common use of paste is to put together reports for human
> > consumption. Currently we have
> >
> >> p <- as.character(NA)
> >> paste("the value of p is", p)
> > [1] "the value of p is NA"
> >
> > which looks reasonable. Would this become
> >
> >> p <- as.character(NA)
> >> paste("the value of p is", p)
> > [1] NA
> >
> > under your proposal? (In a quick search I was unable to find a real
> > example where this would happen, but it would worry me...)
>
> At least stop() seems to include such a case:
>
> message <- paste(args, collapse = "")
>
> and we may expect there are NAs sometimes in stop().

The examples show that changing the behavior of paste in general may not be appropriate. On the other hand, if we concatenate character vectors that are part of the data, then is.na(paste(...,NA,...)) makes sense.

Character vectors in data are usually represented by factors. On the other hand, factors are not typical in cases where paste is used to produce a readable message. Hence, it could be possible to have is.na(u[i]) for those i for which some of the vectors v1, ..., vn in u <- paste(v1,...,vn) is a factor and has NA at the i-th position.

Petr Savicky.
__
R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
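A rough sketch of what such a factor-aware variant could look like as a user-level wrapper (the name paste_na and its simplifying assumption are mine; this is not a proposal for the actual C-level change):

paste_na <- function(...) {
    args <- list(...)
    out <- do.call(paste, args)
    is_fac <- vapply(args, is.factor, logical(1))
    if (any(is_fac)) {
        # set the result to NA wherever any factor argument is NA
        # (assumes each factor argument already has the full result length;
        #  general recycling is ignored in this sketch)
        na_pos <- Reduce(`|`, lapply(args[is_fac], is.na))
        out[na_pos] <- NA
    }
    out
}
f <- factor(c("a", NA, "b"))
paste_na("value:", f)   # "value: a" NA "value: b", whereas paste() gives "value: NA" in the middle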