Re: [Rd] A different error in sample()

2018-09-20 Thread Martin Maechler
> Wolfgang Huber 
> on Thu, 20 Sep 2018 08:47:47 +0200 writes:

> FWIW, I suspect this is related to the function
> R_unif_index that was introduced in src/main/RNG.c around
> revision 72356, or the way this function is used in
> do_sample in src/main/random.c.

Yes, it is just the use of 'dn' instead of 'n'
- a one letter thinko I'd say.

But *no*, it's much older than revision 72356; e.g., it's already in

R version 3.0.0 (2013-04-03) -- "Masked Marvel"

but not yet in

R version 2.15.3 (2013-03-01) -- "Security Blanket"



Here, I clearly think we see a regression bug, and hopefully not
one that should trigger often in practice...
and  -- without any statistics about the consequences out in
package space --
I do think we should fix this in code and let the documentation
become "great again" ;-)

Martin





> 20.9.18 08:19, Wolfgang Huber scripsit:
>> Besides wording of the documentation re truncating vs
>> rounding, there is something peculiar going on with the
>> fractional part of n:
>> 
>> > table(sample.int(2.5, 1e6, replace = TRUE))
>> 
>>  1  2  3 399051 401035 199914
>> 
>> > table(sample.int(3, 1e6, replace = TRUE))
>> 
>>  1  2  3 332956 332561 334483
>> 
>> > table(sample.int(2.01, 1e6, replace = TRUE))
>> 
>>  1  2  3 497173 497866   4961
>>

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] A different error in sample()

2018-09-20 Thread Kim Seonghyun
Hi,

I have not checked the source code, but I think it is because of banker's round.

https://en.wikipedia.org/wiki/Rounding#Round_half_to_even

Best regards,
Kim

-Original Message-
From: R-devel  On Behalf Of Dario Strbenac
Sent: den 20 september 2018 03:00
To: r-devel 
Subject: Re: [Rd] A different error in sample()

Good day,

The use of "rounding" also doesn't make sense. If The number is halfway between 
two integers, it is rounded to the nearest even integer.

> round(2.5)
[1] 2

--
Dario Strbenac
University of Sydney
Camperdown NSW 2050
Australia
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] A different error in sample()

2018-09-20 Thread lmo via R-devel
Although it seems to be pretty weird to enter a numeric vector of length one 
that is not an integer as the first argument to sample(), the results do not 
seem to match what is documented in the manual. In addition, the results below 
do not support the use of round rather than truncate in the documentation. 
Consider the code below.
The first sentence in the details section says: "If x has length 1, is numeric 
(in the sense of is.numeric) and x >= 1, sampling via sample takes place from 
1:x."
In the console:> 1:2.001
[1] 1 2
> 1:2.9
[1] 1 2

truncation:
> trunc(2.9)
[1] 2

So, this seems to support the quote from in previous emails: "Non-integer 
positive numerical values of n or x will be truncated to the next smallest 
integer, which has to be no larger than .Machine$integer.max."
However, again in the console:> set.seed(123)
 > table(sample(2.001, 1, replace=TRUE))

   1    2    3 
5052 4941    7

So, neither rounding nor truncation is occurring. Next, define a sequence.
> x <- seq(2.001, 2.51, length.out=20)
Now, grab all of the threes from sample()-ing this sequence.

 > set.seed(123)
> threes <- sapply(x, function(y) table(sample(y, 1, replace=TRUE))[3])

Check for NAs (I cheated here and found a nice seed).> any(is.na(threes))
[1] FALSE
Now, the (to me) disturbing result.

> is.unsorted(threes)
[1] FALSE

or equivalently

> all(diff(threes) > 0)
[1] TRUE

So the number of threes grows monotonically as 2.001 moves to 2.5. As I hinted 
above, the monotonic growth is not assured. My guess is that the growth is 
stochastic and relates to some "probability weighting" based on how close the 
element of x is to 3. Perhaps this has been brought up before, but it seems 
relevant to the current discussion.
A potential aid to this issue would be something like
if(length(x) == 1 && !all.equal(x, as.integer(x))) warning("It is a bad idea to 
use vectors of length 1 in the x argument that are not integers.")
Hope that helps,luke

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] segfault issue with parallel::mclapply and download.file() on Mac OS X

2018-09-20 Thread Seth Russell
I have an lapply function call that I want to parallelize. Below is a very
simplified version of the code:

url_base <- "https://cloud.r-project.org/src/contrib/";
files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
res <- parallel::mclapply(files, function(s) download.file(paste0(url_base,
s), s))

Instead of download a couple of files in parallel, I get a segfault per
process with a 'memory not mapped' message. I've been working with Henrik
Bengtsson on resolving this issue and he recommended I send a message to
the R-Devel mailing list.

Here's the output:

trying URL 'https://cloud.r-project.org/src/contrib/A3_1.0.0.tar.gz'
trying URL 'https://cloud.r-project.org/src/contrib/ABC.RAP_0.9.0.tar.gz'

 *** caught segfault ***
address 0x11575ba3a, cause 'memory not mapped'

 *** caught segfault ***
address 0x11575ba3a, cause 'memory not mapped'

Traceback:
 1: download.file(paste0(url_base, s), s)
 2: FUN(X[[i]], ...)
 3: lapply(X = S, FUN = FUN, ...)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: tryCatchList(expr, classes, parentenv, handlers)
 7: tryCatch(expr, error = function(e) {call <- conditionCall(e)if
(!is.null(call)) {if (identical(call[[1L]], quote(doTryCatch)))
call <- sys.call(-4L)dcall <- deparse(call)[1L]
 prefix <- paste("Error in", dcall, ": ")
LONG <- 75LTraceback:
sm <- strsplit(conditionMessage(e), "\n")[[1L]] 1: w <- 14L
+ nchar(dcall, type = "w") + nchar(sm[1L], type = "w")if (is.na(w))
download.file(paste0(url_base, s), s)w <- 14L + nchar(dcall,
type = "b") + nchar(sm[1L],
type = "b")if (w > LONG)  2: FUN(X[[i]], ...)
 3: lapply(X = S, FUN = FUN, ...)
 4: doTryCatch(return(expr), name, parentenv, handler)
 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
 6: prefix <- paste0(prefix, "\n  ")tryCatchList(expr, classes,
parentenv, handlers)
}else prefix <- "Error : " 7: msg <- paste0(prefix,
conditionMessage(e), "\n")tryCatch(expr, error = function(e) {
 .Internal(seterrmessage(msg[1L]))call <- conditionCall(e)if
(!silent && isTRUE(getOption("show.error.messages"))) {if
(!is.null(call)) {cat(msg, file = outFile)if
(identical(call[[1L]], quote(doTryCatch)))
.Internal(printDeferredWarnings())call <- sys.call(-4L)}
 dcall <- deparse(call)[1L]invisible(structure(msg, class =
"try-error", condition = e))prefix <- paste("Error in", dcall, ":
")})LONG <- 75Lsm <- strsplit(conditionMessage(e),
"\n")[[1L]]
w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")
   if (is.na(w))  8: w <- 14L + nchar(dcall, type = "b") +
nchar(sm[1L], try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
   type = "b")
if (w > LONG) prefix <- paste0(prefix, "\n  ") 9:
}sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))else
prefix <- "Error : "
msg <- paste0(prefix, conditionMessage(e), "\n")
 .Internal(seterrmessage(msg[1L]))10: if (!silent &&
isTRUE(getOption("show.error.messages"))) {FUN(X[[i]], ...)cat(msg,
file = outFile)
.Internal(printDeferredWarnings())}11:
invisible(structure(msg, class = "try-error", condition =
e))lapply(seq_len(cores), inner.do)})

12:  8: parallel::mclapply(files, function(s)
download.file(paste0(url_base, try(lapply(X = S, FUN = FUN, ...), silent =
TRUE)s), s))

 9:
sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))Possible
actions:

1: abort (with core dump, if enabled)
2: normal R exit
10: 3: exit R without saving workspace
FUN(X[[i]], ...)4: exit R saving workspace

11: lapply(seq_len(cores), inner.do)
12: parallel::mclapply(files, function(s) download.file(paste0(url_base,
  s), s))

Here's my sessionInfo()

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin16.7.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.3/lib/libopenblasp-r0.3.3.dylib

locale:
[1] en_US/en_US/en_US/C/en_US/en_US

attached base packages:
[1] parallel  stats graphics  grDevices utils datasets  methods
[8] base

loaded via a namespace (and not attached):
[1] compiler_3.5.1

My version of R I'm running was installed via homebrew with "brew install r
--with-java --with-openblas"

Also, the provided example code works as expected on Linux. Also, if I
provide a non-default download method to the download.file() call such as:

res <- parallel::mclapply(files, function(s) download.file(paste0(url_base,
s), s, method="wget"))
res <- parallel::mclapply(files, function(s) download.file(paste0(url_base,
s), s, method="curl"))

It works correctly - no segfault. If I use method="libcurl" it does
segfault.

I'm not sure what steps to take to further narrow down the source of the
error.

Is this a known bug? if not, is this a new bug or an unexpected feat

Re: [Rd] A different error in sample()

2018-09-20 Thread Emil Bode
But do we handle it as an error in what sample does, or how the documentation 
is?
I think what is done now would be best described as "ceilinged", i.e. what 
ceiling() does. But is there an English word to describe this?
Or just use "converted to the next smallest integer"?

But then again, what happens is that the answer is ceilinged, not the input.
I guess the rationale is that multiplying by any integer and then dividing 
should give the same results:
ceiling(sample(n * x, size=1e6, replace = TRUE) / x) gives the same results for 
any integer n and x, it's nice that this also holds for non-integer n.
The most important thing is why people would use sample with a non-integer x, I 
don’t see many use cases.
So I agree with Luke that a warning would be best, regardless of what the docs 
say.

Best regards, 
Emil Bode

Although it seems to be pretty weird to enter a numeric vector of length 
one that is not an integer as the first argument to sample(), the results do 
not seem to match what is documented in the manual. In addition, the results 
below do not support the use of round rather than truncate in the 
documentation. Consider the code below.
The first sentence in the details section says: "If x has length 1, is 
numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes 
place from 1:x."
In the console:> 1:2.001
[1] 1 2
> 1:2.9
[1] 1 2

truncation:
> trunc(2.9)
[1] 2

So, this seems to support the quote from in previous emails: "Non-integer 
positive numerical values of n or x will be truncated to the next smallest 
integer, which has to be no larger than .Machine$integer.max."
However, again in the console:> set.seed(123)
 > table(sample(2.001, 1, replace=TRUE))

   123 
5052 49417

So, neither rounding nor truncation is occurring. Next, define a sequence.
> x <- seq(2.001, 2.51, length.out=20)
Now, grab all of the threes from sample()-ing this sequence.

 > set.seed(123)
> threes <- sapply(x, function(y) table(sample(y, 1, replace=TRUE))[3])

Check for NAs (I cheated here and found a nice seed).> any(is.na(threes))
[1] FALSE
Now, the (to me) disturbing result.

> is.unsorted(threes)
[1] FALSE

or equivalently

> all(diff(threes) > 0)
[1] TRUE

So the number of threes grows monotonically as 2.001 moves to 2.5. As I 
hinted above, the monotonic growth is not assured. My guess is that the growth 
is stochastic and relates to some "probability weighting" based on how close 
the element of x is to 3. Perhaps this has been brought up before, but it seems 
relevant to the current discussion.
A potential aid to this issue would be something like
if(length(x) == 1 && !all.equal(x, as.integer(x))) warning("It is a bad 
idea to use vectors of length 1 in the x argument that are not integers.")
Hope that helps,luke

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] A different error in sample()

2018-09-20 Thread Joris Meys
To be more clear: I do NOT state that the function "round" is used. I read
the documentation as "non integer positive numerical values will be
replaced by the next smallest integer", the important part being the NEXT
smallest integer, i.e. how ceiling() does it. So that's exactly what I
would expect. If "replaced by" causes less confusion than "rounded to" or
"truncated to", then use that.

I do agree that this wording would still indicate that this happens prior
to the sampling, whereas the output indicates that this is done after the
sampling. I can reproduce the sample() outcome using runif() as follows:

> table(ceiling(runif(1,0,2.1)))
   123
4774 4756  470

> table(ceiling(runif(1,0,3)))
   123
3273 3440 3287

I don't know if that's the intended behaviour, but there is some logic in
it. It's up to the R core team to decide if this is OK and rephrase the
help page so it becomes more clear what actually happens, or simply add
something like

if( (x%%1) != 0) x <- ceiling(x)

prior to the sampling algorithm.

Cheers
Joris

On Thu, Sep 20, 2018 at 9:44 AM lmo via R-devel 
wrote:

> Although it seems to be pretty weird to enter a numeric vector of length
> one that is not an integer as the first argument to sample(), the results
> do not seem to match what is documented in the manual. In addition, the
> results below do not support the use of round rather than truncate in the
> documentation. Consider the code below.
> The first sentence in the details section says: "If x has length 1, is
> numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes
> place from 1:x."
> In the console:> 1:2.001
> [1] 1 2
> > 1:2.9
> [1] 1 2
>
> truncation:
> > trunc(2.9)
> [1] 2
>
> So, this seems to support the quote from in previous emails: "Non-integer
> positive numerical values of n or x will be truncated to the next smallest
> integer, which has to be no larger than .Machine$integer.max."
> However, again in the console:> set.seed(123)
>  > table(sample(2.001, 1, replace=TRUE))
>
>123
> 5052 49417
>
> So, neither rounding nor truncation is occurring. Next, define a sequence.
> > x <- seq(2.001, 2.51, length.out=20)
> Now, grab all of the threes from sample()-ing this sequence.
>
>  > set.seed(123)
> > threes <- sapply(x, function(y) table(sample(y, 1, replace=TRUE))[3])
>
> Check for NAs (I cheated here and found a nice seed).> any(is.na(threes))
> [1] FALSE
> Now, the (to me) disturbing result.
>
> > is.unsorted(threes)
> [1] FALSE
>
> or equivalently
>
> > all(diff(threes) > 0)
> [1] TRUE
>
> So the number of threes grows monotonically as 2.001 moves to 2.5. As I
> hinted above, the monotonic growth is not assured. My guess is that the
> growth is stochastic and relates to some "probability weighting" based on
> how close the element of x is to 3. Perhaps this has been brought up
> before, but it seems relevant to the current discussion.
> A potential aid to this issue would be something like
> if(length(x) == 1 && !all.equal(x, as.integer(x))) warning("It is a bad
> idea to use vectors of length 1 in the x argument that are not integers.")
> Hope that helps,luke
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>


-- 
Joris Meys
Statistical consultant

Department of Data Analysis and Mathematical Modelling
Ghent University
Coupure Links 653, B-9000 Gent (Belgium)


---
Biowiskundedagen 2017-2018
http://www.biowiskundedagen.ugent.be/

---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] segfault issue with parallel::mclapply and download.file() on Mac OS X

2018-09-20 Thread Martin Maechler
> Seth Russell 
> on Wed, 19 Sep 2018 15:19:48 -0600 writes:

> I have an lapply function call that I want to parallelize. Below is a very
> simplified version of the code:

> url_base <- "https://cloud.r-project.org/src/contrib/";
> files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
> res <- parallel::mclapply(files, function(s) 
download.file(paste0(url_base,
> s), s))

> Instead of download a couple of files in parallel, I get a segfault per
> process with a 'memory not mapped' message. I've been working with Henrik
> Bengtsson on resolving this issue and he recommended I send a message to
> the R-Devel mailing list.

Thank you for the simple reproducible (*) example.

If I run the above in either R-devel  or R 3.5.1, it works
flawlessly [on Linux Fedora 28].  ah, now I see you say so
much later... also that other methods than "libcurl" work.

To note here is that "libcurl" is also the default method on
Linux where things work.

I've also tried it on the Windows server I've easily access and
the following code -- also explicitly using  "libcurl" --

##--
url_base <- "https://cloud.r-project.org/src/contrib/";
files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
res <- parallel::mclapply(files, function(s)
download.file(paste0(url_base, s), s, method="libcurl"))
##--

works fine there too.

- So maybe this should have gone to the R-SIG-Mac mailing list
  instead of this one ??

- Can other MacOS R users try and see?

--
*) at least till one of the 2 packages gets updated ! ;-)

> Here's the output:

> trying URL 'https://cloud.r-project.org/src/contrib/A3_1.0.0.tar.gz'
> trying URL 'https://cloud.r-project.org/src/contrib/ABC.RAP_0.9.0.tar.gz'

> *** caught segfault ***
> address 0x11575ba3a, cause 'memory not mapped'

> *** caught segfault ***
> address 0x11575ba3a, cause 'memory not mapped'

> Traceback:
> 1: download.file(paste0(url_base, s), s)
> 2: FUN(X[[i]], ...)
> 3: lapply(X = S, FUN = FUN, ...)
> 4: doTryCatch(return(expr), name, parentenv, handler)
> 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
> 6: tryCatchList(expr, classes, parentenv, handlers)
> 7: tryCatch(expr, error = function(e) {call <- conditionCall(e)if
> (!is.null(call)) {if (identical(call[[1L]], quote(doTryCatch)))
> call <- sys.call(-4L)dcall <- deparse(call)[1L]
> prefix <- paste("Error in", dcall, ": ")
> LONG <- 75LTraceback:
> sm <- strsplit(conditionMessage(e), "\n")[[1L]] 1: w <- 14L
> + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")if 
(is.na(w))
> download.file(paste0(url_base, s), s)w <- 14L + nchar(dcall,
> type = "b") + nchar(sm[1L],
> type = "b")if (w > LONG)  2: FUN(X[[i]], ...)
> 3: lapply(X = S, FUN = FUN, ...)
> 4: doTryCatch(return(expr), name, parentenv, handler)
> 5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
> 6: prefix <- paste0(prefix, "\n  ")tryCatchList(expr, classes,
> parentenv, handlers)
> }else prefix <- "Error : " 7: msg <- paste0(prefix,
> conditionMessage(e), "\n")tryCatch(expr, error = function(e) {
> .Internal(seterrmessage(msg[1L]))call <- conditionCall(e)if
> (!silent && isTRUE(getOption("show.error.messages"))) {if
> (!is.null(call)) {cat(msg, file = outFile)if
> (identical(call[[1L]], quote(doTryCatch)))
> .Internal(printDeferredWarnings())call <- sys.call(-4L)}
> dcall <- deparse(call)[1L]invisible(structure(msg, class =
> "try-error", condition = e))prefix <- paste("Error in", dcall, ":
> ")})LONG <- 75Lsm <- strsplit(conditionMessage(e),
> "\n")[[1L]]
> w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")
> if (is.na(w))  8: w <- 14L + nchar(dcall, type = "b") +
> nchar(sm[1L], try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
> type = "b")
> if (w > LONG) prefix <- paste0(prefix, "\n  ") 9:
> }sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))else
> prefix <- "Error : "
> msg <- paste0(prefix, conditionMessage(e), "\n")
> .Internal(seterrmessage(msg[1L]))10: if (!silent &&
> isTRUE(getOption("show.error.messages"))) {FUN(X[[i]], ...)
cat(msg,
> file = outFile)
> .Internal(printDeferredWarnings())}11:
> invisible(structure(msg, class = "try-error", condition =
> e))lapply(seq_len(cores), inner.do)})

> 12:  8: parallel::mclapply(files, function(s)
> download.file(paste0(url_base, try(lapply(X = S, FUN = FUN, ...), silent =
> TRUE)s), s))

> 9:
> sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))Possible
> actions

Re: [Rd] A different error in sample()

2018-09-20 Thread peter dalgaard
Yup, that is a bug, at least in the documentation. Probably a clearer example 
is 

x <- seq(2.001, 2.999, length.out=999)
threes <- sapply(x, function(y) table(sample(y, 1, replace=TRUE))[3])
plot(threes, type="l")
curve(1*(x-2)/x, add=TRUE, col="red")

which is entirely consistent with what you'd expect from floor(runif(1, 0, 
y)) + 1, and as far as I can tell from the source, that is what is happening 
internally. 

(Strict monotonicity is a bit of a red herring, it is jut a matter of having 
spaced the y so far apart that the probability of an order reversal becomes 
negligible.)

So either we should do what the documentation says we do, or the documentation 
should not say that we do what we do not actually do...

The suspect code is this snippet from do_sample:

int n = (int) dn;
.

if (replace || k < 2) {
for (int i = 0; i < k; i++) iy[i] = (int)(R_unif_index(dn) + 1);
} else {
int *x = (int *)R_alloc(n, sizeof(int));
for (int i = 0; i < n; i++) x[i] = i;
for (int i = 0; i < k; i++) {
int j = (int)(R_unif_index(n));
iy[i] = x[j] + 1;
x[j] = x[--n];
}
}

(notice arguments to R_unif_index)

-pd

> On 20 Sep 2018, at 01:53 , lmo via R-devel  wrote:
> 
> Although it seems to be pretty weird to enter a numeric vector of length one 
> that is not an integer as the first argument to sample(), the results do not 
> seem to match what is documented in the manual. In addition, the results 
> below do not support the use of round rather than truncate in the 
> documentation. Consider the code below.
> The first sentence in the details section says: "If x has length 1, is 
> numeric (in the sense of is.numeric) and x >= 1, sampling via sample takes 
> place from 1:x."
> In the console:> 1:2.001
> [1] 1 2
>> 1:2.9
> [1] 1 2
> 
> truncation:
>> trunc(2.9)
> [1] 2
> 
> So, this seems to support the quote from in previous emails: "Non-integer 
> positive numerical values of n or x will be truncated to the next smallest 
> integer, which has to be no larger than .Machine$integer.max."
> However, again in the console:> set.seed(123)
>> table(sample(2.001, 1, replace=TRUE))
> 
>123 
> 5052 49417
> 
> So, neither rounding nor truncation is occurring. Next, define a sequence.
>> x <- seq(2.001, 2.51, length.out=20)
> Now, grab all of the threes from sample()-ing this sequence.
> 
>> set.seed(123)
>> threes <- sapply(x, function(y) table(sample(y, 1, replace=TRUE))[3])
> 
> Check for NAs (I cheated here and found a nice seed).> any(is.na(threes))
> [1] FALSE
> Now, the (to me) disturbing result.
> 
>> is.unsorted(threes)
> [1] FALSE
> 
> or equivalently
> 
>> all(diff(threes) > 0)
> [1] TRUE
> 
> So the number of threes grows monotonically as 2.001 moves to 2.5. As I 
> hinted above, the monotonic growth is not assured. My guess is that the 
> growth is stochastic and relates to some "probability weighting" based on how 
> close the element of x is to 3. Perhaps this has been brought up before, but 
> it seems relevant to the current discussion.
> A potential aid to this issue would be something like
> if(length(x) == 1 && !all.equal(x, as.integer(x))) warning("It is a bad idea 
> to use vectors of length 1 in the x argument that are not integers.")
> Hope that helps,luke
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] segfault issue with parallel::mclapply and download.file() on Mac OS X

2018-09-20 Thread Gábor Csárdi
This code actually happens to work for me on macOS, but I think in
general you cannot rely on performing HTTP requests in fork clusters,
i.e. with mclapply().

Fork clusters create worker processes by forking the R process and
then _not_ executing another R binary. (Which is often convenient,
because the new processes will inherit the memory image of the parent
process.)

Fork without exec is not supported by macOS, basically any calls to
system libraries might crash. (Ie. not just HTTP-related calls.) For
HTTP calls I have seen errors, crashes, and sometimes it works.
Depends on the combination of libcurl version, macOS version and
probably luck.

It usually (always?) works on Linux, but I would not rely on that, either.

So, yes, this is a known issue.

Creating new processes to perform HTTP in parallel is very often bad
practice, actually. Whenever you can, use I/O multiplexing instead,
since the main R process is not doing anything, anyway, just waiting
for the data to come in. So you don't need more processes, you need
parallel I/O. Take a look at the curl::multi_add() etc. functions.

Btw. download.file() can actually download files in parallel if the
liburl method is used, just give it a list of URLs in a character
vector. This API is very restricted, though, so I suggest to look at
the curl package.

GaborOn Thu, Sep 20, 2018 at 8:44 AM Seth Russell
 wrote:
>
> I have an lapply function call that I want to parallelize. Below is a very
> simplified version of the code:
>
> url_base <- "https://cloud.r-project.org/src/contrib/";
> files <- c("A3_1.0.0.tar.gz", "ABC.RAP_0.9.0.tar.gz")
> res <- parallel::mclapply(files, function(s) download.file(paste0(url_base,
> s), s))
>
> Instead of download a couple of files in parallel, I get a segfault per
> process with a 'memory not mapped' message. I've been working with Henrik
> Bengtsson on resolving this issue and he recommended I send a message to
> the R-Devel mailing list.
>
> Here's the output:
>
> trying URL 'https://cloud.r-project.org/src/contrib/A3_1.0.0.tar.gz'
> trying URL 'https://cloud.r-project.org/src/contrib/ABC.RAP_0.9.0.tar.gz'
>
>  *** caught segfault ***
> address 0x11575ba3a, cause 'memory not mapped'
>
>  *** caught segfault ***
> address 0x11575ba3a, cause 'memory not mapped'
>
> Traceback:
>  1: download.file(paste0(url_base, s), s)
>  2: FUN(X[[i]], ...)
>  3: lapply(X = S, FUN = FUN, ...)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: tryCatchList(expr, classes, parentenv, handlers)
>  7: tryCatch(expr, error = function(e) {call <- conditionCall(e)if
> (!is.null(call)) {if (identical(call[[1L]], quote(doTryCatch)))
> call <- sys.call(-4L)dcall <- deparse(call)[1L]
>  prefix <- paste("Error in", dcall, ": ")
> LONG <- 75LTraceback:
> sm <- strsplit(conditionMessage(e), "\n")[[1L]] 1: w <- 14L
> + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")if (is.na(w))
> download.file(paste0(url_base, s), s)w <- 14L + nchar(dcall,
> type = "b") + nchar(sm[1L],
> type = "b")if (w > LONG)  2: FUN(X[[i]], ...)
>  3: lapply(X = S, FUN = FUN, ...)
>  4: doTryCatch(return(expr), name, parentenv, handler)
>  5: tryCatchOne(expr, names, parentenv, handlers[[1L]])
>  6: prefix <- paste0(prefix, "\n  ")tryCatchList(expr, classes,
> parentenv, handlers)
> }else prefix <- "Error : " 7: msg <- paste0(prefix,
> conditionMessage(e), "\n")tryCatch(expr, error = function(e) {
>  .Internal(seterrmessage(msg[1L]))call <- conditionCall(e)if
> (!silent && isTRUE(getOption("show.error.messages"))) {if
> (!is.null(call)) {cat(msg, file = outFile)if
> (identical(call[[1L]], quote(doTryCatch)))
> .Internal(printDeferredWarnings())call <- sys.call(-4L)}
>  dcall <- deparse(call)[1L]invisible(structure(msg, class =
> "try-error", condition = e))prefix <- paste("Error in", dcall, ":
> ")})LONG <- 75Lsm <- strsplit(conditionMessage(e),
> "\n")[[1L]]
> w <- 14L + nchar(dcall, type = "w") + nchar(sm[1L], type = "w")
>if (is.na(w))  8: w <- 14L + nchar(dcall, type = "b") +
> nchar(sm[1L], try(lapply(X = S, FUN = FUN, ...), silent = TRUE)
>type = "b")
> if (w > LONG) prefix <- paste0(prefix, "\n  ") 9:
> }sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE))else
> prefix <- "Error : "
> msg <- paste0(prefix, conditionMessage(e), "\n")
>  .Internal(seterrmessage(msg[1L]))10: if (!silent &&
> isTRUE(getOption("show.error.messages"))) {FUN(X[[i]], ...)cat(msg,
> file = outFile)
> .Internal(printDeferredWarnings())}11:
> invisible(structure(msg, class = "try-error", condition =
> e))lapply(seq_len(cores), inner.do)})
>
> 12:  8: parallel::mclapply(files, function(s)
> download.file(paste0(url_base, try(lapply(X = S, FUN =

[Rd] future time stamps warning

2018-09-20 Thread Leo Lahti
Dear developers,

Upon CRAN submission I have bumped into "future file timestamps" warning
that I can't solve. I have updated the package as usual, and all checks go
through in my system. CRAN reports the following warning however.

* checking for future file timestanps ... WARNING
Files with future time stamps:
  DESCRIPTION
  NAMESPACE
  README.md

The build log is at
https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.7_20180920_004954/Debian/00check.log

The package is at
https://github.com/rOpenGov/eurostat/


I did not find guidance (search engines, R extensions manual)
on how to solve this. Perhaps we would need to generate the
files in a certain order but not sure what is the correct order.

Any tips?

kind regards

Leo Lahti

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] future time stamps warning

2018-09-20 Thread Duncan Murdoch

On 20/09/2018 6:56 AM, Leo Lahti wrote:

Dear developers,

Upon CRAN submission I have bumped into "future file timestamps" warning
that I can't solve. I have updated the package as usual, and all checks go
through in my system. CRAN reports the following warning however.

* checking for future file timestanps ... WARNING
Files with future time stamps:
   DESCRIPTION
   NAMESPACE
   README.md

The build log is at
https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.7_20180920_004954/Debian/00check.log

The package is at
https://github.com/rOpenGov/eurostat/


I did not find guidance (search engines, R extensions manual)
on how to solve this. Perhaps we would need to generate the
files in a certain order but not sure what is the correct order.

Any tips?


What are the time stamps on those files?  You can list the contents of 
the tarball at the command line using


tar ztvf wCorr_1.9.0.tar.gz

Are the timestamps correct?

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] future time stamps warning

2018-09-20 Thread Gábor Csárdi
Have you tried setting your system clock correctly and then Sys.setFileTime()?

Gabor
On Thu, Sep 20, 2018 at 10:58 AM Leo Lahti  wrote:
>
> Dear developers,
>
> Upon CRAN submission I have bumped into "future file timestamps" warning
> that I can't solve. I have updated the package as usual, and all checks go
> through in my system. CRAN reports the following warning however.
>
> * checking for future file timestanps ... WARNING
> Files with future time stamps:
>   DESCRIPTION
>   NAMESPACE
>   README.md
>
> The build log is at
> https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.7_20180920_004954/Debian/00check.log
>
> The package is at
> https://github.com/rOpenGov/eurostat/
>
>
> I did not find guidance (search engines, R extensions manual)
> on how to solve this. Perhaps we would need to generate the
> files in a certain order but not sure what is the correct order.
>
> Any tips?
>
> kind regards
>
> Leo Lahti
>
> [[alternative HTML version deleted]]
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] future time stamps warning

2018-09-20 Thread Leo Lahti
Time stamps are correct and my system time is correct.

I am now tried to use Sys.setFileTime() to update time stamps as proposed.
This does not help.

The windows and debian builds give different reports on the time stamp
issue.
https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.8_20180920_122655/Windows/00check.log
https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.8_20180920_122655/Debian/00check.log

I attached the time stamp listing below.

Leo



lei@kone:~/Rpackages/eurostat/inst/extras$ tar zvtf eurostat_3.2.8.tar.gz
drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/build/
-rw-r--r-- lei/lei 301 2018-09-20 13:23 eurostat/build/vignette.rds
drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/data/
-rw-r--r-- lei/lei  98 2018-09-20 13:23 eurostat/data/datalist
-rwxr-xr-x lei/lei 340 2018-06-17 20:44
eurostat/data/ea_countries.rda
-rwxr-xr-x lei/lei 214 2018-06-17 20:44
eurostat/data/efta_countries.rda
-rwxr-xr-x lei/lei 252 2018-06-17 20:44
eurostat/data/eu_candidate_countries.rda
-rwxr-xr-x lei/lei 431 2018-06-17 20:44
eurostat/data/eu_countries.rda
-rw-r--r-- lei/lei 3651661 2018-08-28 18:44
eurostat/data/eurostat_geodata_60_2016.rda
-rwxr-xr-x lei/lei5596 2018-06-17 20:44 eurostat/data/tgs00026.rda
-rwxr-xr-x lei/lei1447 2018-09-20 13:23 eurostat/DESCRIPTION
drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/inst/
-rwxr-xr-x lei/lei1060 2018-06-17 20:44 eurostat/inst/CITATION
drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/inst/doc/
-rw-r--r-- lei/lei6361 2018-09-20 13:22
eurostat/inst/doc/blogposts.html
-rwxr-xr-x lei/lei 518 2018-06-17 20:44
eurostat/inst/doc/blogposts.Rmd
-rw-r--r-- lei/lei6455 2018-09-20 13:22
eurostat/inst/doc/cheatsheet.html
-rwxr-xr-x lei/lei 610 2018-06-17 20:44
eurostat/inst/doc/cheatsheet.Rmd
-rw-r--r-- lei/lei  246912 2018-09-20 01:43
eurostat/inst/doc/eurostat_tutorial.pdf
-rw-r--r-- lei/lei   10565 2018-09-20 13:23
eurostat/inst/doc/eurostat_tutorial.R
-rwxr-xr-x lei/lei   17177 2018-09-20 00:07
eurostat/inst/doc/eurostat_tutorial.Rmd
-rw-r--r-- lei/lei6834 2018-09-20 13:23
eurostat/inst/doc/publications.html
-rwxr-xr-x lei/lei 942 2018-06-17 20:44
eurostat/inst/doc/publications.Rmd
-rwxr-xr-x lei/lei  92 2018-06-17 20:44 eurostat/LICENSE
drwxr-xr-x lei/lei   0 2018-08-28 18:44 eurostat/man/
-rwxr-xr-x lei/lei 697 2018-06-17 20:44
eurostat/man/clean_eurostat_cache.Rd
-rwxr-xr-x lei/lei 395 2018-06-17 20:44
eurostat/man/convert_time_col.Rd
-rwxr-xr-x lei/lei1276 2018-06-17 20:44
eurostat/man/cut_to_classes.Rd
-rwxr-xr-x lei/lei1133 2018-06-17 20:44 eurostat/man/dic_order.Rd
-rwxr-xr-x lei/lei 788 2018-06-17 20:44 eurostat/man/eu_countries.Rd
-rw-r--r-- lei/lei 826 2018-08-28 18:44
eurostat/man/eurostat_geodata_60_2016.Rd
-rwxr-xr-x lei/lei 805 2018-06-17 20:44
eurostat/man/eurostat-package.Rd
-rwxr-xr-x lei/lei1220 2018-06-17 20:44
eurostat/man/eurotime2date.Rd
-rwxr-xr-x lei/lei 978 2018-06-17 20:44 eurostat/man/eurotime2num.Rd
-rwxr-xr-x lei/lei1169 2018-06-17 20:44
eurostat/man/get_eurostat_dic.Rd
-rwxr-xr-x lei/lei2321 2018-08-28 18:44
eurostat/man/get_eurostat_geospatial.Rd
-rwxr-xr-x lei/lei2675 2018-08-28 18:44
eurostat/man/get_eurostat_json.Rd
-rwxr-xr-x lei/lei1172 2018-06-17 20:44
eurostat/man/get_eurostat_raw.Rd
-rwxr-xr-x lei/lei1191 2018-06-17 20:44
eurostat/man/get_eurostat_toc.Rd
-rwxr-xr-x lei/lei6351 2018-08-28 18:44 eurostat/man/get_eurostat.Rd
-rwxr-xr-x lei/lei 783 2018-06-17 20:44
eurostat/man/harmonize_country_code.Rd
-rwxr-xr-x lei/lei2518 2018-06-17 20:44
eurostat/man/label_eurostat.Rd
-rwxr-xr-x lei/lei2015 2018-06-17 20:44
eurostat/man/search_eurostat.Rd
-rwxr-xr-x lei/lei 453 2018-06-17 20:44
eurostat/man/set_eurostat_toc.Rd
-rwxr-xr-x lei/lei 344 2018-06-17 20:44 eurostat/man/tgs00026.Rd
-rwxr-xr-x lei/lei1696 2018-06-17 20:44
eurostat/man/tidy_eurostat.Rd
-rwxr-xr-x lei/lei1273 2018-09-20 13:21 eurostat/NAMESPACE
drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/R/
-rwxr-xr-x lei/lei1140 2018-06-17 20:44
eurostat/R/clean_eurostat_cache.R
-rwxr-xr-x lei/lei3191 2018-06-17 20:44 eurostat/R/cut_to_classes.R
-rwxr-xr-x lei/lei 638 2018-06-17 20:44 eurostat/R/data_countries.R
-rwxr-xr-x lei/lei4571 2018-08-28 18:44 eurostat/R/data_spatial.R
-rwxr-xr-x lei/lei 210 2018-06-17 20:44 eurostat/R/data.R
-rwxr-xr-x lei/lei1310 2018-06-17 20:44 eurostat/R/dic_order.R
-rwxr-xr-x lei/lei 721 2018-06-17 20:44
eurostat/R/eurostat-package.R
-rwxr-xr-x lei/lei2725 2018-06-17 20:44 eurostat/R/eurotime2date.R
-rwxr-xr-x lei/lei1918 2018-06-17 20:44 eurostat/R/eurotime2num.R
-rwxr-xr-x lei/lei 967 2018-

Re: [Rd] future time stamps warning

2018-09-20 Thread Gábor Csárdi
On Thu, Sep 20, 2018 at 11:46 AM Leo Lahti  wrote:
>
> Time stamps are correct and my system time is correct.
>
> I am now tried to use Sys.setFileTime() to update time stamps as proposed.
> This does not help.
>
> The windows and debian builds give different reports on the time stamp
> issue.
> https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.8_20180920_122655/Windows/00check.log
> https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.8_20180920_122655/Debian/00check.log
>
> I attached the time stamp listing below.

Well, maybe it is an inaccurate system clock on your machine, or on
CRAN, but nevertheless,
for this submission, you can just set the time stamps to a time in the past.

Gabor

> Leo
>
>
>
> lei@kone:~/Rpackages/eurostat/inst/extras$ tar zvtf eurostat_3.2.8.tar.gz
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/build/
> -rw-r--r-- lei/lei 301 2018-09-20 13:23 eurostat/build/vignette.rds
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/data/
> -rw-r--r-- lei/lei  98 2018-09-20 13:23 eurostat/data/datalist
> -rwxr-xr-x lei/lei 340 2018-06-17 20:44
> eurostat/data/ea_countries.rda
> -rwxr-xr-x lei/lei 214 2018-06-17 20:44
> eurostat/data/efta_countries.rda
> -rwxr-xr-x lei/lei 252 2018-06-17 20:44
> eurostat/data/eu_candidate_countries.rda
> -rwxr-xr-x lei/lei 431 2018-06-17 20:44
> eurostat/data/eu_countries.rda
> -rw-r--r-- lei/lei 3651661 2018-08-28 18:44
> eurostat/data/eurostat_geodata_60_2016.rda
> -rwxr-xr-x lei/lei5596 2018-06-17 20:44 eurostat/data/tgs00026.rda
> -rwxr-xr-x lei/lei1447 2018-09-20 13:23 eurostat/DESCRIPTION
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/inst/
> -rwxr-xr-x lei/lei1060 2018-06-17 20:44 eurostat/inst/CITATION
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/inst/doc/
> -rw-r--r-- lei/lei6361 2018-09-20 13:22
> eurostat/inst/doc/blogposts.html
> -rwxr-xr-x lei/lei 518 2018-06-17 20:44
> eurostat/inst/doc/blogposts.Rmd
> -rw-r--r-- lei/lei6455 2018-09-20 13:22
> eurostat/inst/doc/cheatsheet.html
> -rwxr-xr-x lei/lei 610 2018-06-17 20:44
> eurostat/inst/doc/cheatsheet.Rmd
> -rw-r--r-- lei/lei  246912 2018-09-20 01:43
> eurostat/inst/doc/eurostat_tutorial.pdf
> -rw-r--r-- lei/lei   10565 2018-09-20 13:23
> eurostat/inst/doc/eurostat_tutorial.R
> -rwxr-xr-x lei/lei   17177 2018-09-20 00:07
> eurostat/inst/doc/eurostat_tutorial.Rmd
> -rw-r--r-- lei/lei6834 2018-09-20 13:23
> eurostat/inst/doc/publications.html
> -rwxr-xr-x lei/lei 942 2018-06-17 20:44
> eurostat/inst/doc/publications.Rmd
> -rwxr-xr-x lei/lei  92 2018-06-17 20:44 eurostat/LICENSE
> drwxr-xr-x lei/lei   0 2018-08-28 18:44 eurostat/man/
> -rwxr-xr-x lei/lei 697 2018-06-17 20:44
> eurostat/man/clean_eurostat_cache.Rd
> -rwxr-xr-x lei/lei 395 2018-06-17 20:44
> eurostat/man/convert_time_col.Rd
> -rwxr-xr-x lei/lei1276 2018-06-17 20:44
> eurostat/man/cut_to_classes.Rd
> -rwxr-xr-x lei/lei1133 2018-06-17 20:44 eurostat/man/dic_order.Rd
> -rwxr-xr-x lei/lei 788 2018-06-17 20:44 eurostat/man/eu_countries.Rd
> -rw-r--r-- lei/lei 826 2018-08-28 18:44
> eurostat/man/eurostat_geodata_60_2016.Rd
> -rwxr-xr-x lei/lei 805 2018-06-17 20:44
> eurostat/man/eurostat-package.Rd
> -rwxr-xr-x lei/lei1220 2018-06-17 20:44
> eurostat/man/eurotime2date.Rd
> -rwxr-xr-x lei/lei 978 2018-06-17 20:44 eurostat/man/eurotime2num.Rd
> -rwxr-xr-x lei/lei1169 2018-06-17 20:44
> eurostat/man/get_eurostat_dic.Rd
> -rwxr-xr-x lei/lei2321 2018-08-28 18:44
> eurostat/man/get_eurostat_geospatial.Rd
> -rwxr-xr-x lei/lei2675 2018-08-28 18:44
> eurostat/man/get_eurostat_json.Rd
> -rwxr-xr-x lei/lei1172 2018-06-17 20:44
> eurostat/man/get_eurostat_raw.Rd
> -rwxr-xr-x lei/lei1191 2018-06-17 20:44
> eurostat/man/get_eurostat_toc.Rd
> -rwxr-xr-x lei/lei6351 2018-08-28 18:44 eurostat/man/get_eurostat.Rd
> -rwxr-xr-x lei/lei 783 2018-06-17 20:44
> eurostat/man/harmonize_country_code.Rd
> -rwxr-xr-x lei/lei2518 2018-06-17 20:44
> eurostat/man/label_eurostat.Rd
> -rwxr-xr-x lei/lei2015 2018-06-17 20:44
> eurostat/man/search_eurostat.Rd
> -rwxr-xr-x lei/lei 453 2018-06-17 20:44
> eurostat/man/set_eurostat_toc.Rd
> -rwxr-xr-x lei/lei 344 2018-06-17 20:44 eurostat/man/tgs00026.Rd
> -rwxr-xr-x lei/lei1696 2018-06-17 20:44
> eurostat/man/tidy_eurostat.Rd
> -rwxr-xr-x lei/lei1273 2018-09-20 13:21 eurostat/NAMESPACE
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/R/
> -rwxr-xr-x lei/lei1140 2018-06-17 20:44
> eurostat/R/clean_eurostat_cache.R
> -rwxr-xr-x lei/lei3191 2018-06-17 20:44 eurostat/R/cut_to_classes.R
> -rwxr-xr-x lei/lei 638 2018-06-17 20:44 eurostat/R/data_countries.R
> -rwxr-xr-x lei/lei4571 2018-08-28 18:44 euro

Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Ralf Stubner
On 9/20/18 1:43 AM, Carl Boettiger wrote:
> For a well-tested C algorithm, based on my reading of Lemire, the unbiased
> "algorithm 3" in https://arxiv.org/abs/1805.10941 is part already of the C
> standard library in OpenBSD and macOS (as arc4random_uniform), and in the
> GNU standard library.  Lemire also provides C++ code in the appendix of his
> piece for both this and the faster "nearly divisionless" algorithm.
> 
> It would be excellent if any R core members were interested in considering
> bindings to these algorithms as a patch, or might express expectations for
> how that patch would have to operate (e.g. re Duncan's comment about
> non-integer arguments to sample size).  Otherwise, an R package binding
> seems like a good starting point, but I'm not the right volunteer.
It is difficult to do this in a package, since R does not provide access
to the random bits generated by the RNG. Only a float in (0,1) is
available via unif_rand(). However, if one is willing to use an external
RNG, it is of course possible. After reading about Lemire's work [1], I
had planned to integrate such an unbiased sampling scheme into the dqrng
package, which I have now started. [2]

Using Duncan's example, the results look much better:

> library(dqrng)
> m <- (2/5)*2^32
> y <- dqsample(m, 100, replace = TRUE)
> table(y %% 2)

 0  1
500252 499748

Currently I am taking the other interpretation of "truncated":

> table(dqsample(2.5, 100, replace = TRUE))

 0  1
499894 500106

I will adjust this to whatever is decided for base R.


However, there is currently neither long vector nor weighted sampling
support. And the performance without replacement is quite bad compared
to R's algorithm with hashing.

cheerio
ralf

[1] via http://www.pcg-random.org/posts/bounded-rands.html
[2] https://github.com/daqana/dqrng/tree/feature/sample

-- 
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: ralf.stub...@daqana.com

Sitz: Potsdam
Register: AG Potsdam HRB 27966 P
Ust.-IdNr.: DE300072622
Geschäftsführer: Prof. Dr. Dr. Karl-Kuno Kunze



signature.asc
Description: OpenPGP digital signature
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] future time stamps warning

2018-09-20 Thread Emil Bode

On Thu, Sep 20, 2018 at 11:46 AM Leo Lahti  wrote:
>
> Time stamps are correct and my system time is correct.

How is your timezone set?
When I look at your github I see as timestamp for DESCRIPTION today, 1:25 PM 
GMT+2. (and as I'm writing this, it's 1:12 PM GMT+2)
GMT+2 is CEST, if I see your mail-adress I guess you're in Finland, EEST, GMT+3.


>
> I am now tried to use Sys.setFileTime() to update time stamps as proposed.
> This does not help.
>
> The windows and debian builds give different reports on the time stamp
> issue.
> 
https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.8_20180920_122655/Windows/00check.log
> 
https://win-builder.r-project.org/incoming_pretest/eurostat_3.2.8_20180920_122655/Debian/00check.log
>
> I attached the time stamp listing below.

Well, maybe it is an inaccurate system clock on your machine, or on
CRAN, but nevertheless,
for this submission, you can just set the time stamps to a time in the past.

Gabor

> Leo
>
>
>
> lei@kone:~/Rpackages/eurostat/inst/extras$ tar zvtf eurostat_3.2.8.tar.gz
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/build/
> -rw-r--r-- lei/lei 301 2018-09-20 13:23 
eurostat/build/vignette.rds
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/data/
> -rw-r--r-- lei/lei  98 2018-09-20 13:23 eurostat/data/datalist
> -rwxr-xr-x lei/lei 340 2018-06-17 20:44
> eurostat/data/ea_countries.rda
> -rwxr-xr-x lei/lei 214 2018-06-17 20:44
> eurostat/data/efta_countries.rda
> -rwxr-xr-x lei/lei 252 2018-06-17 20:44
> eurostat/data/eu_candidate_countries.rda
> -rwxr-xr-x lei/lei 431 2018-06-17 20:44
> eurostat/data/eu_countries.rda
> -rw-r--r-- lei/lei 3651661 2018-08-28 18:44
> eurostat/data/eurostat_geodata_60_2016.rda
> -rwxr-xr-x lei/lei5596 2018-06-17 20:44 eurostat/data/tgs00026.rda
> -rwxr-xr-x lei/lei1447 2018-09-20 13:23 eurostat/DESCRIPTION
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/inst/
> -rwxr-xr-x lei/lei1060 2018-06-17 20:44 eurostat/inst/CITATION
> drwxr-xr-x lei/lei   0 2018-09-20 13:23 eurostat/inst/doc/
> -rw-r--r-- lei/lei6361 2018-09-20 13:22
> eurostat/inst/doc/blogposts.html
> -rwxr-xr-x lei/lei 518 2018-06-17 20:44
> eurostat/inst/doc/blogposts.Rmd
> -rw-r--r-- lei/lei6455 2018-09-20 13:22
> eurostat/inst/doc/cheatsheet.html
> -rwxr-xr-x lei/lei 610 2018-06-17 20:44
> eurostat/inst/doc/cheatsheet.Rmd
> -rw-r--r-- lei/lei  246912 2018-09-20 01:43
> eurostat/inst/doc/eurostat_tutorial.pdf
> -rw-r--r-- lei/lei   10565 2018-09-20 13:23
> eurostat/inst/doc/eurostat_tutorial.R
> -rwxr-xr-x lei/lei   17177 2018-09-20 00:07
> eurostat/inst/doc/eurostat_tutorial.Rmd
> -rw-r--r-- lei/lei6834 2018-09-20 13:23
> eurostat/inst/doc/publications.html
> -rwxr-xr-x lei/lei 942 2018-06-17 20:44
> eurostat/inst/doc/publications.Rmd
> -rwxr-xr-x lei/lei  92 2018-06-17 20:44 eurostat/LICENSE
> drwxr-xr-x lei/lei   0 2018-08-28 18:44 eurostat/man/
> -rwxr-xr-x lei/lei 697 2018-06-17 20:44
> eurostat/man/clean_eurostat_cache.Rd
> -rwxr-xr-x lei/lei 395 2018-06-17 20:44
> eurostat/man/convert_time_col.Rd
> -rwxr-xr-x lei/lei1276 2018-06-17 20:44
> eurostat/man/cut_to_classes.Rd
> -rwxr-xr-x lei/lei1133 2018-06-17 20:44 eurostat/man/dic_order.Rd
> -rwxr-xr-x lei/lei 788 2018-06-17 20:44 
eurostat/man/eu_countries.Rd
> -rw-r--r-- lei/lei 826 2018-08-28 18:44
> eurostat/man/eurostat_geodata_60_2016.Rd
> -rwxr-xr-x lei/lei 805 2018-06-17 20:44
> eurostat/man/eurostat-package.Rd
> -rwxr-xr-x lei/lei1220 2018-06-17 20:44
> eurostat/man/eurotime2date.Rd
> -rwxr-xr-x lei/lei 978 2018-06-17 20:44 
eurostat/man/eurotime2num.Rd
> -rwxr-xr-x lei/lei1169 2018-06-17 20:44
> eurostat/man/get_eurostat_dic.Rd
> -rwxr-xr-x lei/lei2321 2018-08-28 18:44
> eurostat/man/get_eurostat_geospatial.Rd
> -rwxr-xr-x lei/lei2675 2018-08-28 18:44
> eurostat/man/get_eurostat_json.Rd
> -rwxr-xr-x lei/lei1172 2018-06-17 20:44
> eurostat/man/get_eurostat_raw.Rd
> -rwxr-xr-x lei/lei1191 2018-06-17 20:44
> eurostat/man/get_eurostat_toc.Rd
> -rwxr-xr-x lei/lei6351 2018-08-28 18:44 
eurostat/man/get_eurostat.Rd
> -rwxr-xr-x lei/lei 783 2018-06-17 20:44
> eurostat/man/harmonize_country_code.Rd
> -rwxr-xr-x lei/lei2518 2018-06-17 20:44
> eurostat/man/label_eurostat.Rd
> -rwxr-xr-x lei/lei2015 2018-06-17 20:44
> eurostat/man/search_eurostat.Rd
> -rwxr-xr-x lei/lei 453

Re: [Rd] future time stamps warning

2018-09-20 Thread Jari Oksanen
Could this be a timezone issue (setting the timezone in local computer 
and communicating this to CRAN): when I look at the email in my computer 
I see:



On Thu, Sep 20, 2018 at 11:46 AM Leo Lahti  wrote:

-rwxr-xr-x lei/lei1447 2018-09-20 13:23 eurostat/DESCRIPTION


Which seems to claim that eurostats/DESCRIPTION was nearly three hours 
younger than the email. This clearly was in the future back then.


If so, waiting a couple of hours before submission could help, and there 
should be an optimal solution, too (i.e., CRAN and you communicate the 
timezone or both use the same like UTC).


Cheers, Jari Oksanen

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Radford Neal
> From: Duncan Murdoch 

> Let's try it:
> 
>  > m <- (2/5)*2^32
>  > m > 2^31
> [1] FALSE
>  > x <- sample(m, 100, replace = TRUE)
>  > table(x %% 2)
> 
>   0  1
> 399850 600150
> 
> Since m is an even number, the true proportions of evens and odds should 
> be exactly 0.5.  That's some pretty strong evidence of the bug in the 
> generator. 


It seems to be a recently-introduced bug.  Here's output with R-2.15.1:

  R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
  Copyright (C) 2012 The R Foundation for Statistical Computing
  ISBN 3-900051-07-0
  Platform: x86_64-unknown-linux-gnu (64-bit)
  
  R is free software and comes with ABSOLUTELY NO WARRANTY.
  You are welcome to redistribute it under certain conditions.
  Type 'license()' or 'licence()' for distribution details.
  
  R is a collaborative project with many contributors.
  Type 'contributors()' for more information and
  'citation()' on how to cite R or R packages in publications.
  
  Type 'demo()' for some demos, 'help()' for on-line help, or
  'help.start()' for an HTML browser interface to help.
  Type 'q()' to quit R.
  
  > set.seed(123)
  > m <- (2/5)*2^32
  > m > 2^31
  [1] FALSE
  > x <- sample(m, 100, replace = TRUE)
  > table(x %% 2)
  
   0  1 
  499412 500588 
  
So I doubt that this has anything to do with bias from using 32-bit
random values.

   Radford Neal

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] segfault issue with parallel::mclapply and download.file() on Mac OS X

2018-09-20 Thread Seth Russell
Thanks for the warning about fork without exec(). A co-worker of mine, also
on Mac, ran the sample code and got an error about that exact problem.

Thanks also for the pointer to try curl::multi_add() or download.file()
with a vector of files.

My actual use case includes downloading the files and then untar() for
analysis of files contained in the tar.gz file. I'm currently parallelizing
both the download and untar operation and found that using a parallel form
of lapply resulted in 4x - 8x improvement depending on hardware, network
latency, etc. I'll see how much of that improvement can be attributed to
I/O multiplexing for the downloading portion using your recommendations.

Seth

Trimmed reply from Gábor Csárdi :

>
> Fork without exec is not supported by macOS, basically any calls to
> system libraries might crash. (Ie. not just HTTP-related calls.) For
> HTTP calls I have seen errors, crashes, and sometimes it works.
> Depends on the combination of libcurl version, macOS version and
> probably luck.
>
> It usually (always?) works on Linux, but I would not rely on that, either.
>
> So, yes, this is a known issue.
>
> Creating new processes to perform HTTP in parallel is very often bad
> practice, actually. Whenever you can, use I/O multiplexing instead,
> since the main R process is not doing anything, anyway, just waiting
> for the data to come in. So you don't need more processes, you need
> parallel I/O. Take a look at the curl::multi_add() etc. functions.
>
> Btw. download.file() can actually download files in parallel if the
> liburl method is used, just give it a list of URLs in a character
> vector. This API is very restricted, though, so I suggest to look at
> the curl package.
>
>
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Duncan Murdoch

On 20/09/2018 6:59 AM, Ralf Stubner wrote:

On 9/20/18 1:43 AM, Carl Boettiger wrote:

For a well-tested C algorithm, based on my reading of Lemire, the unbiased
"algorithm 3" in https://arxiv.org/abs/1805.10941 is part already of the C
standard library in OpenBSD and macOS (as arc4random_uniform), and in the
GNU standard library.  Lemire also provides C++ code in the appendix of his
piece for both this and the faster "nearly divisionless" algorithm.

It would be excellent if any R core members were interested in considering
bindings to these algorithms as a patch, or might express expectations for
how that patch would have to operate (e.g. re Duncan's comment about
non-integer arguments to sample size).  Otherwise, an R package binding
seems like a good starting point, but I'm not the right volunteer.

It is difficult to do this in a package, since R does not provide access
to the random bits generated by the RNG. Only a float in (0,1) is
available via unif_rand(). 


I believe it is safe to multiply the unif_rand() value by 2^32, and take 
the whole number part as an unsigned 32 bit integer.  Depending on the 
RNG in use, that will give at least 25 random bits.  (The low order bits 
are the questionable ones.  25 is just a guess, not a guarantee.)


However, if one is willing to use an external

RNG, it is of course possible. After reading about Lemire's work [1], I
had planned to integrate such an unbiased sampling scheme into the dqrng
package, which I have now started. [2]

Using Duncan's example, the results look much better:


library(dqrng)
m <- (2/5)*2^32
y <- dqsample(m, 100, replace = TRUE)
table(y %% 2)


  0  1
500252 499748


Another useful diagnostic is

  plot(density(y[y %% 2 == 0]))

Obviously that should give a more or less uniform density, but for 
values near m, the default sample() gives some nice pretty pictures of 
quite non-uniform densities.


By the way, there are actually quite a few examples of very large m 
besides m = (2/5)*2^32 where performance of sample() is noticeably bad. 
You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) 
* 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + 
3a)*2^32, etc.


So perhaps I'm starting to be convinced that the default sample() should 
be fixed.


Duncan Murdoch




Currently I am taking the other interpretation of "truncated":


table(dqsample(2.5, 100, replace = TRUE))


  0  1
499894 500106

I will adjust this to whatever is decided for base R.


However, there is currently neither long vector nor weighted sampling
support. And the performance without replacement is quite bad compared
to R's algorithm with hashing.

cheerio
ralf

[1] via http://www.pcg-random.org/posts/bounded-rands.html
[2] https://github.com/daqana/dqrng/tree/feature/sample



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] A different error in sample()

2018-09-20 Thread Martin Maechler
> Martin Maechler 
> on Thu, 20 Sep 2018 09:20:46 +0200 writes:

> Wolfgang Huber 
> on Thu, 20 Sep 2018 08:47:47 +0200 writes:

>> FWIW, I suspect this is related to the function
>> R_unif_index that was introduced in src/main/RNG.c around
>> revision 72356, or the way this function is used in
>> do_sample in src/main/random.c.

> Yes, it is just the use of 'dn' instead of 'n'
> - a one letter thinko I'd say.

> But *no*, it's much older than revision 72356; e.g., it's already in

> R version 3.0.0 (2013-04-03) -- "Masked Marvel"

> but not yet in

> R version 2.15.3 (2013-03-01) -- "Security Blanket"

> 

> Here, I clearly think we see a regression bug, and hopefully not
> one that should trigger often in practice...
> and  -- without any statistics about the consequences out in
> package space --
> I do think we should fix this in code and let the documentation
> become "great again" ;-)

We have agreed that this is simply a regression and should be
fixed without a change to the documenation.

Consequently, ~ 5 minutes ago

$ svn log -v -c75338

r75338 | maechler | 2018-09-20 17:38:46 +0200 (Thu, 20 Sep 2018) | 1 line
Changed paths:
   M /trunk/doc/NEWS.Rd
   M /trunk/src/main/random.c
   M /trunk/tests/reg-tests-1d.R

revert sample.int(, k, replace=TRUE) to sane pre_R-3.0.0 behaviour


This is now back to "correct" behaviour  in  "R-devel (>= 75338)"

(and, as Duncan Murdoch also said by choosing this thread's
 Subject, this is really a different issue than the  "Bias in R's")

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Duncan Murdoch

On 20/09/2018 11:01 AM, Radford Neal wrote:

From: Duncan Murdoch 



Let's try it:

  > m <- (2/5)*2^32
  > m > 2^31
[1] FALSE
  > x <- sample(m, 100, replace = TRUE)
  > table(x %% 2)

   0  1
399850 600150

Since m is an even number, the true proportions of evens and odds should
be exactly 0.5.  That's some pretty strong evidence of the bug in the
generator.



It seems to be a recently-introduced bug.  Here's output with R-2.15.1:

   R version 2.15.1 (2012-06-22) -- "Roasted Marshmallows"
   Copyright (C) 2012 The R Foundation for Statistical Computing
   ISBN 3-900051-07-0
   Platform: x86_64-unknown-linux-gnu (64-bit)
   
   R is free software and comes with ABSOLUTELY NO WARRANTY.

   You are welcome to redistribute it under certain conditions.
   Type 'license()' or 'licence()' for distribution details.
   
   R is a collaborative project with many contributors.

   Type 'contributors()' for more information and
   'citation()' on how to cite R or R packages in publications.
   
   Type 'demo()' for some demos, 'help()' for on-line help, or

   'help.start()' for an HTML browser interface to help.
   Type 'q()' to quit R.
   
   > set.seed(123)

   > m <- (2/5)*2^32
   > m > 2^31
   [1] FALSE
   > x <- sample(m, 100, replace = TRUE)
   > table(x %% 2)
   
0  1

   499412 500588
   
So I doubt that this has anything to do with bias from using 32-bit

random values.


There are two bugs, one recent one (allowing fractional m to slip 
through) and one old one (the bias).  The test above shows the bias with 
m+1, i.e. in R 2.15.1


> x <- sample(m + 1, 100, replace = TRUE)
> table(x %% 2)

 0  1
466739 533261


and you can see it in the original m with

  x <- sample(m, 100, replace = TRUE)
  plot(density(x[x %% 2 == 0]))

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Paul Gilbert

On 09/19/2018 10:03 AM, Ben Bolker wrote:
...

Balancing backward compatibility and correctness is a tough problem
here.  


I think improvements in the RNG is a situation where backward 
compatibility is not really going to be lost, because people can specify 
the old generator, they just will not get it by default. My opinion is 
that the default needs to generally be the best option available because 
too many people will be expecting that, or not know better, in which 
case that is what they should get.


There are only two small problems that occur to me:

1/ Researchers that want to have reproducible results (all I hope) need 
to be aware the change has happened. In theory they should have recorded 
the RNG they were using, along with the seed (and, BTW, the number of 
nodes if they generate with a parallel generator). If they have not done 
that then they can figure out the RNG from knowing what version of R 
they used. If they haven't recorded that then they can figure it out by 
some experimentation and knowing roughly when they did the research. If 
none of this works then the research probably should be lost.


As an exercise, researchers might also want to experiment with whether 
the new default qualitatively changes their results. That might lead to 
publishable research, so no one should complain.


2/ Package maintainers that have used the default RNG to generate tests 
may need to change their tests to specify the old generator, or modify 
results used for comparisons in the tests. Since package testing is 
usually for code checking rather than statistical results, not using the 
best available generator is not usually an issue.


Most of my own package testing already specifies the generator, lots 
uses "buggy Kinderman-Ramage" because tests were set up a long time ago. 
I will have to change package setRNG which warns when the default 
generator changes. (This warning is intentional because I was bitten 
badly by a small change in the S generator circa 1990.)




If this goes into base R, what's the best way to do it?  What was
the protocol for migrating away from the "buggy Kinderman-Ramage"
generator, back in the day?   (Version 1.7 was sometime between 2001 and
2004).


I think there may have been a change in R 0.99 too. At least my notes 
suggest that the code I changed for  R 1.7.0 had worked with the default 
generator from R 0.99 to 1.6.2.


I don't recall the protocol, I think it just happened and was announced 
in the NEWS. (Has this protocol changed?) The ramification for me was 
that I had to go through all of my packages' testing and change the name 
of the explicitly specified RNG to "buggy Kinderman-Ramage".


Perhaps there does need to be a protocol for testing before release. 
When my package setRNG fails then many of my other packages will also 
fail because they depend on it. This is a simple fix but reverse 
dependencies may make it look like lots of things are broken.


Paul Gilbert


   I couldn't find the exact commit in the GitHub mirror: this is related ...

https://github.com/wch/r-source/commit/7ad3044639fd1fe093c655e573fd1a67aa7f55f6#diff-dbcad570d4fb9b7005550ff630543b37



===
‘normal.kind’ can be ‘"Kinderman-Ramage"’, ‘"Buggy
  Kinderman-Ramage"’ (not for ‘set.seed’), ‘"Ahrens-Dieter"’,
  ‘"Box-Muller"’, ‘"Inversion"’ (the default), or ‘"user-supplied"’.
  (For inversion, see the reference in ‘qnorm’.)  The
  Kinderman-Ramage generator used in versions prior to 1.7.0 (now
  called ‘"Buggy"’) had several approximation errors and should only
  be used for reproduction of old results.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Gabe Becker
Hi all,

On Thu, Sep 20, 2018 at 9:30 AM, Paul Gilbert  wrote:
>
>
> There are only two small problems that occur to me:
>
> 1/ Researchers that want to have reproducible results (all I hope) need to
> be aware the change has happened. In theory they should have recorded the
> RNG they were using, along with the seed (and, BTW, the number of nodes if
> they generate with a parallel generator). If they have not done that then
> they can figure out the RNG from knowing what version of R they used. If
> they haven't recorded that then they can figure it out by some
> experimentation and knowing roughly when they did the research. If none of
> this works then the research probably should be lost.
>
> As an exercise, researchers might also want to experiment with whether the
> new default qualitatively changes their results. That might lead to
> publishable research, so no one should complain.
>

I was going to suggest helper/convenience functions for this but looking at
?RNG I see that someone has already put in RNGversion which *sets* the RNG
kind to what was used by default in an older version. I do wonder if there
is still value in a function that would *return* it, e.g. for comparisons.
Perhaps RNGversionstr?

Also, would R-core members be interested in a small patch to sessionInfo()
and print.sessionInfo() which makes it so that the current RNGkind is
captured and displayed (respectively) by the sessionInfo machinery? I can
prepare one if so.

Best,
~G

-- 
Gabriel Becker, Ph.D
Scientist
Bioinformatics and Computational Biology
Genentech Research

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Hervé Pagès

Hi,

Note that it wouldn't be the first time that sample() changes behavior
in a non-backward compatible way:

  https://stat.ethz.ch/pipermail/r-devel/2012-October/065049.html

Cheers,
H.


On 09/20/2018 08:15 AM, Duncan Murdoch wrote:

On 20/09/2018 6:59 AM, Ralf Stubner wrote:

On 9/20/18 1:43 AM, Carl Boettiger wrote:
For a well-tested C algorithm, based on my reading of Lemire, the 
unbiased
"algorithm 3" in 
https://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1805.10941&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=TtofIDsvWasBZGzOl9J0kBQnJMksr2Rg3u1l8CM5-qE&e= 
is part already of the C
standard library in OpenBSD and macOS (as arc4random_uniform), and in 
the
GNU standard library.  Lemire also provides C++ code in the appendix 
of his

piece for both this and the faster "nearly divisionless" algorithm.

It would be excellent if any R core members were interested in 
considering
bindings to these algorithms as a patch, or might express 
expectations for

how that patch would have to operate (e.g. re Duncan's comment about
non-integer arguments to sample size).  Otherwise, an R package binding
seems like a good starting point, but I'm not the right volunteer.

It is difficult to do this in a package, since R does not provide access
to the random bits generated by the RNG. Only a float in (0,1) is
available via unif_rand(). 


I believe it is safe to multiply the unif_rand() value by 2^32, and take 
the whole number part as an unsigned 32 bit integer.  Depending on the 
RNG in use, that will give at least 25 random bits.  (The low order bits 
are the questionable ones.  25 is just a guess, not a guarantee.)


However, if one is willing to use an external

RNG, it is of course possible. After reading about Lemire's work [1], I
had planned to integrate such an unbiased sampling scheme into the dqrng
package, which I have now started. [2]

Using Duncan's example, the results look much better:


library(dqrng)
m <- (2/5)*2^32
y <- dqsample(m, 100, replace = TRUE)
table(y %% 2)


  0  1
500252 499748


Another useful diagnostic is

   plot(density(y[y %% 2 == 0]))

Obviously that should give a more or less uniform density, but for 
values near m, the default sample() gives some nice pretty pictures of 
quite non-uniform densities.


By the way, there are actually quite a few examples of very large m 
besides m = (2/5)*2^32 where performance of sample() is noticeably bad. 
You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) 
* 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + 
3a)*2^32, etc.


So perhaps I'm starting to be convinced that the default sample() should 
be fixed.


Duncan Murdoch




Currently I am taking the other interpretation of "truncated":


table(dqsample(2.5, 100, replace = TRUE))


  0  1
499894 500106

I will adjust this to whatever is decided for base R.


However, there is currently neither long vector nor weighted sampling
support. And the performance without replacement is quite bad compared
to R's algorithm with hashing.

cheerio
ralf

[1] via 
https://urldefense.proofpoint.com/v2/url?u=http-3A__www.pcg-2Drandom.org_posts_bounded-2Drands.html&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=OlX-dzwoOeFlod3Gofa_1TQaZwmjsCH9C9v3lM5Y2rY&e= 

[2] 
https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_daqana_dqrng_tree_feature_sample&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=DNaSqRCy89Hvbg1G0SpyEL0kkr9_RqWXi9pTy75V32M&e= 





__
R-devel@r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=WOx4NyeYmWxpDG3tBRQ9-_Y3_7YAlKUKOP6gZLs0BrQ&e= 





__
R-devel@r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=WOx4NyeYmWxpDG3tBRQ9-_Y3_7YAlKUKOP6gZLs0BrQ&e= 



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Steve Grubb
Hello,

On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
> > On 9/20/18 1:43 AM, Carl Boettiger wrote:
> >> For a well-tested C algorithm, based on my reading of Lemire, the
> >> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part
> >> already of the C standard library in OpenBSD and macOS (as
> >> arc4random_uniform), and in the GNU standard library.  Lemire also
> >> provides C++ code in the appendix of his piece for both this and the
> >> faster "nearly divisionless" algorithm.
> >> 
> >> It would be excellent if any R core members were interested in
> >> considering bindings to these algorithms as a patch, or might express
> >> expectations for how that patch would have to operate (e.g. re Duncan's
> >> comment about non-integer arguments to sample size).  Otherwise, an R
> >> package binding seems like a good starting point, but I'm not the right
> >> volunteer.
> > 
> > It is difficult to do this in a package, since R does not provide access
> > to the random bits generated by the RNG. Only a float in (0,1) is
> > available via unif_rand().
> 
> I believe it is safe to multiply the unif_rand() value by 2^32, and take
> the whole number part as an unsigned 32 bit integer.  Depending on the
> RNG in use, that will give at least 25 random bits.  (The low order bits
> are the questionable ones.  25 is just a guess, not a guarantee.)
> 
> However, if one is willing to use an external
> 
> > RNG, it is of course possible. After reading about Lemire's work [1], I
> > had planned to integrate such an unbiased sampling scheme into the dqrng
> > package, which I have now started. [2]
> > 
> > Using Duncan's example, the results look much better:
> >> library(dqrng)
> >> m <- (2/5)*2^32
> >> y <- dqsample(m, 100, replace = TRUE)
> >> table(y %% 2)
> >> 
> >   0  1
> > 
> > 500252 499748
> 
> Another useful diagnostic is
> 
>plot(density(y[y %% 2 == 0]))
> 
> Obviously that should give a more or less uniform density, but for
> values near m, the default sample() gives some nice pretty pictures of
> quite non-uniform densities.
> 
> By the way, there are actually quite a few examples of very large m
> besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)
> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> 3a)*2^32, etc.
> 
> So perhaps I'm starting to be convinced that the default sample() should
> be fixed.

I find this discussion fascinating. I normally test random numbers in 
different languages every now and again using various methods. One simple 
check that I do is to use Michal Zalewski's method when he studied Strange 
Attractors and Initial TCP/IP Sequence Numbers:

http://lcamtuf.coredump.cx/newtcp/
https://pdfs.semanticscholar.org/
adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf

The technique works by mapping the dynamics of the generated numbers into a 
three-dimensional phase space. This is then plotted in a graph so that you 
can visually see if something odd is going on.

I used   runif(1, min = 0, max = 65535)  to get a set of numbers. This is 
the resulting plot that was generated from R's numbers using this technique:

http://people.redhat.com/sgrubb/files/r-random.jpg

And for comparison this was generated by collecting the same number of 
samples from the bash shell:

http://people.redhat.com/sgrubb/files/bash-random.jpg

The net result is that it shows some banding in the R generated random 
numbers where bash has uniform random numbers with no discernible pattern.

Best Regards,
-Steve

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-20 Thread Philip B. Stark
The same issue occurs in walker_ProbSampleReplace() in random.c, lines
386-387:

rU = unif_rand() * n;
k = (int) rU;

Cheers,
Philip

On Wed, Sep 19, 2018 at 3:08 PM Duncan Murdoch 
wrote:

> On 19/09/2018 5:57 PM, David Hugh-Jones wrote:
> >
> > It doesn't seem too hard to come up with plausible ways in which this
> > could give bad results. Suppose I sample rows from a large dataset,
> > maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g.
> > odd rows are males, even rows are females. Oops! Very non-representative
> > sample, bootstrap p values are garbage.
>
> That would only happen if your dataset was exactly 1717986918 elements
> in size. (And in fact, it will be less extreme than I posted:  I had x
> set to 1717986918.4, as described in another thread.  If you use an
> integer value you need a different pattern; add or subtract an element
> or two and the pattern needed to see a problem changes drastically.)
>
> But if you're sampling from a dataset of that exact size, then you
> should worry about this bug. Don't use sample().  Use the algorithm that
> Carl described.
>
> Duncan Murdoch
>
> >
> > David
> >
> > On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch  > > wrote:
> >
> > On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> >  > Hi Duncan--
> >  >
> >  >
> >
> > That is a mathematically true statement, but I suspect it is not very
> > relevant.  Pseudo-random number generators always have test functions
> > whose sample averages are quite different from the expectation under
> > the
> > true distribution.  Remember Von Neumann's "state of sin" quote.  The
> > bug in sample() just means it is easier to find such a function than
> it
> > would otherwise be.
> >
> > The practical question is whether such a function is likely to arise
> in
> > practice or not.
> >
> >   > Whether those correspond to commonly used statistics or not, I
> > have no
> >   > idea.
> >
> > I am pretty confident that this bug rarely matters.
> >
> >  > Regarding backwards compatibility: as a user, I'd rather the
> default
> >  > sample() do the best possible thing, and take an extra step to use
> >  > something like sample(..., legacy=TRUE) if I want to reproduce
> > old results.
> >
> > I suspect there's a good chance the bug I discovered today
> (non-integer
> > x values not being truncated) will be declared to be a feature, and
> the
> > documentation will be changed.  Then the rejection sampling approach
> > would need to be quite a bit more complicated.
> >
> > I think a documentation warning about the accuracy of sampling
> > probabilities would also be a sufficient fix here, and would be
> quite a
> > bit less trouble than changing the default sample().  But as I said
> in
> > my original post, a contribution of a function without this bug
> > would be
> > a nice addition.
> >
> > Duncan Murdoch
> >
> >  >
> >  > Regards,
> >  > Philip
> >  >
> >  > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
> > mailto:murdoch.dun...@gmail.com>
> >  >  > >> wrote:
> >  >
> >  > On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> >  >  > No, the 2nd call only happens when m > 2**31. Here's the
> code:
> >  >
> >  > Yes, you're right. Sorry!
> >  >
> >  > So the ratio really does come close to 2.  However, the
> > difference in
> >  > probabilities between outcomes is still at most 2^-32 when m
> > is less
> >  > than that cutoff.  That's not feasible to detect; the only
> > detectable
> >  > difference would happen if some event was constructed to hold
> an
> >  > abundance of outcomes with especially low (or especially high)
> >  > probability.
> >  >
> >  > As I said in my original post, it's probably not hard to
> > construct such
> >  > a thing, but as I've said more recently, it probably wouldn't
> > happen by
> >  > chance.  Here's one attempt to do it:
> >  >
> >  > Call the values from unif_rand() "the unif_rand() outcomes".
> > Call the
> >  > values from sample() the sample outcomes.
> >  >
> >  > It would be easiest to see the error if half of the sample()
> > outcomes
> >  > used two unif_rand() outcomes, and half used just one.  That
> > would mean
> >  > m should be (2/3) * 2^32, but that's too big and would
> > trigger the
> >  > other
> >  > version.
> >  >
> >  > So how about half use 2 unif_rands(), and half use 3?  That
> > means m =
> >  > (2/5) * 2^32 = 1717986918.  A good guess is that sample()
> > outcomes
> >  > would
> >  > alternate between the two possibilities, so our event could
> >