Re: [Rd] Bias in R's random integers?

Hervé Pagès Thu, 20 Sep 2018 10:57:21 -0700

Hi,

Note that it wouldn't be the first time that sample() changes behavior
in a non-backward compatible way:


  https://stat.ethz.ch/pipermail/r-devel/2012-October/065049.html

Cheers,
H.


On 09/20/2018 08:15 AM, Duncan Murdoch wrote:

On 20/09/2018 6:59 AM, Ralf Stubner wrote:
On 9/20/18 1:43 AM, Carl Boettiger wrote:
For a well-tested C algorithm, based on my reading of Lemire, theunbiased"algorithm 3" inhttps://urldefense.proofpoint.com/v2/url?u=https-3A__arxiv.org_abs_1805.10941&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=TtofIDsvWasBZGzOl9J0kBQnJMksr2Rg3u1l8CM5-qE&e=is part already of the Cstandard library in OpenBSD and macOS (as arc4random_uniform), and intheGNU standard library. Lemire also provides C++ code in the appendixof his
piece for both this and the faster "nearly divisionless" algorithm.
It would be excellent if any R core members were interested inconsideringbindings to these algorithms as a patch, or might expressexpectations for
how that patch would have to operate (e.g. re Duncan's comment about
non-integer arguments to sample size).  Otherwise, an R package binding
seems like a good starting point, but I'm not the right volunteer.
It is difficult to do this in a package, since R does not provide access
to the random bits generated by the RNG. Only a float in (0,1) is
available via unif_rand().
I believe it is safe to multiply the unif_rand() value by 2^32, and takethe whole number part as an unsigned 32 bit integer. Depending on theRNG in use, that will give at least 25 random bits. (The low order bitsare the questionable ones. 25 is just a guess, not a guarantee.)
However, if one is willing to use an external
RNG, it is of course possible. After reading about Lemire's work [1], I
had planned to integrate such an unbiased sampling scheme into the dqrng
package, which I have now started. [2]

Using Duncan's example, the results look much better:
library(dqrng)
m <- (2/5)*2^32
y <- dqsample(m, 1000000, replace = TRUE)
table(y %% 2)
      0      1
500252 499748
Another useful diagnostic is

   plot(density(y[y %% 2 == 0]))
Obviously that should give a more or less uniform density, but forvalues near m, the default sample() gives some nice pretty pictures ofquite non-uniform densities.
By the way, there are actually quite a few examples of very large mbesides m = (2/5)*2^32 where performance of sample() is noticeably bad.You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)* 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +3a)*2^32, etc.
So perhaps I'm starting to be convinced that the default sample() shouldbe fixed.
Duncan Murdoch
Currently I am taking the other interpretation of "truncated":
table(dqsample(2.5, 1000000, replace = TRUE))
      0      1
499894 500106

I will adjust this to whatever is decided for base R.


However, there is currently neither long vector nor weighted sampling
support. And the performance without replacement is quite bad compared
to R's algorithm with hashing.

cheerio
ralf
[1] viahttps://urldefense.proofpoint.com/v2/url?u=http-3A__www.pcg-2Drandom.org_posts_bounded-2Drands.html&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=OlX-dzwoOeFlod3Gofa_1TQaZwmjsCH9C9v3lM5Y2rY&e=[2]https://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_daqana_dqrng_tree_feature_sample&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=DNaSqRCy89Hvbg1G0SpyEL0kkr9_RqWXi9pTy75V32M&e=
______________________________________________
R-devel@r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=WOx4NyeYmWxpDG3tBRQ9-_Y3_7YAlKUKOP6gZLs0BrQ&e=
______________________________________________
R-devel@r-project.org mailing list
https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_r-2Ddevel&d=DwICAg&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=tVt5ARiRzaOYr7BgOc0nC_hDq80BUkAUKNwcowN5W1k&s=WOx4NyeYmWxpDG3tBRQ9-_Y3_7YAlKUKOP6gZLs0BrQ&e=


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Bias in R's random integers?

Reply via email to