Hello, On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote: > On 20/09/2018 6:59 AM, Ralf Stubner wrote: > > On 9/20/18 1:43 AM, Carl Boettiger wrote: > >> For a well-tested C algorithm, based on my reading of Lemire, the > >> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part > >> already of the C standard library in OpenBSD and macOS (as > >> arc4random_uniform), and in the GNU standard library. Lemire also > >> provides C++ code in the appendix of his piece for both this and the > >> faster "nearly divisionless" algorithm. > >> > >> It would be excellent if any R core members were interested in > >> considering bindings to these algorithms as a patch, or might express > >> expectations for how that patch would have to operate (e.g. re Duncan's > >> comment about non-integer arguments to sample size). Otherwise, an R > >> package binding seems like a good starting point, but I'm not the right > >> volunteer. > > > > It is difficult to do this in a package, since R does not provide access > > to the random bits generated by the RNG. Only a float in (0,1) is > > available via unif_rand(). > > I believe it is safe to multiply the unif_rand() value by 2^32, and take > the whole number part as an unsigned 32 bit integer. Depending on the > RNG in use, that will give at least 25 random bits. (The low order bits > are the questionable ones. 25 is just a guess, not a guarantee.) > > However, if one is willing to use an external > > > RNG, it is of course possible. After reading about Lemire's work [1], I > > had planned to integrate such an unbiased sampling scheme into the dqrng > > package, which I have now started. [2] > > > > Using Duncan's example, the results look much better: > >> library(dqrng) > >> m <- (2/5)*2^32 > >> y <- dqsample(m, 1000000, replace = TRUE) > >> table(y %% 2) > >> > > 0 1 > > > > 500252 499748 > > Another useful diagnostic is > > plot(density(y[y %% 2 == 0])) > > Obviously that should give a more or less uniform density, but for > values near m, the default sample() gives some nice pretty pictures of > quite non-uniform densities. > > By the way, there are actually quite a few examples of very large m > besides m = (2/5)*2^32 where performance of sample() is noticeably bad. > You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a) > * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 + > 3a)*2^32, etc. > > So perhaps I'm starting to be convinced that the default sample() should > be fixed.
I find this discussion fascinating. I normally test random numbers in different languages every now and again using various methods. One simple check that I do is to use Michal Zalewski's method when he studied Strange Attractors and Initial TCP/IP Sequence Numbers: http://lcamtuf.coredump.cx/newtcp/ https://pdfs.semanticscholar.org/ adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf The technique works by mapping the dynamics of the generated numbers into a three-dimensional phase space. This is then plotted in a graph so that you can visually see if something odd is going on. I used runif(10000, min = 0, max = 65535) to get a set of numbers. This is the resulting plot that was generated from R's numbers using this technique: http://people.redhat.com/sgrubb/files/r-random.jpg And for comparison this was generated by collecting the same number of samples from the bash shell: http://people.redhat.com/sgrubb/files/bash-random.jpg The net result is that it shows some banding in the R generated random numbers where bash has uniform random numbers with no discernible pattern. Best Regards, -Steve ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel