For a well-tested C algorithm, based on my reading of Lemire, the unbiased
"algorithm 3" in https://arxiv.org/abs/1805.10941 is already part of the C
standard library in OpenBSD and macOS (as arc4random_uniform), and in the
GNU standard library. Lemire also provides C++ code in the appendix of his
paper for both this and the faster "nearly divisionless" algorithm.
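To make the approach concrete, here is a rough sketch of the rejection
scheme that arc4random_uniform implements (a sketch only, not tested against
any of these libraries: random32() is a stand-in source so the example runs
on its own, and the name unbiased_below() is mine, not theirs; a real
binding would draw bits from R's configured RNG instead):

#include <stdint.h>
#include <stdio.h>

/* Stand-in 32-bit source so the sketch is self-contained (demo only);
 * an R binding would pull bits from the configured RNG instead. */
static uint32_t random32(void)
{
    static uint64_t s = 0x9e3779b97f4a7c15u;             /* fixed seed */
    s = s * 6364136223846793005u + 1442695040888963407u;
    return (uint32_t)(s >> 32);
}

/* Unbiased draw from {0, ..., m - 1} by rejection, in the style of
 * OpenBSD's arc4random_uniform(): 2^32 is generally not a multiple of m,
 * so the smallest (2^32 mod m) raw values are rejected, and every value
 * that survives maps onto each outcome exactly floor(2^32 / m) times. */
static uint32_t unbiased_below(uint32_t m)
{
    uint32_t r, min;

    if (m < 2)
        return 0;

    min = (uint32_t)-m % m;             /* == 2^32 mod m */

    do {   /* expected iterations < 2, even in the worst case m = 2^31 + 1 */
        r = random32();
    } while (r < min);

    return r % m;
}

int main(void)
{
    for (int i = 0; i < 5; i++)
        printf("%u\n", 1u + unbiased_below(6u));   /* five unbiased die rolls */
    return 0;
}

As I read the paper, the "nearly divisionless" variant yields the same
distribution but replaces the modulo on the common path with a 64-bit
multiply, so a division is only paid in the rare rejection case.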
It would be excellent if any R core members were interested in considering
bindings to these algorithms as a patch, or might express expectations for
how that patch would have to operate (e.g. re Duncan's comment about
non-integer arguments to sample size). Otherwise, an R package binding seems
like a good starting point, but I'm not the right volunteer.

On Wed, Sep 19, 2018 at 4:22 PM Ben Bolker <bbol...@gmail.com> wrote:
>
> A quick point of order here: arguing with Duncan in this forum is
> helpful to expose ideas, but probably neither side will convince the
> other; eventually, if you want this adopted in core R, you'll need to
> convince an R-core member to pursue this fix.
>
> In the meantime, a good, well-tested implementation in a
> user-contributed package (presumably written in C for speed) would be
> enormously useful. Volunteers ... ?
>
> On 2018-09-19 04:19 PM, Duncan Murdoch wrote:
> > On 19/09/2018 3:52 PM, Philip B. Stark wrote:
> >> Hi Duncan--
> >>
> >> Nice simulation!
> >>
> >> The absolute difference in probabilities is small, but the maximum
> >> relative difference grows from something negligible to almost 2 as m
> >> approaches 2**31.
> >>
> >> Because the L_1 distance between the uniform distribution on
> >> {1, ..., m} and what you actually get is large, there have to be test
> >> functions whose expectations are quite different under the two
> >> distributions.
> >
> > That is a mathematically true statement, but I suspect it is not very
> > relevant. Pseudo-random number generators always have test functions
> > whose sample averages are quite different from the expectation under
> > the true distribution. Remember Von Neumann's "state of sin" quote.
> > The bug in sample() just means it is easier to find such a function
> > than it would otherwise be.
> >
> > The practical question is whether such a function is likely to arise
> > in practice or not.
> >
> >> Whether those correspond to commonly used statistics or not, I have
> >> no idea.
> >
> > I am pretty confident that this bug rarely matters.
> >
> >> Regarding backwards compatibility: as a user, I'd rather the default
> >> sample() do the best possible thing, and take an extra step to use
> >> something like sample(..., legacy=TRUE) if I want to reproduce old
> >> results.
> >
> > I suspect there's a good chance the bug I discovered today
> > (non-integer x values not being truncated) will be declared to be a
> > feature, and the documentation will be changed. Then the rejection
> > sampling approach would need to be quite a bit more complicated.
> >
> > I think a documentation warning about the accuracy of sampling
> > probabilities would also be a sufficient fix here, and would be quite
> > a bit less trouble than changing the default sample(). But as I said
> > in my original post, a contribution of a function without this bug
> > would be a nice addition.
> >
> > Duncan Murdoch
> >
> >> Regards,
> >> Philip
> >>
> >> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
> >> <murdoch.dun...@gmail.com> wrote:
> >>
> >> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
> >> > No, the 2nd call only happens when m > 2**31. Here's the code:
> >>
> >> Yes, you're right. Sorry!
> >>
> >> So the ratio really does come close to 2. However, the difference in
> >> probabilities between outcomes is still at most 2^-32 when m is less
> >> than that cutoff.
> >> That's not feasible to detect; the only detectable difference would
> >> happen if some event was constructed to hold an abundance of outcomes
> >> with especially low (or especially high) probability.
> >>
> >> As I said in my original post, it's probably not hard to construct
> >> such a thing, but as I've said more recently, it probably wouldn't
> >> happen by chance. Here's one attempt to do it:
> >>
> >> Call the values from unif_rand() "the unif_rand() outcomes". Call the
> >> values from sample() the sample outcomes.
> >>
> >> It would be easiest to see the error if half of the sample() outcomes
> >> used two unif_rand() outcomes, and half used just one. That would
> >> mean m should be (2/3) * 2^32, but that's too big and would trigger
> >> the other version.
> >>
> >> So how about half use 2 unif_rands(), and half use 3? That means
> >> m = (2/5) * 2^32 = 1717986918. A good guess is that sample() outcomes
> >> would alternate between the two possibilities, so our event could be
> >> even versus odd outcomes.
> >>
> >> Let's try it:
> >>
> >> > m <- (2/5)*2^32
> >> > m > 2^31
> >> [1] FALSE
> >> > x <- sample(m, 1000000, replace = TRUE)
> >> > table(x %% 2)
> >>
> >>      0      1
> >> 399850 600150
> >>
> >> Since m is an even number, the true proportions of evens and odds
> >> should be exactly 0.5. That's some pretty strong evidence of the bug
> >> in the generator. (Note that the ratio of the observed probabilities
> >> is about 1.5, so I may not be the first person to have done this.)
> >>
> >> I'm still not convinced that there has ever been a simulation run
> >> with detectable bias compared to Monte Carlo error unless it (like
> >> this one) was designed specifically to show the problem.
> >>
> >> Duncan Murdoch
> >>
> >> >
> >> > (RNG.c, lines 793ff)
> >> >
> >> > double R_unif_index(double dn)
> >> > {
> >> >     double cut = INT_MAX;
> >> >
> >> >     switch(RNG_kind) {
> >> >     case KNUTH_TAOCP:
> >> >     case USER_UNIF:
> >> >     case KNUTH_TAOCP2:
> >> >         cut = 33554431.0; /* 2^25 - 1 */
> >> >         break;
> >> >     default:
> >> >         break;
> >> >     }
> >> >
> >> >     double u = dn > cut ? ru() : unif_rand();
> >> >     return floor(dn * u);
> >> > }
> >> >
> >> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
> >> > <murdoch.dun...@gmail.com> wrote:
> >> >
> >> > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
> >> > > The 53 bits only encode at most 2^{32} possible values, because
> >> > > the source of the float is the output of a 32-bit PRNG (the
> >> > > obsolete version of MT). 53 bits isn't the relevant number here.
> >> >
> >> > No, two calls to unif_rand() are used. There are two 32 bit values,
> >> > but some of the bits are thrown away.
> >> >
> >> > Duncan Murdoch
> >> >
> >> > >
> >> > > The selection ratios can get close to 2. Computer scientists
> >> > > don't do it the way R does, for a reason.
> >> > >
> >> > > Regards,
> >> > > Philip
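As a quick cross-check of the roughly 3:2 split in the table above, and of
the selection-ratio point: if unif_rand() is treated as i / 2^32 for
i = 0, ..., 2^32 - 1 (ignoring the fixup at the boundaries), then
floor(m * unif_rand()) hands each outcome either floor(2^32 / m) or
ceiling(2^32 / m) of the 2^32 grid points. A small sketch of just that
arithmetic (the helper name grid_ratio is only for this sketch, and it
claims nothing about R's internals beyond what is quoted above):

#include <math.h>
#include <stdio.h>

/* Bound on the ratio of selection probabilities when floor(m * u) is
 * applied to a generator with 2^32 equally likely values of u. */
static void grid_ratio(double m)
{
    const double g = 4294967296.0;    /* 2^32 */
    double lo = floor(g / m);         /* fewest grid points per outcome */
    double hi = ceil(g / m);          /* most grid points per outcome */
    printf("m = %.0f: %.0f or %.0f grid points per outcome, ratio %.3f\n",
           m, lo, hi, hi / lo);
}

int main(void)
{
    grid_ratio(1717986918.0);  /* the m in the demo above: 2 vs 3, ratio 1.5 */
    grid_ratio(2147483647.0);  /* m just below 2^31: also 2 vs 3, ratio 1.5 */
    return 0;
}

For the m used in the demo, essentially half the outcomes sit on 2 grid
points and half on 3, which matches the 0.4 / 0.6 split in the table.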
> >> > >
> >> > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
> >> > > <murdoch.dun...@gmail.com> wrote:
> >> > >
> >> > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
> >> > > > On Wed, Sep 19, 2018 at 14:43, Duncan Murdoch
> >> > > > (<murdoch.dun...@gmail.com>) wrote:
> >> > > >>
> >> > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
> >> > > >>> Dear list,
> >> > > >>>
> >> > > >>> It looks to me that R samples random integers using an
> >> > > >>> intuitive but biased algorithm by going from a random number
> >> > > >>> on [0,1) from the PRNG to a random integer, e.g.
> >> > > >>>
> >> > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
> >> > > >>>
> >> > > >>> Many other languages use various rejection sampling
> >> > > >>> approaches which provide an unbiased method for sampling,
> >> > > >>> such as in Go, python, and others described here:
> >> > > >>> https://arxiv.org/abs/1805.10941 (I believe the biased
> >> > > >>> algorithm currently used in R is also described there). I'm
> >> > > >>> not an expert in this area, but does it make sense for R to
> >> > > >>> adopt one of the unbiased random sample algorithms outlined
> >> > > >>> there and used in other languages? Would a patch providing
> >> > > >>> such an algorithm be welcome? What concerns would need to be
> >> > > >>> addressed first?
> >> > > >>>
> >> > > >>> I believe this issue was also raised by Kellie & Philip in
> >> > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html,
> >> > > >>> and more recently in
> >> > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
> >> > > >>> pointing to the python implementation for comparison:
> >> > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
> >> > > >>
> >> > > >> I think the analyses are correct, but I doubt if a change to
> >> > > >> the default is likely to be accepted as it would make it more
> >> > > >> difficult to reproduce older results.
> >> > > >>
> >> > > >> On the other hand, a contribution of a new function like
> >> > > >> sample() but not suffering from the bias would be good. The
> >> > > >> normal way to make such a contribution is in a user
> >> > > >> contributed package.
> >> > > >>
> >> > > >> By the way, R code illustrating the bias is probably not very
> >> > > >> hard to put together.
> >> > > >> I believe the bias manifests itself in sample() producing
> >> > > >> values with two different probabilities (instead of all equal
> >> > > >> probabilities). Those may differ by as much as one part in
> >> > > >> 2^32. It's
> >> > > >
> >> > > > According to Kellie and Philip, in the attachment of the thread
> >> > > > referenced by Carl, "The maximum ratio of selection
> >> > > > probabilities can get as large as 1.5 if n is just below 2^31".
> >> > >
> >> > > Sorry, I didn't write very well. I meant to say that the
> >> > > difference in probabilities would be 2^-32, not that the ratio of
> >> > > probabilities would be 1 + 2^-32.
> >> > >
> >> > > By the way, I don't see the statement giving the ratio as 1.5,
> >> > > but maybe I was looking in the wrong place. In Theorem 1 of the
> >> > > paper I was looking in the ratio was "1 + m 2^{-w + 1}". In that
> >> > > formula m is your n. If it is near 2^31, R uses w = 57 random
> >> > > bits, so the ratio would be very, very small (one part in 2^25).
> >> > >
> >> > > The worst case for R would happen when m is just below 2^25,
> >> > > where w is at least 31 for the default generators. In that case
> >> > > the ratio could be about 1.03.
> >> > >
> >> > > Duncan Murdoch
> >>
> >> --
> >> Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
> >> Professor, Department of Statistics |
> >> University of California
> >> Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
> >> @philipbstark

--
http://carlboettiger.info

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel