A quick point of order here: arguing with Duncan in this forum is helpful to expose ideas, but probably neither side will convince the other; eventually, if you want this adopted in core R, you'll need to convince an R-core member to pursue this fix.
In the meantime, a good, well-tested implementation in a user-contributed package (presumably written in C for speed) would be enormously useful. Volunteers ... ?

On 2018-09-19 04:19 PM, Duncan Murdoch wrote:
> On 19/09/2018 3:52 PM, Philip B. Stark wrote:
>> Hi Duncan--
>>
>> Nice simulation!
>>
>> The absolute difference in probabilities is small, but the maximum
>> relative difference grows from something negligible to almost 2 as m
>> approaches 2**31.
>>
>> Because the L_1 distance between the uniform distribution on {1, ...,
>> m} and what you actually get is large, there have to be test functions
>> whose expectations are quite different under the two distributions.
>
> That is a mathematically true statement, but I suspect it is not very
> relevant.  Pseudo-random number generators always have test functions
> whose sample averages are quite different from the expectation under the
> true distribution.  Remember von Neumann's "state of sin" quote.  The
> bug in sample() just means it is easier to find such a function than it
> would otherwise be.
>
> The practical question is whether such a function is likely to arise in
> practice or not.
>
>> Whether those correspond to commonly used statistics or not, I have no
>> idea.
>
> I am pretty confident that this bug rarely matters.
>
>> Regarding backwards compatibility: as a user, I'd rather the default
>> sample() do the best possible thing, and take an extra step to use
>> something like sample(..., legacy=TRUE) if I want to reproduce old
>> results.
>
> I suspect there's a good chance the bug I discovered today (non-integer
> x values not being truncated) will be declared to be a feature, and the
> documentation will be changed.  Then the rejection sampling approach
> would need to be quite a bit more complicated.
>
> I think a documentation warning about the accuracy of sampling
> probabilities would also be a sufficient fix here, and would be quite a
> bit less trouble than changing the default sample().
> But as I said in
> my original post, a contribution of a function without this bug would be
> a nice addition.
>
> Duncan Murdoch
>
>> Regards,
>> Philip
>>
>> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch
>> <murdoch.dun...@gmail.com> wrote:
>>
>> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>> > No, the 2nd call only happens when m > 2**31. Here's the code:
>>
>> Yes, you're right.  Sorry!
>>
>> So the ratio really does come close to 2.  However, the difference in
>> probabilities between outcomes is still at most 2^-32 when m is less
>> than that cutoff.  That's not feasible to detect; the only detectable
>> difference would happen if some event was constructed to hold an
>> abundance of outcomes with especially low (or especially high)
>> probability.
>>
>> As I said in my original post, it's probably not hard to construct such
>> a thing, but as I've said more recently, it probably wouldn't happen by
>> chance.  Here's one attempt to do it:
>>
>> Call the values from unif_rand() "the unif_rand() outcomes".  Call the
>> values from sample() the sample outcomes.
>>
>> It would be easiest to see the error if half of the sample() outcomes
>> used two unif_rand() outcomes, and half used just one.  That would mean
>> m should be (2/3) * 2^32, but that's too big and would trigger the other
>> version.
>>
>> So how about half use 2 unif_rands(), and half use 3?  That means m =
>> (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes would
>> alternate between the two possibilities, so our event could be even
>> versus odd outcomes.
>>
>> Let's try it:
>>
>> > m <- (2/5)*2^32
>> > m > 2^31
>> [1] FALSE
>> > x <- sample(m, 1000000, replace = TRUE)
>> > table(x %% 2)
>>
>>      0      1
>> 399850 600150
>>
>> Since m is an even number, the true proportions of evens and odds should
>> be exactly 0.5.  That's some pretty strong evidence of the bug in the
>> generator.
>> (Note that the ratio of the observed probabilities is about
>> 1.5, so I may not be the first person to have done this.)
>>
>> I'm still not convinced that there has ever been a simulation run with
>> detectable bias compared to Monte Carlo error unless it (like this one)
>> was designed specifically to show the problem.
>>
>> Duncan Murdoch
>>
>> > (RNG.c, lines 793ff)
>> >
>> > double R_unif_index(double dn)
>> > {
>> >     double cut = INT_MAX;
>> >
>> >     switch(RNG_kind) {
>> >     case KNUTH_TAOCP:
>> >     case USER_UNIF:
>> >     case KNUTH_TAOCP2:
>> >         cut = 33554431.0; /* 2^25 - 1 */
>> >         break;
>> >     default:
>> >         break;
>> >     }
>> >
>> >     double u = dn > cut ? ru() : unif_rand();
>> >     return floor(dn * u);
>> > }
>> >
>> > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
>> > <murdoch.dun...@gmail.com> wrote:
>> >
>> > On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>> > > The 53 bits only encode at most 2^{32} possible values, because the
>> > > source of the float is the output of a 32-bit PRNG (the obsolete
>> > > version of MT). 53 bits isn't the relevant number here.
>> >
>> > No, two calls to unif_rand() are used.  There are two 32-bit values,
>> > but some of the bits are thrown away.
>> >
>> > Duncan Murdoch
>> >
>> > > The selection ratios can get close to 2.  Computer scientists don't
>> > > do it the way R does, for a reason.
>> > >
>> > > Regards,
>> > > Philip
>> > >
>> > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
>> > > <murdoch.dun...@gmail.com> wrote:
>> > >
>> > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>> > > > On Wed, Sep 19, 2018 at 14:43, Duncan Murdoch
>> > > > (<murdoch.dun...@gmail.com>) wrote:
>> > > >>
>> > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>> > > >>> Dear list,
>> > > >>>
>> > > >>> It looks to me that R samples random integers using an intuitive
>> > > >>> but biased algorithm by going from a random number on [0,1) from
>> > > >>> the PRNG to a random integer, e.g.
>> > > >>>
>> > > >>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>> > > >>>
>> > > >>> Many other languages use various rejection sampling approaches
>> > > >>> which provide an unbiased method for sampling, such as in Go,
>> > > >>> Python, and others described here: https://arxiv.org/abs/1805.10941
>> > > >>> (I believe the biased algorithm currently used in R is also
>> > > >>> described there).  I'm not an expert in this area, but does it
>> > > >>> make sense for R to adopt one of the unbiased random sample
>> > > >>> algorithms outlined there and used in other languages?  Would a
>> > > >>> patch providing such an algorithm be welcome?  What concerns
>> > > >>> would need to be addressed first?
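The rejection approach Carl mentions is simple to sketch. A hypothetical Python version of the classic reject-then-reduce scheme (the function name is illustrative; this is neither R's code nor the optimized variant from the arXiv paper):

```python
import random

def unbiased_index(m, getbits=random.getrandbits):
    """Draw uniformly from {0, ..., m-1} by rejection on 32-bit words.

    Every accepted 32-bit draw k lies below the largest multiple of m
    that fits in 2^32, so each residue class k % m contains exactly the
    same number of accepted values and all outcomes are equally likely.
    """
    if not 0 < m <= 2**32:
        raise ValueError("m must be in (0, 2^32]")
    limit = (2**32 // m) * m   # largest multiple of m not exceeding 2^32
    while True:
        k = getbits(32)        # one 32-bit draw from the source PRNG
        if k < limit:
            return k % m
```

In the worst case (m just above 2^31) nearly half the draws are rejected, but the accepted ones are exactly uniform; that cost-versus-exactness trade-off is what the thread is debating.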
>> > > >>>
>> > > >>> I believe this issue was also raised by Kellie & Philip in
>> > > >>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
>> > > >>> more recently in
>> > > >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
>> > > >>> pointing to the python implementation for comparison:
>> > > >>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>> > > >>
>> > > >> I think the analyses are correct, but I doubt if a change to the
>> > > >> default is likely to be accepted as it would make it more
>> > > >> difficult to reproduce older results.
>> > > >>
>> > > >> On the other hand, a contribution of a new function like sample()
>> > > >> but not suffering from the bias would be good.  The normal way to
>> > > >> make such a contribution is in a user contributed package.
>> > > >>
>> > > >> By the way, R code illustrating the bias is probably not very
>> > > >> hard to put together.  I believe the bias manifests itself in
>> > > >> sample() producing values with two different probabilities
>> > > >> (instead of all equal probabilities).  Those may differ by as
>> > > >> much as one part in 2^32.  It's
>> > > >
>> > > > According to Kellie and Philip, in the attachment of the thread
>> > > > referenced by Carl, "The maximum ratio of selection probabilities
>> > > > can get as large as 1.5 if n is just below 2^31".
>> > >
>> > > Sorry, I didn't write very well.  I meant to say that the difference
>> > > in probabilities would be 2^-32, not that the ratio of probabilities
>> > > would be 1 + 2^-32.
>> > >
>> > > By the way, I don't see the statement giving the ratio as 1.5, but
>> > > maybe I was looking in the wrong place.  In Theorem 1 of the paper I
>> > > was looking at, the ratio was "1 + m 2^{-w + 1}".  In that formula m
>> > > is your n.  If it is near 2^31, R uses w = 57 random bits, so the
>> > > ratio would be very, very small (one part in 2^25).
>> > >
>> > > The worst case for R would happen when m is just below 2^25, where w
>> > > is at least 31 for the default generators.  In that case the ratio
>> > > could be about 1.03.
>> > >
>> > > Duncan Murdoch
>> > >
>> > > --
>> > > Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
>> > > Professor, Department of Statistics |
>> > > University of California
>> > > Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
>> > > @philipbstark

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel