One more thing, apropos this:

> I'm still not convinced that there has ever been a simulation run with
> detectable bias compared to Monte Carlo error unless it (like this one)
> was designed specifically to show the problem.
I often use random permutations to simulate p-values to calibrate
permutation tests. If I'm trying to simulate the probability of a
low-probability event, this could matter a lot.

Best wishes,
Philip

On Wed, Sep 19, 2018 at 12:52 PM Philip B. Stark <st...@stat.berkeley.edu> wrote:

> Hi Duncan--
>
> Nice simulation!
>
> The absolute difference in probabilities is small, but the maximum
> relative difference grows from something negligible to almost 2 as m
> approaches 2**31.
>
> Because the L_1 distance between the uniform distribution on {1, ..., m}
> and what you actually get is large, there have to be test functions whose
> expectations are quite different under the two distributions. Whether
> those correspond to commonly used statistics or not, I have no idea.
>
> Regarding backwards compatibility: as a user, I'd rather the default
> sample() do the best possible thing, and take an extra step to use
> something like sample(..., legacy=TRUE) if I want to reproduce old
> results.
>
> Regards,
> Philip
>
> On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>
>> On 19/09/2018 12:23 PM, Philip B. Stark wrote:
>> > No, the 2nd call only happens when m > 2**31. Here's the code:
>>
>> Yes, you're right. Sorry!
>>
>> So the ratio really does come close to 2. However, the difference in
>> probabilities between outcomes is still at most 2^-32 when m is less
>> than that cutoff. That's not feasible to detect; the only detectable
>> difference would happen if some event was constructed to hold an
>> abundance of outcomes with especially low (or especially high)
>> probability.
>>
>> As I said in my original post, it's probably not hard to construct such
>> a thing, but as I've said more recently, it probably wouldn't happen by
>> chance. Here's one attempt to do it:
>>
>> Call the values from unif_rand() "the unif_rand() outcomes". Call the
>> values from sample() "the sample() outcomes".
>>
>> It would be easiest to see the error if half of the sample() outcomes
>> used two unif_rand() outcomes, and half used just one. That would mean
>> m should be (2/3) * 2^32, but that's too big and would trigger the
>> other version.
>>
>> So how about half use 2 unif_rand() outcomes, and half use 3? That
>> means m = (2/5) * 2^32 = 1717986918. A good guess is that sample()
>> outcomes would alternate between the two possibilities, so our event
>> could be even versus odd outcomes.
>>
>> Let's try it:
>>
>> > m <- (2/5)*2^32
>> > m > 2^31
>> [1] FALSE
>> > x <- sample(m, 1000000, replace = TRUE)
>> > table(x %% 2)
>>
>>      0      1
>> 399850 600150
>>
>> Since m is an even number, the true proportions of evens and odds
>> should be exactly 0.5. That's some pretty strong evidence of the bug
>> in the generator. (Note that the ratio of the observed probabilities
>> is about 1.5, so I may not be the first person to have done this.)
>>
>> I'm still not convinced that there has ever been a simulation run with
>> detectable bias compared to Monte Carlo error unless it (like this
>> one) was designed specifically to show the problem.
>>
>> Duncan Murdoch
>>
>>> (RNG.c, lines 793ff)
>>>
>>> double R_unif_index(double dn)
>>> {
>>>     double cut = INT_MAX;
>>>
>>>     switch(RNG_kind) {
>>>     case KNUTH_TAOCP:
>>>     case USER_UNIF:
>>>     case KNUTH_TAOCP2:
>>>         cut = 33554431.0; /* 2^25 - 1 */
>>>         break;
>>>     default:
>>>         break;
>>>     }
>>>
>>>     double u = dn > cut ? ru() : unif_rand();
>>>     return floor(dn * u);
>>> }
>>>
>>> On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>>
>>>> On 19/09/2018 12:09 PM, Philip B. Stark wrote:
>>>>
>>>>> The 53 bits only encode at most 2^{32} possible values, because the
>>>>> source of the float is the output of a 32-bit PRNG (the obsolete
>>>>> version of MT). 53 bits isn't the relevant number here.
>>>>
>>>> No, two calls to unif_rand() are used.
>>>> There are two 32-bit values, but some of the bits are thrown away.
>>>>
>>>> Duncan Murdoch
>>>>
>>>>> The selection ratios can get close to 2. Computer scientists don't
>>>>> do it the way R does, for a reason.
>>>>>
>>>>> Regards,
>>>>> Philip
>>>>>
>>>>> On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch <murdoch.dun...@gmail.com> wrote:
>>>>>
>>>>>> On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
>>>>>>
>>>>>>> On Wed, 19 Sep 2018 at 14:43, Duncan Murdoch
>>>>>>> (<murdoch.dun...@gmail.com>) wrote:
>>>>>>>
>>>>>>>> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
>>>>>>>>
>>>>>>>>> Dear list,
>>>>>>>>>
>>>>>>>>> It looks to me that R samples random integers using an intuitive
>>>>>>>>> but biased algorithm by going from a random number on [0,1) from
>>>>>>>>> the PRNG to a random integer, e.g.
>>>>>>>>>
>>>>>>>>> https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
>>>>>>>>>
>>>>>>>>> Many other languages use various rejection sampling approaches
>>>>>>>>> which provide an unbiased method for sampling, such as in Go,
>>>>>>>>> Python, and others described here:
>>>>>>>>> https://arxiv.org/abs/1805.10941 (I believe the biased algorithm
>>>>>>>>> currently used in R is also described there). I'm not an expert
>>>>>>>>> in this area, but does it make sense for R to adopt one of the
>>>>>>>>> unbiased random sampling algorithms outlined there and used in
>>>>>>>>> other languages? Would a patch providing such an algorithm be
>>>>>>>>> welcome? What concerns would need to be addressed first?
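The rejection-sampling approach Carl refers to is easy to sketch. The following is an illustrative Python version (the function name, the 32-bit word size, and the use of Python's random module are assumptions for the example, not any library's actual implementation):

```python
import random

def sample_below(m, rng=random.Random(), bits=32):
    """Draw an integer uniformly from 0..m-1 by rejection sampling.

    Rejects source words >= the largest multiple of m not exceeding
    2^bits, so every residue class mod m keeps exactly the same number
    of accepted source values -- hence no bias.
    """
    span = 1 << bits
    limit = span - (span % m)  # largest multiple of m that is <= 2^bits
    while True:
        r = rng.getrandbits(bits)
        if r < limit:          # accept: r lies in the unbiased region
            return r % m
        # otherwise reject and redraw; for m <= 2^31 this happens
        # with probability at most 1/2 per draw
```

Compared with floor(m * u), each outcome here is hit by exactly limit/m source words, so the selection probabilities are equal by construction; the price is an occasional redraw.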
>>>>>>>>> I believe this issue was also raised by Kellie & Philip in
>>>>>>>>> http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
>>>>>>>>> more recently in
>>>>>>>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf,
>>>>>>>>> pointing to the Python implementation for comparison:
>>>>>>>>> https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
>>>>>>>>
>>>>>>>> I think the analyses are correct, but I doubt if a change to the
>>>>>>>> default is likely to be accepted, as it would make it more
>>>>>>>> difficult to reproduce older results.
>>>>>>>>
>>>>>>>> On the other hand, a contribution of a new function like sample()
>>>>>>>> but not suffering from the bias would be good. The normal way to
>>>>>>>> make such a contribution is in a user contributed package.
>>>>>>>>
>>>>>>>> By the way, R code illustrating the bias is probably not very
>>>>>>>> hard to put together. I believe the bias manifests itself in
>>>>>>>> sample() producing values with two different probabilities
>>>>>>>> (instead of all equal probabilities). Those may differ by as much
>>>>>>>> as one part in 2^32. It's
>>>>>>>
>>>>>>> According to Kellie and Philip, in the attachment of the thread
>>>>>>> referenced by Carl, "The maximum ratio of selection probabilities
>>>>>>> can get as large as 1.5 if n is just below 2^31".
>>>>>>
>>>>>> Sorry, I didn't write very well. I meant to say that the difference
>>>>>> in probabilities would be 2^-32, not that the ratio of
>>>>>> probabilities would be 1 + 2^-32.
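Both quantities under discussion — the absolute difference (at most 2^-32) and the selection-probability ratio (about 1.5 in the cases above) — can be computed exactly for the floor(m * u) scheme, if one idealizes the generator output as u = j/2^32 for a uniform 32-bit integer j. This is a simplification for illustration, not R's exact code path:

```python
def selection_probs(m, w=32):
    """Exact outcome probabilities for floor(m * j / 2^w), j uniform on 0..2^w - 1.

    Each of the m outcomes receives either floor(2^w / m) or ceil(2^w / m)
    of the 2^w equally likely source values.
    """
    span = 1 << w
    lo, hi = span // m, -(-span // m)  # floor and ceiling of 2^w / m
    return lo / span, hi / span

# Duncan's m = (2/5) * 2^32: the gap between the two probabilities is
# exactly 2^-32, yet their ratio is 1.5
p_lo, p_hi = selection_probs(1717986918)
assert p_hi - p_lo == 2**-32 and p_hi / p_lo == 1.5
```

The same holds for m just below 2^31 (2^32/m is slightly above 2, so the preimage counts are again 2 and 3), matching the quoted 1.5 figure; when m divides 2^w exactly, the two probabilities coincide and there is no bias.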
>>>>>> By the way, I don't see the statement giving the ratio as 1.5, but
>>>>>> maybe I was looking in the wrong place. In Theorem 1 of the paper I
>>>>>> was looking at, the ratio is "1 + m 2^{-w + 1}". In that formula m
>>>>>> is your n. If it is near 2^31, R uses w = 57 random bits, so the
>>>>>> deviation of the ratio from 1 would be very, very small (one part
>>>>>> in 2^25).
>>>>>>
>>>>>> The worst case for R would happen when m is just below 2^25, where
>>>>>> w is at least 31 for the default generators. In that case the ratio
>>>>>> could be about 1.03.
>>>>>>
>>>>>> Duncan Murdoch

--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |
Professor, Department of Statistics | University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark |
@philipbstark

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel