The same issue occurs in walker_ProbSampleReplace() in random.c, lines 386-387:
rU = unif_rand() * n; k = (int) rU; Cheers, Philip On Wed, Sep 19, 2018 at 3:08 PM Duncan Murdoch <murdoch.dun...@gmail.com> wrote: > On 19/09/2018 5:57 PM, David Hugh-Jones wrote: > > > > It doesn't seem too hard to come up with plausible ways in which this > > could give bad results. Suppose I sample rows from a large dataset, > > maybe for bootstrapping. Suppose the rows are non-randomly ordered, e.g. > > odd rows are males, even rows are females. Oops! Very non-representative > > sample, bootstrap p values are garbage. > > That would only happen if your dataset was exactly 1717986918 elements > in size. (And in fact, it will be less extreme than I posted: I had x > set to 1717986918.4, as described in another thread. If you use an > integer value you need a different pattern; add or subtract an element > or two and the pattern needed to see a problem changes drastically.) > > But if you're sampling from a dataset of that exact size, then you > should worry about this bug. Don't use sample(). Use the algorithm that > Carl described. > > Duncan Murdoch > > > > > David > > > > On Wed, 19 Sep 2018 at 21:20, Duncan Murdoch <murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>> wrote: > > > > On 19/09/2018 3:52 PM, Philip B. Stark wrote: > > > Hi Duncan-- > > > > > > > > > > That is a mathematically true statement, but I suspect it is not very > > relevant. Pseudo-random number generators always have test functions > > whose sample averages are quite different from the expectation under > > the > > true distribution. Remember Von Neumann's "state of sin" quote. The > > bug in sample() just means it is easier to find such a function than > it > > would otherwise be. > > > > The practical question is whether such a function is likely to arise > in > > practice or not. > > > > > Whether those correspond to commonly used statistics or not, I > > have no > > > idea. > > > > I am pretty confident that this bug rarely matters. > > > > > Regarding backwards compatibility: as a user, I'd rather the > default > > > sample() do the best possible thing, and take an extra step to use > > > something like sample(..., legacy=TRUE) if I want to reproduce > > old results. > > > > I suspect there's a good chance the bug I discovered today > (non-integer > > x values not being truncated) will be declared to be a feature, and > the > > documentation will be changed. Then the rejection sampling approach > > would need to be quite a bit more complicated. > > > > I think a documentation warning about the accuracy of sampling > > probabilities would also be a sufficient fix here, and would be > quite a > > bit less trouble than changing the default sample(). But as I said > in > > my original post, a contribution of a function without this bug > > would be > > a nice addition. > > > > Duncan Murdoch > > > > > > > > Regards, > > > Philip > > > > > > On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch > > <murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>>> wrote: > > > > > > On 19/09/2018 12:23 PM, Philip B. Stark wrote: > > > > No, the 2nd call only happens when m > 2**31. Here's the > code: > > > > > > Yes, you're right. Sorry! > > > > > > So the ratio really does come close to 2. However, the > > difference in > > > probabilities between outcomes is still at most 2^-32 when m > > is less > > > than that cutoff. That's not feasible to detect; the only > > detectable > > > difference would happen if some event was constructed to hold > an > > > abundance of outcomes with especially low (or especially high) > > > probability. > > > > > > As I said in my original post, it's probably not hard to > > construct such > > > a thing, but as I've said more recently, it probably wouldn't > > happen by > > > chance. Here's one attempt to do it: > > > > > > Call the values from unif_rand() "the unif_rand() outcomes". > > Call the > > > values from sample() the sample outcomes. > > > > > > It would be easiest to see the error if half of the sample() > > outcomes > > > used two unif_rand() outcomes, and half used just one. That > > would mean > > > m should be (2/3) * 2^32, but that's too big and would > > trigger the > > > other > > > version. > > > > > > So how about half use 2 unif_rands(), and half use 3? That > > means m = > > > (2/5) * 2^32 = 1717986918. A good guess is that sample() > > outcomes > > > would > > > alternate between the two possibilities, so our event could > > be even > > > versus odd outcomes. > > > > > > Let's try it: > > > > > > > m <- (2/5)*2^32 > > > > m > 2^31 > > > [1] FALSE > > > > x <- sample(m, 1000000, replace = TRUE) > > > > table(x %% 2) > > > > > > 0 1 > > > 399850 600150 > > > > > > Since m is an even number, the true proportions of evens and > odds > > > should > > > be exactly 0.5. That's some pretty strong evidence of the > > bug in the > > > generator. (Note that the ratio of the observed > > probabilities is about > > > 1.5, so I may not be the first person to have done this.) > > > > > > I'm still not convinced that there has ever been a simulation > > run with > > > detectable bias compared to Monte Carlo error unless it (like > > this one) > > > was designed specifically to show the problem. > > > > > > Duncan Murdoch > > > > > > > > > > > (RNG.c, lines 793ff) > > > > > > > > double R_unif_index(double dn) > > > > { > > > > double cut = INT_MAX; > > > > > > > > switch(RNG_kind) { > > > > case KNUTH_TAOCP: > > > > case USER_UNIF: > > > > case KNUTH_TAOCP2: > > > > cut = 33554431.0; /* 2^25 - 1 */ > > > > break; > > > > default: > > > > break; > > > > } > > > > > > > > double u = dn > cut ? ru() : unif_rand(); > > > > return floor(dn * u); > > > > } > > > > > > > > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch > > > <murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com> > > <mailto:murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com>> > > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>>>> wrote: > > > > > > > > On 19/09/2018 12:09 PM, Philip B. Stark wrote: > > > > > The 53 bits only encode at most 2^{32} possible > values, > > > because the > > > > > source of the float is the output of a 32-bit PRNG > (the > > > obsolete > > > > version > > > > > of MT). 53 bits isn't the relevant number here. > > > > > > > > No, two calls to unif_rand() are used. There are two > > 32 bit > > > values, > > > > but > > > > some of the bits are thrown away. > > > > > > > > Duncan Murdoch > > > > > > > > > > > > > > The selection ratios can get close to 2. Computer > > scientists > > > > don't do it > > > > > the way R does, for a reason. > > > > > > > > > > Regards, > > > > > Philip > > > > > > > > > > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch > > > > <murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>>> > > > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>> > > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>>>>> wrote: > > > > > > > > > > On 19/09/2018 9:09 AM, Iñaki Ucar wrote: > > > > > > El mié., 19 sept. 2018 a las 14:43, Duncan > > Murdoch > > > > > > (<murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>> > > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>>> <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>> > > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com> > > > <mailto:murdoch.dun...@gmail.com > > <mailto:murdoch.dun...@gmail.com>>>>>) > > > > > escribió: > > > > > >> > > > > > >> On 18/09/2018 5:46 PM, Carl Boettiger wrote: > > > > > >>> Dear list, > > > > > >>> > > > > > >>> It looks to me that R samples random > integers > > > using an > > > > > intuitive but biased > > > > > >>> algorithm by going from a random number on > > [0,1) from > > > > the PRNG > > > > > to a random > > > > > >>> integer, e.g. > > > > > >>> > > > > > > > > > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808 > > > > > >>> > > > > > >>> Many other languages use various rejection > > sampling > > > > approaches > > > > > which > > > > > >>> provide an unbiased method for sampling, > > such as > > > in Go, > > > > python, > > > > > and others > > > > > >>> described here: > > https://arxiv.org/abs/1805.10941 (I > > > > believe the > > > > > biased > > > > > >>> algorithm currently used in R is also > > described > > > there). I'm > > > > > not an expert > > > > > >>> in this area, but does it make sense for > > the R to > > > adopt > > > > one of > > > > > the unbiased > > > > > >>> random sample algorithms outlined there > > and used > > > in other > > > > > languages? Would > > > > > >>> a patch providing such an algorithm be > > welcome? What > > > > concerns > > > > > would need to > > > > > >>> be addressed first? > > > > > >>> > > > > > >>> I believe this issue was also raised by > > Killie & > > > Philip in > > > > > >>> > > > > > > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and > > > > > more > > > > > >>> recently in > > > > > >>> > > > > > > > > > > > > > > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf > > < > https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf> > > > > > < > https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf> > > > > > > > > > < > https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf> > > > > > > > > > > > > > > < > https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>, > > > > > >>> pointing to the python implementation for > > comparison: > > > > > >>> > > > > > > > > > > > > > > > https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265 > > > > > >> > > > > > >> I think the analyses are correct, but I > > doubt if a > > > change > > > > to the > > > > > default > > > > > >> is likely to be accepted as it would make > > it more > > > > difficult to > > > > > reproduce > > > > > >> older results. > > > > > >> > > > > > >> On the other hand, a contribution of a new > > > function like > > > > > sample() but > > > > > >> not suffering from the bias would be good. > The > > > normal way to > > > > > make such > > > > > >> a contribution is in a user contributed > > package. > > > > > >> > > > > > >> By the way, R code illustrating the bias is > > > probably not very > > > > > hard to > > > > > >> put together. I believe the bias manifests > > itself in > > > > sample() > > > > > producing > > > > > >> values with two different probabilities > > (instead > > > of all equal > > > > > >> probabilities). Those may differ by as > much as > > > one part in > > > > > 2^32. It's > > > > > > > > > > > > According to Kellie and Philip, in the > > attachment > > > of the > > > > thread > > > > > > referenced by Carl, "The maximum ratio of > > selection > > > > probabilities can > > > > > > get as large as 1.5 if n is just below 2^31". > > > > > > > > > > Sorry, I didn't write very well. I meant to > > say that the > > > > difference in > > > > > probabilities would be 2^-32, not that the > ratio of > > > > probabilities would > > > > > be 1 + 2^-32. > > > > > > > > > > By the way, I don't see the statement giving > > the ratio as > > > > 1.5, but > > > > > maybe > > > > > I was looking in the wrong place. In Theorem 1 > > of the > > > paper > > > > I was > > > > > looking in the ratio was "1 + m 2^{-w + 1}". > > In that > > > formula > > > > m is your > > > > > n. If it is near 2^31, R uses w = 57 random > > bits, so > > > the ratio > > > > > would be > > > > > very, very small (one part in 2^25). > > > > > > > > > > The worst case for R would happen when m is > > just below > > > > 2^25, where w > > > > > is at least 31 for the default generators. In > that > > > case the > > > > ratio > > > > > could > > > > > be about 1.03. > > > > > > > > > > Duncan Murdoch > > > > > > > > > > > > > > > > > > > > -- > > > > > Philip B. Stark | Associate Dean, Mathematical and > > Physical > > > > Sciences | > > > > > Professor, Department of Statistics | > > > > > University of California > > > > > Berkeley, CA 94720-3860 | 510-394-5077 | > > > > statistics.berkeley.edu/~stark > > <http://statistics.berkeley.edu/%7Estark> > > > <http://statistics.berkeley.edu/%7Estark> > > > > <http://statistics.berkeley.edu/%7Estark> > > > > > <http://statistics.berkeley.edu/%7Estark> | > > > > > @philipbstark > > > > > > > > > > > > > > > > > > > > > -- > > > > Philip B. Stark | Associate Dean, Mathematical and Physical > > > Sciences | > > > > Professor, Department of Statistics | > > > > University of California > > > > Berkeley, CA 94720-3860 | 510-394-5077 | > > > statistics.berkeley.edu/~stark > > <http://statistics.berkeley.edu/%7Estark> > > > <http://statistics.berkeley.edu/%7Estark> > > > > <http://statistics.berkeley.edu/%7Estark> | > > > > @philipbstark > > > > > > > > > > > > > > > > -- > > > Philip B. Stark | Associate Dean, Mathematical and Physical > > Sciences | > > > Professor, Department of Statistics | > > > University of California > > > Berkeley, CA 94720-3860 | 510-394-5077 | > > statistics.berkeley.edu/~stark > > <http://statistics.berkeley.edu/%7Estark> > > > <http://statistics.berkeley.edu/%7Estark> | > > > @philipbstark > > > > > > > ______________________________________________ > > R-devel@r-project.org <mailto:R-devel@r-project.org> mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > > -- > > Sent from Gmail Mobile > > -- Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | Professor, Department of Statistics | University of California Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark | @philipbstark [[alternative HTML version deleted]] ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel