Re: [Rd] Bias in R's random integers?

Duncan Murdoch Wed, 19 Sep 2018 09:51:02 -0700

On 19/09/2018 12:23 PM, Philip B. Stark wrote:

No, the 2nd call only happens when m > 2**31. Here's the code:


Yes, you're right. Sorry!

So the ratio really does come close to 2. However, the difference inprobabilities between outcomes is still at most 2^-32 when m is lessthan that cutoff. That's not feasible to detect; the only detectabledifference would happen if some event was constructed to hold anabundance of outcomes with especially low (or especially high) probability.

As I said in my original post, it's probably not hard to construct sucha thing, but as I've said more recently, it probably wouldn't happen bychance. Here's one attempt to do it:

Call the values from unif_rand() "the unif_rand() outcomes". Call thevalues from sample() the sample outcomes.

It would be easiest to see the error if half of the sample() outcomesused two unif_rand() outcomes, and half used just one. That would meanm should be (2/3) * 2^32, but that's too big and would trigger the otherversion.

So how about half use 2 unif_rands(), and half use 3? That means m =(2/5) * 2^32 = 1717986918. A good guess is that sample() outcomes wouldalternate between the two possibilities, so our event could be evenversus odd outcomes.


Let's try it:

> m <- (2/5)*2^32
> m > 2^31
[1] FALSE
> x <- sample(m, 1000000, replace = TRUE)
> table(x %% 2)

     0      1
399850 600150

Since m is an even number, the true proportions of evens and odds shouldbe exactly 0.5. That's some pretty strong evidence of the bug in thegenerator. (Note that the ratio of the observed probabilities is about1.5, so I may not be the first person to have done this.)

I'm still not convinced that there has ever been a simulation run withdetectable bias compared to Monte Carlo error unless it (like this one)was designed specifically to show the problem.


Duncan Murdoch


(RNG.c, lines 793ff)

double R_unif_index(double dn)
{
     double cut = INT_MAX;

     switch(RNG_kind) {
     case KNUTH_TAOCP:
     case USER_UNIF:
     case KNUTH_TAOCP2:
cut = 33554431.0; /* 2^25 - 1 */
break;
     default:
break;
    }

     double u = dn > cut ? ru() : unif_rand();
     return floor(dn * u);
}

On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch <murdoch.dun...@gmail.com<mailto:murdoch.dun...@gmail.com>> wrote:


    On 19/09/2018 12:09 PM, Philip B. Stark wrote:
     > The 53 bits only encode at most 2^{32} possible values, because the
     > source of the float is the output of a 32-bit PRNG (the obsolete
    version
     > of MT). 53 bits isn't the relevant number here.

    No, two calls to unif_rand() are used.  There are two 32 bit values,
    but
    some of the bits are thrown away.

    Duncan Murdoch

     >
     > The selection ratios can get close to 2. Computer scientists
    don't do it
     > the way R does, for a reason.
     >
     > Regards,
     > Philip
     >
     > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
    <murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com>
     > <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>>> wrote:
     >
     >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
     >      > El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
     >      > (<murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com> <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>>>)
     >     escribió:
     >      >>
     >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
     >      >>> Dear list,
     >      >>>
     >      >>> It looks to me that R samples random integers using an
     >     intuitive but biased
     >      >>> algorithm by going from a random number on [0,1) from
    the PRNG
     >     to a random
     >      >>> integer, e.g.
     >      >>>
     > https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
     >      >>>
     >      >>> Many other languages use various rejection sampling
    approaches
     >     which
     >      >>> provide an unbiased method for sampling, such as in Go,
    python,
     >     and others
     >      >>> described here: https://arxiv.org/abs/1805.10941 (I
    believe the
     >     biased
     >      >>> algorithm currently used in R is also described there).  I'm
     >     not an expert
     >      >>> in this area, but does it make sense for the R to adopt
    one of
     >     the unbiased
     >      >>> random sample algorithms outlined there and used in other
     >     languages?  Would
     >      >>> a patch providing such an algorithm be welcome? What
    concerns
     >     would need to
     >      >>> be addressed first?
     >      >>>
     >      >>> I believe this issue was also raised by Killie & Philip in
     >      >>>
    http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
     >     more
     >      >>> recently in
     >      >>>
     >
    https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
    <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>

> <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,

     >      >>> pointing to the python implementation for comparison:
     >      >>>
     >
    
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
     >      >>
     >      >> I think the analyses are correct, but I doubt if a change
    to the
     >     default
     >      >> is likely to be accepted as it would make it more
    difficult to
     >     reproduce
     >      >> older results.
     >      >>
     >      >> On the other hand, a contribution of a new function like
     >     sample() but
     >      >> not suffering from the bias would be good.  The normal way to
     >     make such
     >      >> a contribution is in a user contributed package.
     >      >>
     >      >> By the way, R code illustrating the bias is probably not very
     >     hard to
     >      >> put together.  I believe the bias manifests itself in
    sample()
     >     producing
     >      >> values with two different probabilities (instead of all equal
     >      >> probabilities).  Those may differ by as much as one part in
     >     2^32.  It's
     >      >
     >      > According to Kellie and Philip, in the attachment of the
    thread
     >      > referenced by Carl, "The maximum ratio of selection
    probabilities can
     >      > get as large as 1.5 if n is just below 2^31".
     >
     >     Sorry, I didn't write very well.  I meant to say that the
    difference in
     >     probabilities would be 2^-32, not that the ratio of
    probabilities would
     >     be 1 + 2^-32.
     >
     >     By the way, I don't see the statement giving the ratio as
    1.5, but
     >     maybe
     >     I was looking in the wrong place.  In Theorem 1 of the paper
    I was
     >     looking in the ratio was "1 + m 2^{-w + 1}".  In that formula
    m is your
     >     n.  If it is near 2^31, R uses w = 57 random bits, so the ratio
     >     would be
     >     very, very small (one part in 2^25).
     >

> The worst case for R would happen when m is just below2^25, where w

     >     is at least 31 for the default generators.  In that case the
    ratio
     >     could
     >     be about 1.03.
     >
     >     Duncan Murdoch
     >
     >
     >
     > --
     > Philip B. Stark | Associate Dean, Mathematical and Physical
    Sciences |
     > Professor,  Department of Statistics |
     > University of California
     > Berkeley, CA 94720-3860 | 510-394-5077 |
    statistics.berkeley.edu/~stark
    <http://statistics.berkeley.edu/%7Estark>
     > <http://statistics.berkeley.edu/%7Estark> |
     > @philipbstark
     >



--

Philip B. Stark | Associate Dean, Mathematical and Physical Sciences |Professor, Department of Statistics |

University of California

Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark<http://statistics.berkeley.edu/%7Estark> |

@philipbstark


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Bias in R's random integers?

Reply via email to