On 19/09/2018 3:52 PM, Philip B. Stark wrote:
Hi Duncan--

Nice simulation!

The absolute difference in probabilities is small, but the maximum relative difference grows from something negligible to almost 2 as m approaches 2**31.

Because the L_1 distance between the uniform distribution on {1, ..., m} and what you actually get is large, there have to be test functions whose expectations are quite different under the two distributions.

That is a mathematically true statement, but I suspect it is not very relevant. Pseudo-random number generators always have test functions whose sample averages are quite different from the expectation under the true distribution. Remember Von Neumann's "state of sin" quote. The bug in sample() just means it is easier to find such a function than it would otherwise be.

The practical question is whether such a function is likely to arise in practice or not.

> Whether those correspond to commonly used statistics or not, I have no
> idea.

I am pretty confident that this bug rarely matters.

Regarding backwards compatibility: as a user, I'd rather the default sample() do the best possible thing, and take an extra step to use something like sample(..., legacy=TRUE) if I want to reproduce old results.

I suspect there's a good chance the bug I discovered today (non-integer x values not being truncated) will be declared to be a feature, and the documentation will be changed. Then the rejection sampling approach would need to be quite a bit more complicated.

I think a documentation warning about the accuracy of sampling probabilities would also be a sufficient fix here, and would be quite a bit less trouble than changing the default sample(). But as I said in my original post, a contribution of a function without this bug would be a nice addition.

Duncan Murdoch


Regards,
Philip

On Wed, Sep 19, 2018 at 9:50 AM Duncan Murdoch <murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com>> wrote:

    On 19/09/2018 12:23 PM, Philip B. Stark wrote:
     > No, the 2nd call only happens when m > 2**31. Here's the code:

    Yes, you're right. Sorry!

    So the ratio really does come close to 2.  However, the difference in
    probabilities between outcomes is still at most 2^-32 when m is less
    than that cutoff.  That's not feasible to detect; the only detectable
    difference would happen if some event was constructed to hold an
    abundance of outcomes with especially low (or especially high)
    probability.

    As I said in my original post, it's probably not hard to construct such
    a thing, but as I've said more recently, it probably wouldn't happen by
    chance.  Here's one attempt to do it:

    Call the values from unif_rand() "the unif_rand() outcomes".  Call the
    values from sample() the sample outcomes.

    It would be easiest to see the error if half of the sample() outcomes
    used two unif_rand() outcomes, and half used just one.  That would mean
    m should be (2/3) * 2^32, but that's too big and would trigger the
    other
    version.

    So how about half use 2 unif_rands(), and half use 3?  That means m =
    (2/5) * 2^32 = 1717986918.  A good guess is that sample() outcomes
    would
    alternate between the two possibilities, so our event could be even
    versus odd outcomes.

    Let's try it:

      > m <- (2/5)*2^32
      > m > 2^31
    [1] FALSE
      > x <- sample(m, 1000000, replace = TRUE)
      > table(x %% 2)

           0      1
    399850 600150

    Since m is an even number, the true proportions of evens and odds
    should
    be exactly 0.5.  That's some pretty strong evidence of the bug in the
    generator.  (Note that the ratio of the observed probabilities is about
    1.5, so I may not be the first person to have done this.)

    I'm still not convinced that there has ever been a simulation run with
    detectable bias compared to Monte Carlo error unless it (like this one)
    was designed specifically to show the problem.

    Duncan Murdoch

     >
     > (RNG.c, lines 793ff)
     >
     > double R_unif_index(double dn)
     > {
     >      double cut = INT_MAX;
     >
     >      switch(RNG_kind) {
     >      case KNUTH_TAOCP:
     >      case USER_UNIF:
     >      case KNUTH_TAOCP2:
     > cut = 33554431.0; /* 2^25 - 1 */
     > break;
     >      default:
     > break;
     >     }
     >
     >      double u = dn > cut ? ru() : unif_rand();
     >      return floor(dn * u);
     > }
     >
     > On Wed, Sep 19, 2018 at 9:20 AM Duncan Murdoch
    <murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com>
     > <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>>> wrote:
     >
     >     On 19/09/2018 12:09 PM, Philip B. Stark wrote:
     >      > The 53 bits only encode at most 2^{32} possible values,
    because the
     >      > source of the float is the output of a 32-bit PRNG (the
    obsolete
     >     version
     >      > of MT). 53 bits isn't the relevant number here.
     >
     >     No, two calls to unif_rand() are used.  There are two 32 bit
    values,
     >     but
     >     some of the bits are thrown away.
     >
     >     Duncan Murdoch
     >
     >      >
     >      > The selection ratios can get close to 2. Computer scientists
     >     don't do it
     >      > the way R does, for a reason.
     >      >
     >      > Regards,
     >      > Philip
     >      >
     >      > On Wed, Sep 19, 2018 at 9:05 AM Duncan Murdoch
     >     <murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com>
    <mailto:murdoch.dun...@gmail.com <mailto:murdoch.dun...@gmail.com>>
     >      > <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>
     >     <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>>>> wrote:
     >      >
     >      >     On 19/09/2018 9:09 AM, Iñaki Ucar wrote:
     >      >      > El mié., 19 sept. 2018 a las 14:43, Duncan Murdoch
     >      >      > (<murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>
     >     <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>> <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>
     >     <mailto:murdoch.dun...@gmail.com
    <mailto:murdoch.dun...@gmail.com>>>>)
     >      >     escribió:
     >      >      >>
     >      >      >> On 18/09/2018 5:46 PM, Carl Boettiger wrote:
     >      >      >>> Dear list,
     >      >      >>>
     >      >      >>> It looks to me that R samples random integers
    using an
     >      >     intuitive but biased
     >      >      >>> algorithm by going from a random number on [0,1) from
     >     the PRNG
     >      >     to a random
     >      >      >>> integer, e.g.
     >      >      >>>
     >      >
    https://github.com/wch/r-source/blob/tags/R-3-5-1/src/main/RNG.c#L808
     >      >      >>>
     >      >      >>> Many other languages use various rejection sampling
     >     approaches
     >      >     which
     >      >      >>> provide an unbiased method for sampling, such as
    in Go,
     >     python,
     >      >     and others
     >      >      >>> described here: https://arxiv.org/abs/1805.10941 (I
     >     believe the
     >      >     biased
     >      >      >>> algorithm currently used in R is also described
    there).  I'm
     >      >     not an expert
     >      >      >>> in this area, but does it make sense for the R to
    adopt
     >     one of
     >      >     the unbiased
     >      >      >>> random sample algorithms outlined there and used
    in other
     >      >     languages?  Would
     >      >      >>> a patch providing such an algorithm be welcome? What
     >     concerns
     >      >     would need to
     >      >      >>> be addressed first?
     >      >      >>>
     >      >      >>> I believe this issue was also raised by Killie &
    Philip in
     >      >      >>>
     > http://r.789695.n4.nabble.com/Bug-in-sample-td4729483.html, and
     >      >     more
     >      >      >>> recently in
     >      >      >>>
     >      >
     >
    https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf
    <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
>  <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>
     >      >
>  <https://www.stat.berkeley.edu/%7Estark/Preprints/r-random-issues.pdf>,
     >      >      >>> pointing to the python implementation for comparison:
     >      >      >>>
     >      >
     >
    
https://github.com/statlab/cryptorandom/blob/master/cryptorandom/cryptorandom.py#L265
     >      >      >>
     >      >      >> I think the analyses are correct, but I doubt if a
    change
     >     to the
     >      >     default
     >      >      >> is likely to be accepted as it would make it more
     >     difficult to
     >      >     reproduce
     >      >      >> older results.
     >      >      >>
     >      >      >> On the other hand, a contribution of a new
    function like
     >      >     sample() but
     >      >      >> not suffering from the bias would be good.  The
    normal way to
     >      >     make such
     >      >      >> a contribution is in a user contributed package.
     >      >      >>
     >      >      >> By the way, R code illustrating the bias is
    probably not very
     >      >     hard to
     >      >      >> put together.  I believe the bias manifests itself in
     >     sample()
     >      >     producing
     >      >      >> values with two different probabilities (instead
    of all equal
     >      >      >> probabilities).  Those may differ by as much as
    one part in
     >      >     2^32.  It's
     >      >      >
     >      >      > According to Kellie and Philip, in the attachment
    of the
     >     thread
     >      >      > referenced by Carl, "The maximum ratio of selection
     >     probabilities can
     >      >      > get as large as 1.5 if n is just below 2^31".
     >      >
     >      >     Sorry, I didn't write very well.  I meant to say that the
     >     difference in
     >      >     probabilities would be 2^-32, not that the ratio of
     >     probabilities would
     >      >     be 1 + 2^-32.
     >      >
     >      >     By the way, I don't see the statement giving the ratio as
     >     1.5, but
     >      >     maybe
     >      >     I was looking in the wrong place.  In Theorem 1 of the
    paper
     >     I was
     >      >     looking in the ratio was "1 + m 2^{-w + 1}".  In that
    formula
     >     m is your
     >      >     n.  If it is near 2^31, R uses w = 57 random bits, so
    the ratio
     >      >     would be
     >      >     very, very small (one part in 2^25).
     >      >
     >      >     The worst case for R would happen when m  is just below
     >     2^25, where w
     >      >     is at least 31 for the default generators.  In that
    case the
     >     ratio
     >      >     could
     >      >     be about 1.03.
     >      >
     >      >     Duncan Murdoch
     >      >
     >      >
     >      >
     >      > --
     >      > Philip B. Stark | Associate Dean, Mathematical and Physical
     >     Sciences |
     >      > Professor,  Department of Statistics |
     >      > University of California
     >      > Berkeley, CA 94720-3860 | 510-394-5077 |
     > statistics.berkeley.edu/~stark
    <http://statistics.berkeley.edu/%7Estark>
     >     <http://statistics.berkeley.edu/%7Estark>
     >      > <http://statistics.berkeley.edu/%7Estark> |
     >      > @philipbstark
     >      >
     >
     >
     >
     > --
     > Philip B. Stark | Associate Dean, Mathematical and Physical
    Sciences |
     > Professor,  Department of Statistics |
     > University of California
     > Berkeley, CA 94720-3860 | 510-394-5077 |
    statistics.berkeley.edu/~stark
    <http://statistics.berkeley.edu/%7Estark>
     > <http://statistics.berkeley.edu/%7Estark> |
     > @philipbstark
     >



--
Philip B. Stark | Associate Dean, Mathematical and Physical Sciences | Professor,  Department of Statistics |
University of California
Berkeley, CA 94720-3860 | 510-394-5077 | statistics.berkeley.edu/~stark <http://statistics.berkeley.edu/%7Estark> |
@philipbstark


______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to