Re: [Rd] bias issue in sample() (PR 17494)

2019-02-26 Thread Ralf Stubner
Kirill,

I think some level of collision is actually expected! R uses a 32bit MT
that can produce 2^32 different doubles. The probability for a collision
within a million draws is

> pbirthday(1e6, classes = 2^32)
[1] 1

Greetings
Ralf


On 26.02.19 07:06, Kirill Müller wrote:
> Gabe
> 
> 
> As mentioned on Twitter, I think the following behavior should be fixed
> as part of the upcoming changes:
> 
> R.version.string
> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
> .Machine$double.digits
> ## [1] 53
> set.seed(123)
> RNGkind()
> ## [1] "Mersenne-Twister" "Inversion"    "Rejection"
> length(table(runif(1e6)))
> ## [1] 999863
> 
> I don't expect any collisions when using Mersenne-Twister to generate a
> million floating point values. I'm not sure what causes this behavior,
> but it's documented in ?Random:
> 
> "Do not rely on randomness of low-order bits from RNGs. Most of the
> supplied uniform generators return 32-bit integer values that are
> converted to doubles, so they take at most 2^32 distinct values and long
> runs will return duplicated values (Wichmann-Hill is the exception, and
> all give at least 30 varying bits.)"
> 
> The "Wichman-Hill" bit is interesting:
> 
> RNGkind("Wichmann-Hill")
> length(table(runif(1e6)))
> ## [1] 100
> length(table(runif(1e6)))
> ## [1] 100
> 
> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
> it would be great to see the above behavior also for Mersenne-Twister.
> Thanks for considering.
> 
> 
> Best regards
> 
> Kirill
> 
> 
> On 20.02.19 08:01, Gabriel Becker wrote:
>> Luke,
>>
>> I'm happy to help with this. Its great to see this get tackled (I've
>> cc'ed
>> Kelli Ottoboni who helped flag this issue).
>>
>> I can prepare a patch for the RNGkind related stuff and the doc update.
>>
>> As for ???, what are your (and others') thoughts about the possibility of
>> a) a reproducibility API which takes either an R version (or maybe
>> alternatively a date) and sets the RNGkind to the default for that
>> version/date, and/or b) that sessionInfo be modified to capture (and
>> display) the RNGkind in effect.
>>
>> Best,
>> ~G
>>
>>
>> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke 
>> wrote:
>>
>>> Before the next release we really should to sort out the bias issue in
>>> sample() reported by Ottoboni and Stark in
>>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
>>> filed aa a bug report by Duncan Murdoch at
>>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
>>>
>>> Here are two examples of bad behavior through current R-devel:
>>>
>>>   set.seed(123)
>>>   m <- (2/5) * 2^32
>>>   x <- sample(m, 100, replace = TRUE)
>>>   table(x %% 2, x > m / 2)
>>>   ##
>>>   ##    FALSE   TRUE
>>>   ## 0 300620 198792
>>>   ## 1 200196 300392
>>>
>>>   table(sample(2/7 * 2^32, 100, replace = TRUE) %% 2)
>>>   ##
>>>   ##  0  1
>>>   ## 429054 570946
>>>
>>> I committed a modification to R_unif_index to address this by
>>> generating random bits (blocks of 16) and rejection sampling, but for
>>> now this is only enabled if the environment variable R_NEW_SAMPLE is
>>> set before the first call.
>>>
>>> Some things still needed:
>>>
>>> - someone to look over the change and see if there are any issues
>>> - adjustment of RNGkind to allowing the old behavior to be selected
>>> - make the new behavior the default
>>> - adjust documentation
>>> - ???
>>>
>>> Unfortunately I don't have enough free cycles to do this, but I can
>>> help if someone else can take the lead.
>>>
>>> There are two other places I found that might suffer from the same
>>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
>>> src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
>>> have done that for walker_ProbSampleReplace, but the wilcox change
>>> might need adjusting to support the standalone math library and I
>>> don't feel confident enough I'd get that right.
>>>
>>> Best,
>>>
>>> luke
>>>
>>>
>>> -- 
>>> Luke Tierney
>>> Ralph E. Wareham Professor of Mathematical Sciences
>>> University of Iowa  Phone: 319-335-3386
>>> Department of Statistics and    Fax:   319-335-3017
>>>  Actuarial Science
>>> 241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
>>> Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu
>>>
>>> __
>>> R-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>> [[alternative HTML version deleted]]
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Po

Re: [Rd] bias issue in sample() (PR 17494)

2019-02-26 Thread Kirill Müller

Ralf


I don't doubt this is expected with the current implementation, I doubt 
the implementation is desirable. Suggesting to turn this to


pbirthday(1e6, classes = 2^53)
## [1] 5.550956e-05

(which is still non-zero, but much less likely to cause confusion.)


Best regards

Kirill

On 26.02.19 10:18, Ralf Stubner wrote:

Kirill,

I think some level of collision is actually expected! R uses a 32bit MT
that can produce 2^32 different doubles. The probability for a collision
within a million draws is


pbirthday(1e6, classes = 2^32)

[1] 1

Greetings
Ralf


On 26.02.19 07:06, Kirill Müller wrote:

Gabe


As mentioned on Twitter, I think the following behavior should be fixed
as part of the upcoming changes:

R.version.string
## [1] "R Under development (unstable) (2019-02-25 r76160)"
.Machine$double.digits
## [1] 53
set.seed(123)
RNGkind()
## [1] "Mersenne-Twister" "Inversion"    "Rejection"
length(table(runif(1e6)))
## [1] 999863

I don't expect any collisions when using Mersenne-Twister to generate a
million floating point values. I'm not sure what causes this behavior,
but it's documented in ?Random:

"Do not rely on randomness of low-order bits from RNGs. Most of the
supplied uniform generators return 32-bit integer values that are
converted to doubles, so they take at most 2^32 distinct values and long
runs will return duplicated values (Wichmann-Hill is the exception, and
all give at least 30 varying bits.)"

The "Wichman-Hill" bit is interesting:

RNGkind("Wichmann-Hill")
length(table(runif(1e6)))
## [1] 100
length(table(runif(1e6)))
## [1] 100

Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
it would be great to see the above behavior also for Mersenne-Twister.
Thanks for considering.


Best regards

Kirill


On 20.02.19 08:01, Gabriel Becker wrote:

Luke,

I'm happy to help with this. Its great to see this get tackled (I've
cc'ed
Kelli Ottoboni who helped flag this issue).

I can prepare a patch for the RNGkind related stuff and the doc update.

As for ???, what are your (and others') thoughts about the possibility of
a) a reproducibility API which takes either an R version (or maybe
alternatively a date) and sets the RNGkind to the default for that
version/date, and/or b) that sessionInfo be modified to capture (and
display) the RNGkind in effect.

Best,
~G


On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke 
wrote:


Before the next release we really should to sort out the bias issue in
sample() reported by Ottoboni and Stark in
https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
filed aa a bug report by Duncan Murdoch at
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.

Here are two examples of bad behavior through current R-devel:

   set.seed(123)
   m <- (2/5) * 2^32
   x <- sample(m, 100, replace = TRUE)
   table(x %% 2, x > m / 2)
   ##
   ##    FALSE   TRUE
   ## 0 300620 198792
   ## 1 200196 300392

   table(sample(2/7 * 2^32, 100, replace = TRUE) %% 2)
   ##
   ##  0  1
   ## 429054 570946

I committed a modification to R_unif_index to address this by
generating random bits (blocks of 16) and rejection sampling, but for
now this is only enabled if the environment variable R_NEW_SAMPLE is
set before the first call.

Some things still needed:

- someone to look over the change and see if there are any issues
- adjustment of RNGkind to allowing the old behavior to be selected
- make the new behavior the default
- adjust documentation
- ???

Unfortunately I don't have enough free cycles to do this, but I can
help if someone else can take the lead.

There are two other places I found that might suffer from the same
issue, in walker_ProbSampleReplace (pointed out bu O & S) and in
src/nmath/wilcox.c.  Both can be addressed by using R_unif_index. I
have done that for walker_ProbSampleReplace, but the wilcox change
might need adjusting to support the standalone math library and I
don't feel confident enough I'd get that right.

Best,

luke


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics and    Fax:   319-335-3017
  Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


 [[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] bias issue in sample() (PR 17494)

2019-02-26 Thread Tierney, Luke
On Tue, 26 Feb 2019, Kirill Müller wrote:

> Ralf
>
>
> I don't doubt this is expected with the current implementation, I doubt the 
> implementation is desirable. Suggesting to turn this to
>
> pbirthday(1e6, classes = 2^53)
> ## [1] 5.550956e-05

That isn't a small number given simulation sizes people routinely run
these days. Just about right to miss an issue in a pilot run and get
bitten on the real one.

In the inversion generator for normals we already use a higher
resolution uniform produced from two regular ones. I considered
switching to that approach for all uniforms, either in addition to or
instead of changing the uniform integer sampling algorithm used in
sample(). But that would have been even more disruptive:

- all simulation results (except normals) would change;
- there would be a performance penalty;
- the streams would be used up twice as fast;

I would also probably be necessary to rethink things like how to use
the L'Ecuyer generator to produce multiple streams in the `parallel`
package.

We may need to take this route in the future, but it didn't seem like
a good idea at this time.

Best,

luke

>
> (which is still non-zero, but much less likely to cause confusion.)
>
>
> Best regards
>
> Kirill
>
> On 26.02.19 10:18, Ralf Stubner wrote:
>> Kirill,
>> 
>> I think some level of collision is actually expected! R uses a 32bit MT
>> that can produce 2^32 different doubles. The probability for a collision
>> within a million draws is
>> 
>>> pbirthday(1e6, classes = 2^32)
>> [1] 1
>> 
>> Greetings
>> Ralf
>> 
>> 
>> On 26.02.19 07:06, Kirill Müller wrote:
>>> Gabe
>>> 
>>> 
>>> As mentioned on Twitter, I think the following behavior should be fixed
>>> as part of the upcoming changes:
>>> 
>>> R.version.string
>>> ## [1] "R Under development (unstable) (2019-02-25 r76160)"
>>> .Machine$double.digits
>>> ## [1] 53
>>> set.seed(123)
>>> RNGkind()
>>> ## [1] "Mersenne-Twister" "Inversion"    "Rejection"
>>> length(table(runif(1e6)))
>>> ## [1] 999863
>>> 
>>> I don't expect any collisions when using Mersenne-Twister to generate a
>>> million floating point values. I'm not sure what causes this behavior,
>>> but it's documented in ?Random:
>>> 
>>> "Do not rely on randomness of low-order bits from RNGs. Most of the
>>> supplied uniform generators return 32-bit integer values that are
>>> converted to doubles, so they take at most 2^32 distinct values and long
>>> runs will return duplicated values (Wichmann-Hill is the exception, and
>>> all give at least 30 varying bits.)"
>>> 
>>> The "Wichman-Hill" bit is interesting:
>>> 
>>> RNGkind("Wichmann-Hill")
>>> length(table(runif(1e6)))
>>> ## [1] 100
>>> length(table(runif(1e6)))
>>> ## [1] 100
>>> 
>>> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill,
>>> it would be great to see the above behavior also for Mersenne-Twister.
>>> Thanks for considering.
>>> 
>>> 
>>> Best regards
>>> 
>>> Kirill
>>> 
>>> 
>>> On 20.02.19 08:01, Gabriel Becker wrote:
 Luke,
 
 I'm happy to help with this. Its great to see this get tackled (I've
 cc'ed
 Kelli Ottoboni who helped flag this issue).
 
 I can prepare a patch for the RNGkind related stuff and the doc update.
 
 As for ???, what are your (and others') thoughts about the possibility of
 a) a reproducibility API which takes either an R version (or maybe
 alternatively a date) and sets the RNGkind to the default for that
 version/date, and/or b) that sessionInfo be modified to capture (and
 display) the RNGkind in effect.
 
 Best,
 ~G
 
 
 On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke 
 wrote:
 
> Before the next release we really should to sort out the bias issue in
> sample() reported by Ottoboni and Stark in
> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and
> filed aa a bug report by Duncan Murdoch at
> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494.
> 
> Here are two examples of bad behavior through current R-devel:
>
>    set.seed(123)
>    m <- (2/5) * 2^32
>    x <- sample(m, 100, replace = TRUE)
>    table(x %% 2, x > m / 2)
>    ##
>    ##    FALSE   TRUE
>    ## 0 300620 198792
>    ## 1 200196 300392
>
>    table(sample(2/7 * 2^32, 100, replace = TRUE) %% 2)
>    ##
>    ##  0  1
>    ## 429054 570946
> 
> I committed a modification to R_unif_index to address this by
> generating random bits (blocks of 16) and rejection sampling, but for
> now this is only enabled if the environment variable R_NEW_SAMPLE is
> set before the first call.
> 
> Some things still needed:
> 
> - someone to look over the change and see if there are any issues
> - adjustment of RNGkind to allowing the old behavior to be selected
> - make the new behavior the default
> - adjust

[Rd] Possible Update to R-internals Manual

2019-02-26 Thread brodie gaslam via R-devel
According the R-ints the only current uses of the `truelength` meta datum is 
for environment hash tables.  Jim Hester just made me aware that R3.4.0 
introduces a new use case: growable vectors.

I attach a patch to the R-ints manual that reflects this change.  The wording 
is obviously just a suggestion.  Additionally, it may be worth moving the 
footnote into the main body of the document now that there are more use cases.

Best,

Brodie.Index: doc/manual/R-ints.texi
===
--- doc/manual/R-ints.texi  (revision 76152)
+++ doc/manual/R-ints.texi  (working copy)
@@ -366,6 +366,9 @@
 
 Bit 4 is turned on to mark S4 objects.
 
+Bit 5 for vectors is used to indicate that the vector is overallocated
+and thus may be growable without a new allocation.
+
 Bits 1, 2, 3, 5 and 6 are used for a @code{CHARSXP} to denote its
 encoding.  Bit 1 indicates that the @code{CHARSXP} should be treated as
 a set of bytes, not necessarily representing a character in any known
@@ -406,16 +409,19 @@
 types are a @code{VECTOR_SEXPREC}, which again consists of the header
 and the same three pointers, but followed by two integers giving the
 length and `true length'@footnote{This is almost unused.  The only
-current use is for hash tables of environments (@code{VECSXP}s), where
+current uses are for hash tables of environments (@code{VECSXP}s), where
 @code{length} is the size of the table and @code{truelength} is the
-number of primary slots in use, and for the reference hash tables in
+number of primary slots in use, for the reference hash tables in
 serialization (@code{VECSXP}s), where @code{truelength} is the number of
-slots in use.} of the vector, and then followed by the data (aligned as
-required: on most 32-bit systems with a 24-byte @code{VECTOR_SEXPREC}
-node the data can follow immediately after the node).  The data are a
-block of memory of the appropriate length to store `true length'
-elements (rounded up to a multiple of 8 bytes, with the 8-byte blocks
-being the `Vcells' referred in the documentation for @code{gc()}).
+slots in use, and for vectors that are over-allocated due to assignment
+past the original length, where @code{length} is the in-use length and
+@code{truelength} is the allocated length.} of the vector, and then
+followed by the data (aligned as required: on most 32-bit systems with a
+24-byte @code{VECTOR_SEXPREC} node the data can follow immediately after
+the node).  The data are a block of memory of the appropriate length to
+store `true length' elements (rounded up to a multiple of 8 bytes, with
+the 8-byte blocks being the `Vcells' referred in the documentation for
+@code{gc()}).
 
 The `data' for the various types are given in the table below.  A lot of
 this is interpretation, i.e.@: the types are not checked.
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Compile R to WebAssembly / Emscripten?

2019-02-26 Thread Gabriel Becker
As I recall, the major blocker is that R links against a number of other
things (notably BLAS, pcre, etc) so while technically possible (?) I
suppose, the universe of things you'd have to compile over and then get
working is much larger than just the R internals.

I think most people who consider this (including me years ago, as well as
the poster of Gabor's message to rdevel) hit that point and then go try to
find a less herculean task to pursue.

~G

On Wed, Feb 20, 2019 at 12:57 AM Gábor Csárdi 
wrote:

> This was some time ago:
> https://stat.ethz.ch/pipermail/r-devel/2013-May/066724.html
>
> So probably not hopeless, but I would think it is a lot of work.
>
> Gabor
>
> On Wed, Feb 20, 2019 at 8:17 AM Todd Wilder  wrote:
> >
> > Has anyone attempted to compile R (probably without any OS bindings) to
> > WebAssembly / Emscripten? If so, how far did you get? (would be crazy
> > awesome if you could get all the way to a ggplot bitmap output). If not,
> is
> > this a waste of time or is there some daylight to doing this?
> >
> > [[alternative HTML version deleted]]
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel