Re: [Rd] bias issue in sample() (PR 17494)
Kirill, I think some level of collision is actually expected! R uses a 32bit MT that can produce 2^32 different doubles. The probability for a collision within a million draws is > pbirthday(1e6, classes = 2^32) [1] 1 Greetings Ralf On 26.02.19 07:06, Kirill Müller wrote: > Gabe > > > As mentioned on Twitter, I think the following behavior should be fixed > as part of the upcoming changes: > > R.version.string > ## [1] "R Under development (unstable) (2019-02-25 r76160)" > .Machine$double.digits > ## [1] 53 > set.seed(123) > RNGkind() > ## [1] "Mersenne-Twister" "Inversion" "Rejection" > length(table(runif(1e6))) > ## [1] 999863 > > I don't expect any collisions when using Mersenne-Twister to generate a > million floating point values. I'm not sure what causes this behavior, > but it's documented in ?Random: > > "Do not rely on randomness of low-order bits from RNGs. Most of the > supplied uniform generators return 32-bit integer values that are > converted to doubles, so they take at most 2^32 distinct values and long > runs will return duplicated values (Wichmann-Hill is the exception, and > all give at least 30 varying bits.)" > > The "Wichman-Hill" bit is interesting: > > RNGkind("Wichmann-Hill") > length(table(runif(1e6))) > ## [1] 100 > length(table(runif(1e6))) > ## [1] 100 > > Mersenne-Twister has a much much larger periodicity than Wichmann-Hill, > it would be great to see the above behavior also for Mersenne-Twister. > Thanks for considering. > > > Best regards > > Kirill > > > On 20.02.19 08:01, Gabriel Becker wrote: >> Luke, >> >> I'm happy to help with this. Its great to see this get tackled (I've >> cc'ed >> Kelli Ottoboni who helped flag this issue). >> >> I can prepare a patch for the RNGkind related stuff and the doc update. >> >> As for ???, what are your (and others') thoughts about the possibility of >> a) a reproducibility API which takes either an R version (or maybe >> alternatively a date) and sets the RNGkind to the default for that >> version/date, and/or b) that sessionInfo be modified to capture (and >> display) the RNGkind in effect. >> >> Best, >> ~G >> >> >> On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke >> wrote: >> >>> Before the next release we really should to sort out the bias issue in >>> sample() reported by Ottoboni and Stark in >>> https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and >>> filed aa a bug report by Duncan Murdoch at >>> https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. >>> >>> Here are two examples of bad behavior through current R-devel: >>> >>> set.seed(123) >>> m <- (2/5) * 2^32 >>> x <- sample(m, 100, replace = TRUE) >>> table(x %% 2, x > m / 2) >>> ## >>> ## FALSE TRUE >>> ## 0 300620 198792 >>> ## 1 200196 300392 >>> >>> table(sample(2/7 * 2^32, 100, replace = TRUE) %% 2) >>> ## >>> ## 0 1 >>> ## 429054 570946 >>> >>> I committed a modification to R_unif_index to address this by >>> generating random bits (blocks of 16) and rejection sampling, but for >>> now this is only enabled if the environment variable R_NEW_SAMPLE is >>> set before the first call. >>> >>> Some things still needed: >>> >>> - someone to look over the change and see if there are any issues >>> - adjustment of RNGkind to allowing the old behavior to be selected >>> - make the new behavior the default >>> - adjust documentation >>> - ??? >>> >>> Unfortunately I don't have enough free cycles to do this, but I can >>> help if someone else can take the lead. >>> >>> There are two other places I found that might suffer from the same >>> issue, in walker_ProbSampleReplace (pointed out bu O & S) and in >>> src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I >>> have done that for walker_ProbSampleReplace, but the wilcox change >>> might need adjusting to support the standalone math library and I >>> don't feel confident enough I'd get that right. >>> >>> Best, >>> >>> luke >>> >>> >>> -- >>> Luke Tierney >>> Ralph E. Wareham Professor of Mathematical Sciences >>> University of Iowa Phone: 319-335-3386 >>> Department of Statistics and Fax: 319-335-3017 >>> Actuarial Science >>> 241 Schaeffer Hall email: luke-tier...@uiowa.edu >>> Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu >>> >>> __ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >> [[alternative HTML version deleted]] >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel -- Ralf Stubner Senior Software Engineer / Trainer daqana GmbH Dortustraße 48 14467 Po
Re: [Rd] bias issue in sample() (PR 17494)
Ralf I don't doubt this is expected with the current implementation, I doubt the implementation is desirable. Suggesting to turn this to pbirthday(1e6, classes = 2^53) ## [1] 5.550956e-05 (which is still non-zero, but much less likely to cause confusion.) Best regards Kirill On 26.02.19 10:18, Ralf Stubner wrote: Kirill, I think some level of collision is actually expected! R uses a 32bit MT that can produce 2^32 different doubles. The probability for a collision within a million draws is pbirthday(1e6, classes = 2^32) [1] 1 Greetings Ralf On 26.02.19 07:06, Kirill Müller wrote: Gabe As mentioned on Twitter, I think the following behavior should be fixed as part of the upcoming changes: R.version.string ## [1] "R Under development (unstable) (2019-02-25 r76160)" .Machine$double.digits ## [1] 53 set.seed(123) RNGkind() ## [1] "Mersenne-Twister" "Inversion" "Rejection" length(table(runif(1e6))) ## [1] 999863 I don't expect any collisions when using Mersenne-Twister to generate a million floating point values. I'm not sure what causes this behavior, but it's documented in ?Random: "Do not rely on randomness of low-order bits from RNGs. Most of the supplied uniform generators return 32-bit integer values that are converted to doubles, so they take at most 2^32 distinct values and long runs will return duplicated values (Wichmann-Hill is the exception, and all give at least 30 varying bits.)" The "Wichman-Hill" bit is interesting: RNGkind("Wichmann-Hill") length(table(runif(1e6))) ## [1] 100 length(table(runif(1e6))) ## [1] 100 Mersenne-Twister has a much much larger periodicity than Wichmann-Hill, it would be great to see the above behavior also for Mersenne-Twister. Thanks for considering. Best regards Kirill On 20.02.19 08:01, Gabriel Becker wrote: Luke, I'm happy to help with this. Its great to see this get tackled (I've cc'ed Kelli Ottoboni who helped flag this issue). I can prepare a patch for the RNGkind related stuff and the doc update. As for ???, what are your (and others') thoughts about the possibility of a) a reproducibility API which takes either an R version (or maybe alternatively a date) and sets the RNGkind to the default for that version/date, and/or b) that sessionInfo be modified to capture (and display) the RNGkind in effect. Best, ~G On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke wrote: Before the next release we really should to sort out the bias issue in sample() reported by Ottoboni and Stark in https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and filed aa a bug report by Duncan Murdoch at https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. Here are two examples of bad behavior through current R-devel: set.seed(123) m <- (2/5) * 2^32 x <- sample(m, 100, replace = TRUE) table(x %% 2, x > m / 2) ## ## FALSE TRUE ## 0 300620 198792 ## 1 200196 300392 table(sample(2/7 * 2^32, 100, replace = TRUE) %% 2) ## ## 0 1 ## 429054 570946 I committed a modification to R_unif_index to address this by generating random bits (blocks of 16) and rejection sampling, but for now this is only enabled if the environment variable R_NEW_SAMPLE is set before the first call. Some things still needed: - someone to look over the change and see if there are any issues - adjustment of RNGkind to allowing the old behavior to be selected - make the new behavior the default - adjust documentation - ??? Unfortunately I don't have enough free cycles to do this, but I can help if someone else can take the lead. There are two other places I found that might suffer from the same issue, in walker_ProbSampleReplace (pointed out bu O & S) and in src/nmath/wilcox.c. Both can be addressed by using R_unif_index. I have done that for walker_ProbSampleReplace, but the wilcox change might need adjusting to support the standalone math library and I don't feel confident enough I'd get that right. Best, luke -- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics and Fax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tier...@uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] bias issue in sample() (PR 17494)
On Tue, 26 Feb 2019, Kirill Müller wrote: > Ralf > > > I don't doubt this is expected with the current implementation, I doubt the > implementation is desirable. Suggesting to turn this to > > pbirthday(1e6, classes = 2^53) > ## [1] 5.550956e-05 That isn't a small number given simulation sizes people routinely run these days. Just about right to miss an issue in a pilot run and get bitten on the real one. In the inversion generator for normals we already use a higher resolution uniform produced from two regular ones. I considered switching to that approach for all uniforms, either in addition to or instead of changing the uniform integer sampling algorithm used in sample(). But that would have been even more disruptive: - all simulation results (except normals) would change; - there would be a performance penalty; - the streams would be used up twice as fast; I would also probably be necessary to rethink things like how to use the L'Ecuyer generator to produce multiple streams in the `parallel` package. We may need to take this route in the future, but it didn't seem like a good idea at this time. Best, luke > > (which is still non-zero, but much less likely to cause confusion.) > > > Best regards > > Kirill > > On 26.02.19 10:18, Ralf Stubner wrote: >> Kirill, >> >> I think some level of collision is actually expected! R uses a 32bit MT >> that can produce 2^32 different doubles. The probability for a collision >> within a million draws is >> >>> pbirthday(1e6, classes = 2^32) >> [1] 1 >> >> Greetings >> Ralf >> >> >> On 26.02.19 07:06, Kirill Müller wrote: >>> Gabe >>> >>> >>> As mentioned on Twitter, I think the following behavior should be fixed >>> as part of the upcoming changes: >>> >>> R.version.string >>> ## [1] "R Under development (unstable) (2019-02-25 r76160)" >>> .Machine$double.digits >>> ## [1] 53 >>> set.seed(123) >>> RNGkind() >>> ## [1] "Mersenne-Twister" "Inversion" "Rejection" >>> length(table(runif(1e6))) >>> ## [1] 999863 >>> >>> I don't expect any collisions when using Mersenne-Twister to generate a >>> million floating point values. I'm not sure what causes this behavior, >>> but it's documented in ?Random: >>> >>> "Do not rely on randomness of low-order bits from RNGs. Most of the >>> supplied uniform generators return 32-bit integer values that are >>> converted to doubles, so they take at most 2^32 distinct values and long >>> runs will return duplicated values (Wichmann-Hill is the exception, and >>> all give at least 30 varying bits.)" >>> >>> The "Wichman-Hill" bit is interesting: >>> >>> RNGkind("Wichmann-Hill") >>> length(table(runif(1e6))) >>> ## [1] 100 >>> length(table(runif(1e6))) >>> ## [1] 100 >>> >>> Mersenne-Twister has a much much larger periodicity than Wichmann-Hill, >>> it would be great to see the above behavior also for Mersenne-Twister. >>> Thanks for considering. >>> >>> >>> Best regards >>> >>> Kirill >>> >>> >>> On 20.02.19 08:01, Gabriel Becker wrote: Luke, I'm happy to help with this. Its great to see this get tackled (I've cc'ed Kelli Ottoboni who helped flag this issue). I can prepare a patch for the RNGkind related stuff and the doc update. As for ???, what are your (and others') thoughts about the possibility of a) a reproducibility API which takes either an R version (or maybe alternatively a date) and sets the RNGkind to the default for that version/date, and/or b) that sessionInfo be modified to capture (and display) the RNGkind in effect. Best, ~G On Tue, Feb 19, 2019 at 11:52 AM Tierney, Luke wrote: > Before the next release we really should to sort out the bias issue in > sample() reported by Ottoboni and Stark in > https://www.stat.berkeley.edu/~stark/Preprints/r-random-issues.pdf and > filed aa a bug report by Duncan Murdoch at > https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494. > > Here are two examples of bad behavior through current R-devel: > > set.seed(123) > m <- (2/5) * 2^32 > x <- sample(m, 100, replace = TRUE) > table(x %% 2, x > m / 2) > ## > ## FALSE TRUE > ## 0 300620 198792 > ## 1 200196 300392 > > table(sample(2/7 * 2^32, 100, replace = TRUE) %% 2) > ## > ## 0 1 > ## 429054 570946 > > I committed a modification to R_unif_index to address this by > generating random bits (blocks of 16) and rejection sampling, but for > now this is only enabled if the environment variable R_NEW_SAMPLE is > set before the first call. > > Some things still needed: > > - someone to look over the change and see if there are any issues > - adjustment of RNGkind to allowing the old behavior to be selected > - make the new behavior the default > - adjust
[Rd] Possible Update to R-internals Manual
According the R-ints the only current uses of the `truelength` meta datum is for environment hash tables. Jim Hester just made me aware that R3.4.0 introduces a new use case: growable vectors. I attach a patch to the R-ints manual that reflects this change. The wording is obviously just a suggestion. Additionally, it may be worth moving the footnote into the main body of the document now that there are more use cases. Best, Brodie.Index: doc/manual/R-ints.texi === --- doc/manual/R-ints.texi (revision 76152) +++ doc/manual/R-ints.texi (working copy) @@ -366,6 +366,9 @@ Bit 4 is turned on to mark S4 objects. +Bit 5 for vectors is used to indicate that the vector is overallocated +and thus may be growable without a new allocation. + Bits 1, 2, 3, 5 and 6 are used for a @code{CHARSXP} to denote its encoding. Bit 1 indicates that the @code{CHARSXP} should be treated as a set of bytes, not necessarily representing a character in any known @@ -406,16 +409,19 @@ types are a @code{VECTOR_SEXPREC}, which again consists of the header and the same three pointers, but followed by two integers giving the length and `true length'@footnote{This is almost unused. The only -current use is for hash tables of environments (@code{VECSXP}s), where +current uses are for hash tables of environments (@code{VECSXP}s), where @code{length} is the size of the table and @code{truelength} is the -number of primary slots in use, and for the reference hash tables in +number of primary slots in use, for the reference hash tables in serialization (@code{VECSXP}s), where @code{truelength} is the number of -slots in use.} of the vector, and then followed by the data (aligned as -required: on most 32-bit systems with a 24-byte @code{VECTOR_SEXPREC} -node the data can follow immediately after the node). The data are a -block of memory of the appropriate length to store `true length' -elements (rounded up to a multiple of 8 bytes, with the 8-byte blocks -being the `Vcells' referred in the documentation for @code{gc()}). +slots in use, and for vectors that are over-allocated due to assignment +past the original length, where @code{length} is the in-use length and +@code{truelength} is the allocated length.} of the vector, and then +followed by the data (aligned as required: on most 32-bit systems with a +24-byte @code{VECTOR_SEXPREC} node the data can follow immediately after +the node). The data are a block of memory of the appropriate length to +store `true length' elements (rounded up to a multiple of 8 bytes, with +the 8-byte blocks being the `Vcells' referred in the documentation for +@code{gc()}). The `data' for the various types are given in the table below. A lot of this is interpretation, i.e.@: the types are not checked. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Compile R to WebAssembly / Emscripten?
As I recall, the major blocker is that R links against a number of other things (notably BLAS, pcre, etc) so while technically possible (?) I suppose, the universe of things you'd have to compile over and then get working is much larger than just the R internals. I think most people who consider this (including me years ago, as well as the poster of Gabor's message to rdevel) hit that point and then go try to find a less herculean task to pursue. ~G On Wed, Feb 20, 2019 at 12:57 AM Gábor Csárdi wrote: > This was some time ago: > https://stat.ethz.ch/pipermail/r-devel/2013-May/066724.html > > So probably not hopeless, but I would think it is a lot of work. > > Gabor > > On Wed, Feb 20, 2019 at 8:17 AM Todd Wilder wrote: > > > > Has anyone attempted to compile R (probably without any OS bindings) to > > WebAssembly / Emscripten? If so, how far did you get? (would be crazy > > awesome if you could get all the way to a ggplot bitmap output). If not, > is > > this a waste of time or is there some daylight to doing this? > > > > [[alternative HTML version deleted]] > > > > __ > > R-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel