Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Ralf Stubner
On 9/20/18 5:15 PM, Duncan Murdoch wrote:
> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
>> It is difficult to do this in a package, since R does not provide access
>> to the random bits generated by the RNG. Only a float in (0,1) is
>> available via unif_rand(). 
> 
> I believe it is safe to multiply the unif_rand() value by 2^32, and take
> the whole number part as an unsigned 32 bit integer.  Depending on the
> RNG in use, that will give at least 25 random bits.  (The low order bits
> are the questionable ones.  25 is just a guess, not a guarantee.)

Right, the RNGs in R produce no more than 32bits, so the conversion to a
double can be reverted. If we ignore those RNGs that produce less than
32bits for the moment, then the attached file contains a sample
implementation (without long vectors, weighted sampling or hashing). It
uses Rcpp for convenience, but I have tried to keep the C++ low.
Interesting results:

The results for "simple" sampling are the same.

> set.seed(42)

> sample.int(6, 10, replace = TRUE)
 [1] 6 6 2 5 4 4 5 1 4 5

> sample.int(100, 10)
 [1] 46 72 92 25 45 90 98 11 44 51

> set.seed(42)

> sample_int(6, 10, replace = TRUE)
 [1] 6 6 2 5 4 4 5 1 4 5

> sample_int(100, 10)
 [1] 46 72 92 25 45 90 98 11 44 51


But there is no bias with the alternative method:

> m <- ceiling((2/5)*2^32)

> set.seed(42)

> x <- sample.int(m, 100, replace = TRUE)

> table(x %% 2)

 0  1
467768 532232

> set.seed(42)

> y <- sample_int(m, 100, replace = TRUE)

> table(y %% 2)

 0  1
500586 499414


The differences are also visible when sampling only a few values from
'm' possible values:

> set.seed(42)

> sample.int(m, 6, replace = TRUE)
[1] 1571624817 1609883303  491583978 1426698159 1102510407  891800051

> set.seed(42)

> sample_int(m, 6, replace = TRUE)
[1]  491583978 1426698159 1102510407  891800051 1265449090  231355453


When sampling from 'm', performance is not so good since we often have
to get a second random number:

> bench::mark(orig = sample.int(m, 100, replace = TRUE),
+ new  = sample_int(m, 100, replace = TRUE),
+ check = FALSE)
# A tibble: 2 x 14
  expression minmean  median   max `itr/sec` mem_alloc  n_gc n_itr
   
1 orig8.15ms  8.67ms  8.43ms  10ms 115. 3.82MB 452
2 new25.21ms 25.58ms 25.45ms  27ms  39.13.82MB 218
# ... with 5 more variables: total_time , result , memory
,
#   time , gc 


When sampling from fewer values, the difference is much less pronounced:

> bench::mark(orig = sample.int(6, 100, replace = TRUE),
+ new  = sample_int(6, 100, replace = TRUE),
+ check = FALSE)
# A tibble: 2 x 14
  expression minmean  median max `itr/sec` mem_alloc  n_gc n_itr
   
1 orig8.14ms  8.44ms  8.29ms  9.58ms 118. 3.82MB 454
2 new11.13ms 11.66ms 11.23ms 12.98ms  85.83.82MB 339
# ... with 5 more variables: total_time , result , memory
,
#   time , gc 


> Another useful diagnostic is
> 
>   plot(density(y[y %% 2 == 0]))
> 
> Obviously that should give a more or less uniform density, but for
> values near m, the default sample() gives some nice pretty pictures of
> quite non-uniform densities.

Indeed. Adding/subtracting numbers < 10 to/from 'm'  gives "interesting"
curves.

> By the way, there are actually quite a few examples of very large m
> besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)
> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> 3a)*2^32, etc.
> 
> So perhaps I'm starting to be convinced that the default sample() should
> be fixed.

I have the impression that Lemire's method gives the same results unless
it is correcting for the bias that exists in the current method. If that
is really the case, then the disruption should be rather minor. The
ability to fall back to the old behavior would still be useful, though.

cheerio
ralf

-- 
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: ralf.stub...@daqana.com

Sitz: Potsdam
Register: AG Potsdam HRB 27966 P
Ust.-IdNr.: DE300072622
Geschäftsführer: Prof. Dr. Dr. Karl-Kuno Kunze


signature.asc
Description: OpenPGP digital signature
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Steve Grubb
Hello,

Top posting. Several people have asked about the code to replicate my 
results. I have cleaned up the code to remove an x/y coordinate bias for 
displaying the results directly on a 640 x 480 VGA adapter. You can find the 
code here:

http://people.redhat.com/sgrubb/files/vseq.c

To collect R samples:
X <- runif(1, min = 0, max = 65535)
write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE)

Then:
cat ~/r-rand.txt | ./vseq > ~/r-rand.csv

And then to create the chart:

library(ggplot2);
num.csv <- read.csv("~/random.csv", header=T)
qplot(X, Y, data=num.csv);

Hope this helps sort this out.

Best Regards,
-Steve

On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote:
> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
> > On 20/09/2018 6:59 AM, Ralf Stubner wrote:
> > > On 9/20/18 1:43 AM, Carl Boettiger wrote:
> > >> For a well-tested C algorithm, based on my reading of Lemire, the
> > >> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part
> > >> already of the C standard library in OpenBSD and macOS (as
> > >> arc4random_uniform), and in the GNU standard library.  Lemire also
> > >> provides C++ code in the appendix of his piece for both this and the
> > >> faster "nearly divisionless" algorithm.
> > >> 
> > >> It would be excellent if any R core members were interested in
> > >> considering bindings to these algorithms as a patch, or might express
> > >> expectations for how that patch would have to operate (e.g. re
> > >> Duncan's
> > >> comment about non-integer arguments to sample size).  Otherwise, an R
> > >> package binding seems like a good starting point, but I'm not the
> > >> right
> > >> volunteer.
> > > 
> > > It is difficult to do this in a package, since R does not provide
> > > access
> > > to the random bits generated by the RNG. Only a float in (0,1) is
> > > available via unif_rand().
> > 
> > I believe it is safe to multiply the unif_rand() value by 2^32, and take
> > the whole number part as an unsigned 32 bit integer.  Depending on the
> > RNG in use, that will give at least 25 random bits.  (The low order bits
> > are the questionable ones.  25 is just a guess, not a guarantee.)
> > 
> > However, if one is willing to use an external
> > 
> > > RNG, it is of course possible. After reading about Lemire's work [1], I
> > > had planned to integrate such an unbiased sampling scheme into the
> > > dqrng
> > > package, which I have now started. [2]
> > > 
> > > Using Duncan's example, the results look much better:
> > >> library(dqrng)
> > >> m <- (2/5)*2^32
> > >> y <- dqsample(m, 100, replace = TRUE)
> > >> table(y %% 2)
> > >> 
> > >   0  1
> > > 
> > > 500252 499748
> > 
> > Another useful diagnostic is
> > 
> >plot(density(y[y %% 2 == 0]))
> > 
> > Obviously that should give a more or less uniform density, but for
> > values near m, the default sample() gives some nice pretty pictures of
> > quite non-uniform densities.
> > 
> > By the way, there are actually quite a few examples of very large m
> > besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
> > You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)
> > * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
> > 3a)*2^32, etc.
> > 
> > So perhaps I'm starting to be convinced that the default sample() should
> > be fixed.
> 
> I find this discussion fascinating. I normally test random numbers in
> different languages every now and again using various methods. One simple
> check that I do is to use Michal Zalewski's method when he studied Strange
> Attractors and Initial TCP/IP Sequence Numbers:
> 
> http://lcamtuf.coredump.cx/newtcp/
> https://pdfs.semanticscholar.org/
> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf
> 
> The technique works by mapping the dynamics of the generated numbers into a
> three-dimensional phase space. This is then plotted in a graph so that you
> can visually see if something odd is going on.
> 
> I used   runif(1, min = 0, max = 65535)  to get a set of numbers. This
> is the resulting plot that was generated from R's numbers using this
> technique:
> 
> http://people.redhat.com/sgrubb/files/r-random.jpg
> 
> And for comparison this was generated by collecting the same number of
> samples from the bash shell:
> 
> http://people.redhat.com/sgrubb/files/bash-random.jpg
> 
> The net result is that it shows some banding in the R generated random
> numbers where bash has uniform random numbers with no discernible pattern.
> 
> Best Regards,
> -Steve
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Radford Neal
> Duncan Murdoch:
>
> and you can see it in the original m with
> 
>x <- sample(m, 100, replace = TRUE)
>plot(density(x[x %% 2 == 0]))

OK.  Thanks.  I see there is a real problem.

One option to fix it while mostly retaining backwards-compatibility
would be to add extra bits from a second RNG call only when m is large
- eg, larger than 2^27.  That would retain reproducibility for most
analyses of small to moderate size data sets.  Of course, there would
still be some small, detectable error for values a bit less than 2^27,
but perhaps that's tolerable.  (The 2^27 threshold could obviously be
debated.)

R Core made a similar decision in the case of sampling with
replacement when implementing a new hashing algorithm that produces
different results.  It is enabled by default only when m > 1e7 and no
more than half the values are to be sampled, as was noted:

> Note that it wouldn't be the first time that sample() changes behavior
> in a non-backward compatible way:
>
>   https://stat.ethz.ch/pipermail/r-devel/2012-October/065049.html
>
> Cheers,
> H.

That incompatibility could have been avoided.  A year ago I posted
a fast hashing algorithm that produces the same results as the simple
algorithm, here:

  https://stat.ethz.ch/pipermail/r-devel/2017-October/075012.html

The latest version of this will be in the soon-to-be new release of
pqR, and will of course enabled automatically whenever it seems
desirable, for a considerable speed gain in many cases.

  Radford Neal

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Dirk Eddelbuettel


On 21 September 2018 at 09:50, Steve Grubb wrote:
| Hello,
| 
| Top posting. Several people have asked about the code to replicate my 
| results. I have cleaned up the code to remove an x/y coordinate bias for 
| displaying the results directly on a 640 x 480 VGA adapter. You can find the 
| code here:
| 
| http://people.redhat.com/sgrubb/files/vseq.c
| 
| To collect R samples:
| X <- runif(1, min = 0, max = 65535)
| write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE)
| 
| Then:
| cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
| 
| And then to create the chart:
| 
| library(ggplot2);
| num.csv <- read.csv("~/random.csv", header=T)
| qplot(X, Y, data=num.csv);
| 
| Hope this helps sort this out.

Here is a simpler version in one file, combining the operations in a bit of
C++ and also an auto-executed piece of R to create the data, call the
transformation you created, and plot.  Store as eg /tmp/stevegrubb.cpp
and then call   Rcpp::sourceCpp("/tmp/stevegrubb.cpp")

Dirk

--- snip 
#include 

#define SCALING 0.5

// [[Rcpp::export]]
Rcpp::NumericMatrix stevePlot(Rcpp::NumericVector itable) {
  size_t cnt = itable.size();
  Rcpp::NumericMatrix M(cnt,2);
  size_t num = 0;
  double nth, n1th = 0, n2th = 0, n3th = 0;
  double x, y;
  double usex, usey, pha = 0;

  
  usex = sin(pha);
  usey = cos(pha);
  while (num < cnt) {
double x1, y1, z1;
nth = itable[num];
x1 = (nth - n1th)  * SCALING;
y1 = (n1th - n2th) * SCALING;
z1 = (n2th - n3th) * SCALING;
x = (x1*usey+z1*usex);
y = y1 + (z1*usey-x1*usex) / 2;
if (num > 4) {
  M(num,0) = x;
  M(num,1) = y;
} else {
  M(num,0) = M(num,1) = NA_REAL;
}
num++;
n3th=n2th; n2th=n1th; n1th=nth;
  }
  Rcpp::colnames(M) = Rcpp::CharacterVector::create("X", "Y");
  return(M);
}

/*** R
library(ggplot2);
Xin <- runif(1, min = 0, max = 65535)
Xout <- stevePlot(Xin);
qplot(X, Y, data=as.data.frame(Xout));
*/
--- snip 


-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Dirk Eddelbuettel


Slightly nicer version:

--- snip 
#include 

// [[Rcpp::export]]
Rcpp::DataFrame stevePlot(Rcpp::NumericVector itable) {
  size_t cnt = itable.size(), num = 0;
  double nth, n1th = 0, n2th = 0, n3th = 0;
  double x, y, usex, usey, pha = 0;
  std::vector xv, yv;
  const double scaling = 0.5;
  usex = sin(pha);
  usey = cos(pha);
  while (num < cnt) {
double x1, y1, z1;
nth = itable[num];
num++;
x1 = (nth - n1th)  * scaling;
y1 = (n1th - n2th) * scaling;
z1 = (n2th - n3th) * scaling;
x = (x1*usey+z1*usex);
y = y1 + (z1*usey-x1*usex) / 2;
if (num > 4) {
  xv.push_back(x);
  yv.push_back(y);
}
n3th=n2th; n2th=n1th; n1th=nth;
  }
  return(Rcpp::DataFrame::create(Rcpp::Named("X") = xv,
 Rcpp::Named("Y") = yv));
}

/*** R
library(ggplot2)
Xin <- runif(1, min = 0, max = 65535)
Xout <- stevePlot(Xin)
qplot(X, Y, data=Xout)
*/
--- snip 

Still just   Rcpp::sourceCpp("filenamehere.cpp")   it.

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Tierney, Luke
Not sure what should happen theoretically for the code in vseq.c, but
I see the same pattern with the R generators I tried (default,
Super-Duper, and L'Ecuyer) and with with bash $RANDOM using

N <- 1
X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE)))
X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = TRUE)))
X <- X1 + 2 ^ 15 * (X2 > 2^14)

and with numbers from random.org

library(random)
X <- randomNumbers(N, 0, 2^16-1, col = 1)

So I'm not convinced there is an issue.

Best,

luke

On Fri, 21 Sep 2018, Steve Grubb wrote:

> Hello,
>
> Top posting. Several people have asked about the code to replicate my
> results. I have cleaned up the code to remove an x/y coordinate bias for
> displaying the results directly on a 640 x 480 VGA adapter. You can find the
> code here:
>
> http://people.redhat.com/sgrubb/files/vseq.c
>
> To collect R samples:
> X <- runif(1, min = 0, max = 65535)
> write.table(X, file = "~/r-rand.txt", sep = "\n", row.names = FALSE)
>
> Then:
> cat ~/r-rand.txt | ./vseq > ~/r-rand.csv
>
> And then to create the chart:
>
> library(ggplot2);
> num.csv <- read.csv("~/random.csv", header=T)
> qplot(X, Y, data=num.csv);
>
> Hope this helps sort this out.
>
> Best Regards,
> -Steve
>
> On Thursday, September 20, 2018 5:09:23 PM EDT Steve Grubb wrote:
>> On Thursday, September 20, 2018 11:15:04 AM EDT Duncan Murdoch wrote:
>>> On 20/09/2018 6:59 AM, Ralf Stubner wrote:
 On 9/20/18 1:43 AM, Carl Boettiger wrote:
> For a well-tested C algorithm, based on my reading of Lemire, the
> unbiased "algorithm 3" in https://arxiv.org/abs/1805.10941 is part
> already of the C standard library in OpenBSD and macOS (as
> arc4random_uniform), and in the GNU standard library.  Lemire also
> provides C++ code in the appendix of his piece for both this and the
> faster "nearly divisionless" algorithm.
>
> It would be excellent if any R core members were interested in
> considering bindings to these algorithms as a patch, or might express
> expectations for how that patch would have to operate (e.g. re
> Duncan's
> comment about non-integer arguments to sample size).  Otherwise, an R
> package binding seems like a good starting point, but I'm not the
> right
> volunteer.

 It is difficult to do this in a package, since R does not provide
 access
 to the random bits generated by the RNG. Only a float in (0,1) is
 available via unif_rand().
>>>
>>> I believe it is safe to multiply the unif_rand() value by 2^32, and take
>>> the whole number part as an unsigned 32 bit integer.  Depending on the
>>> RNG in use, that will give at least 25 random bits.  (The low order bits
>>> are the questionable ones.  25 is just a guess, not a guarantee.)
>>>
>>> However, if one is willing to use an external
>>>
 RNG, it is of course possible. After reading about Lemire's work [1], I
 had planned to integrate such an unbiased sampling scheme into the
 dqrng
 package, which I have now started. [2]

 Using Duncan's example, the results look much better:
> library(dqrng)
> m <- (2/5)*2^32
> y <- dqsample(m, 100, replace = TRUE)
> table(y %% 2)
>
   0  1

 500252 499748
>>>
>>> Another useful diagnostic is
>>>
>>>plot(density(y[y %% 2 == 0]))
>>>
>>> Obviously that should give a more or less uniform density, but for
>>> values near m, the default sample() gives some nice pretty pictures of
>>> quite non-uniform densities.
>>>
>>> By the way, there are actually quite a few examples of very large m
>>> besides m = (2/5)*2^32 where performance of sample() is noticeably bad.
>>> You'll see problems in y %% 2 for any integer a > 1 with m = 2/(1 + 2a)
>>> * 2^32, problems in y %% 3 for m = 3/(1 + 3a)*2^32 or m = 3/(2 +
>>> 3a)*2^32, etc.
>>>
>>> So perhaps I'm starting to be convinced that the default sample() should
>>> be fixed.
>>
>> I find this discussion fascinating. I normally test random numbers in
>> different languages every now and again using various methods. One simple
>> check that I do is to use Michal Zalewski's method when he studied Strange
>> Attractors and Initial TCP/IP Sequence Numbers:
>>
>> http://lcamtuf.coredump.cx/newtcp/
>> https://pdfs.semanticscholar.org/
>> adb7/069984e3fa48505cd5081ec118ccb95529a3.pdf
>>
>> The technique works by mapping the dynamics of the generated numbers into a
>> three-dimensional phase space. This is then plotted in a graph so that you
>> can visually see if something odd is going on.
>>
>> I used   runif(1, min = 0, max = 65535)  to get a set of numbers. This
>> is the resulting plot that was generated from R's numbers using this
>> technique:
>>
>> http://people.redhat.com/sgrubb/files/r-random.jpg
>>
>> And for comparison this was generated by collecting the same number of
>> samples from the bash shell:
>>
>> http://people.redhat.com/sgrubb/files/bash-random.jpg
>>
>> The 

Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Ralf Stubner
On 9/21/18 6:38 PM, Tierney, Luke wrote:
> Not sure what should happen theoretically for the code in vseq.c, but
> I see the same pattern with the R generators I tried (default,
> Super-Duper, and L'Ecuyer) and with with bash $RANDOM using
> 
> N <- 1
> X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = 
> TRUE)))
> X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern = 
> TRUE)))
> X <- X1 + 2 ^ 15 * (X2 > 2^14)
> 
> and with numbers from random.org
> 
> library(random)
> X <- randomNumbers(N, 0, 2^16-1, col = 1)
> 
> So I'm not convinced there is an issue.
There is an issue, but it is in vseq.c.

The plot I found striking was this:

http://people.redhat.com/sgrubb/files/r-random.jpg

It shows a scatter plot that is bounded to some rectangle where the
upper right and lower left corner are empty. Roughly speaking, X and Y
correspond to *consecutive differences* between random draws. It is
obvious that differences between random draws are bounded by the range
of the RNG, and that there cannot be two *differences in a row* that are
close to the maximum (or minimum). Hence the expected shape for such a
scatter plot is a rectangle with two corners being forbidden.

Within the allowed region, there should be no structure what so ever
(given enough draws). And that was striking about the above picture: It
showed clear vertical bands which should not be there. MT does fail some
statistical tests, but it cannot be brought down that easily.

Interestingly, I first used Dirk's C++ function for convenience, and
that did *not* show these bands. But when I compiled vseq.c I could
reproduce this. To cut this short: There is an error in vseq.c when the
numbers are read in:

tmp = strtoul(buf, NULL, 16);

The third argument to strtoul is the base in which the numbers should be
interpreted. However, R has written numbers with base 10. Those can be
interpreted as base 16, but they will mean something different. Once one
changes the above line to

tmp = strtoul(buf, NULL, 10);

the bands do disappear.

cheerio
ralf

-- 
Ralf Stubner
Senior Software Engineer / Trainer

daqana GmbH
Dortustraße 48
14467 Potsdam

T: +49 331 23 61 93 11
F: +49 331 23 61 93 90
M: +49 162 20 91 196
Mail: ralf.stub...@daqana.com

Sitz: Potsdam
Register: AG Potsdam HRB 27966 P
Ust.-IdNr.: DE300072622
Geschäftsführer: Prof. Dr. Dr. Karl-Kuno Kunze



signature.asc
Description: OpenPGP digital signature
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Steve Grubb
On Friday, September 21, 2018 12:38:15 PM EDT Tierney, Luke wrote:
> Not sure what should happen theoretically for the code in vseq.c, but
> I see the same pattern with the R generators I tried (default,
> Super-Duper, and L'Ecuyer) and with with bash $RANDOM using
> 
> N <- 1
> X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern =
> TRUE))) X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'",
> intern = TRUE))) X <- X1 + 2 ^ 15 * (X2 > 2^14)
> 
> and with numbers from random.org
> 
> library(random)
> X <- randomNumbers(N, 0, 2^16-1, col = 1)
> 
> So I'm not convinced there is an issue.

Indeed. I found the issue: Hex vs Decimal output. If the bash command was
printf "0x%08x\n" $RANDOM, then everything is fine. So, the code in vseq.c 
needs to be  strtoul(buf, NULL, 0);  so that it handles Hex, Decimal, or 
Octal. With this correction to vseq all is fine. No banding in any generated 
image. Corrected program is uploaded.

So, the technique works. When the numbers had a problem, due to a conversion 
error, it showed a distinct pattern. :-)

Best Regards,
-Steve

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bias in R's random integers?

2018-09-21 Thread Steve Grubb
On Friday, September 21, 2018 5:28:38 PM EDT Ralf Stubner wrote:
> On 9/21/18 6:38 PM, Tierney, Luke wrote:
> > Not sure what should happen theoretically for the code in vseq.c, but
> > I see the same pattern with the R generators I tried (default,
> > Super-Duper, and L'Ecuyer) and with with bash $RANDOM using
> > 
> > N <- 1
> > X1 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'", intern =
> > TRUE))) X2 <- replicate(N, as.integer(system("bash -c 'echo $RANDOM'",
> > intern = TRUE))) X <- X1 + 2 ^ 15 * (X2 > 2^14)
> > 
> > and with numbers from random.org
> > 
> > library(random)
> > X <- randomNumbers(N, 0, 2^16-1, col = 1)
> > 
> > So I'm not convinced there is an issue.
> 
> There is an issue, but it is in vseq.c.
> 
> The plot I found striking was this:
> 
> http://people.redhat.com/sgrubb/files/r-random.jpg
> 
> It shows a scatter plot that is bounded to some rectangle where the
> upper right and lower left corner are empty. Roughly speaking, X and Y
> correspond to *consecutive differences* between random draws. It is
> obvious that differences between random draws are bounded by the range
> of the RNG, and that there cannot be two *differences in a row* that are
> close to the maximum (or minimum). Hence the expected shape for such a
> scatter plot is a rectangle with two corners being forbidden.
> 
> Within the allowed region, there should be no structure what so ever
> (given enough draws). And that was striking about the above picture: It
> showed clear vertical bands which should not be there. MT does fail some
> statistical tests, but it cannot be brought down that easily.
> 
> Interestingly, I first used Dirk's C++ function for convenience, and
> that did *not* show these bands. But when I compiled vseq.c I could
> reproduce this. To cut this short: There is an error in vseq.c when the
> numbers are read in:
> 
> tmp = strtoul(buf, NULL, 16);
> 
> The third argument to strtoul is the base in which the numbers should be
> interpreted. However, R has written numbers with base 10. Those can be
> interpreted as base 16, but they will mean something different. Once one
> changes the above line to
> 
> tmp = strtoul(buf, NULL, 10);
> 
> the bands do disappear.

Yes. I just discovered the problem also. I was looking at how my bash script 
worked fine and how the example Luke gave had a problem. I was using the 
print command to keep things in hex. A corrected copy was uploaded so no one 
else runs across this.

Best Regards,
-Steve

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel