[Rd] mcparallel / mccollect

2016-08-30 Thread Michel Lang
Hi there,

I've tried to implement an asynchronous job scheduler using
parallel::mcparallel() and parallel::mccollect(..., wait=FALSE). My
goal was to send processes to the background, leaving the R session
open for interactive use while all jobs store their results on the
file system. To keep track of the running jobs I've stored the process
ids and written a little helper that does not spawn new processes while
the maximum number of CPUs is still in use by jobs that have not yet
terminated.

Unfortunately, this turned out to be impossible with the current
implementation in parallel for a number of reasons:

1) The returned results are not named by process id or job name if
wait is set to FALSE.

2) The number of returned results depends on the state of computation:
If all or none of the jobs are finished, just NULL is returned. Otherwise
a list of the results collected so far is returned.

3) Combining (1) and (2) renders mapping the results to the stored
process ids impossible. E.g., if you query mccollect for the results
of 4 jobs and set wait=FALSE, you can get an unnamed list with one
result or a list with four results but in a different order.

4) An obvious workaround would be to wrap the expression to evaluate in
a function which attaches a unique identifier to the return value. This
way, one would not have to rely on process ids or job names. However,
each job has to be collected twice: the first time you get the result
(which is fine for the workaround), the second time you just get NULL.
And you have to collect them twice to free used resources -- at least
on unix systems.

Here is a small example to illustrate the current behavior:

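library(parallel)
# Each job sleeps for x seconds, then returns a short message.
f = function(x) { Sys.sleep(x); sprintf("job with x = %i", x) }
# Start two background jobs and remember their process ids.
jobs = integer()
jobs = c(jobs, mcparallel(f(10), name = "jobname1")$pid)
jobs = c(jobs, mcparallel(f(3), name = "jobname2")$pid)

# Poll once per second without blocking and print whatever comes back.
for (i in 1:13) {
  message("\ni = ", i)
  print(mccollect(jobs, wait = FALSE, timeout = 0))
  Sys.sleep(1)
}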
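
The workaround from point 4 could be sketched roughly as follows. This is
only an illustration under a few assumptions: tag_result() and the ids are
made up, the sleep times are shortened, and it needs a Unix-alike where
forking works. Each job returns a list carrying its own id, so the results
can be matched without relying on names or order.

```r
library(parallel)

# Hypothetical helper: attach a caller-chosen id to the job's value.
tag_result = function(id, value) list(id = id, value = value)
f = function(x) { Sys.sleep(x); sprintf("job with x = %i", x) }

jobs = list(
  mcparallel(tag_result("jobname1", f(2))),
  mcparallel(tag_result("jobname2", f(1)))
)

done = list()
for (i in 1:20) {                      # poll at most 20 times
  res = mccollect(jobs, wait = FALSE, timeout = 1)
  for (r in res) {
    # NULL entries carry no result; tagged lists are real results
    if (is.list(r) && !is.null(r$id)) done[[r$id]] = r$value
  }
  if (length(done) == length(jobs)) break
}
# Collect once more so that finished children can be released (point 4).
invisible(mccollect(jobs, wait = FALSE, timeout = 1))
print(done)
```

The second collect is still needed here; tagging only solves the mapping
problem, not the cleanup.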

I've created a small patch
()
which applies the same mechanism to name the results for wait=FALSE that
is already implemented for wait=TRUE. I think the documentation already
describes the behavior after my patch rather than the behavior before it.
A note on the need to collect results twice might still prove useful for
the future, though.

Thanks,
Michel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] mcparallel / mccollect

2016-08-30 Thread Simon Urbanek
Michel,

thanks, you're right that the list should have names. Your patch has the
match() part backwards, but is otherwise the right idea. I have committed a
variant in R-devel and will back-port later.
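
A minimal illustration of why the direction of match() matters when naming
partial results. This is not the actual patch or the committed code; pids,
pnames and s are hypothetical values standing for all job ids, their names,
and the ids that have delivered a result so far.

```r
pids   = c(101L, 102L, 103L)        # hypothetical process ids of all jobs
pnames = c("jobA", "jobB", "jobC")  # hypothetical job names, same order
s      = c(103L, 101L)              # ids with results, in arrival order

# Look up each finished id in the full id vector: one name per result.
right = pnames[match(s, pids)]
# Reversed arguments: one entry per job, with NA for unfinished jobs.
wrong = pnames[match(pids, s)]

print(right)
print(wrong)
```

Only the first form has the same length and order as the list of collected
results.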

Thanks,
Simon






[Rd] A bug in the R Mersenne Twister (RNG) code?

2016-08-30 Thread Mark Roberts
Whomever,

I recently sent the "bug report" below to r-c...@r-project.org and have
just been asked to instead submit it to you.

Although I am basically not an R user, I have installed version 3.3.1 
and am also the author of a statistics program written in Visual Basic 
that contains a component which correctly implements the Mersenne 
Twister (MT) algorithm.  I believe that it is not possible to generate 
the correct stream of pseudorandom numbers using the MT default random 
number generator in R, and am not the first person to notice this.  Here 
is a posted 2013 entry 
(www.r-bloggers.com/reproducibility-and-randomness/) on an R website 
that asserts that the SAS computer program implementation of the MT 
algorithm produces different numbers than R does when using the same 
starting seed number.  The author of this post didn’t get anyone to 
respond to his query about the reason for this SAS vs. R discrepancy.

There are two ways of initializing the original MT computer program 
(written in C) so that an identical stream of numbers can be repeatedly 
generated:  1) with a particular integer seed number, and 2) with a 
particular array of integers.   In the 'compilation and usage' section 
of this webpage (https://github.com/cslarsen/mersenne-twister) there is 
a listing of the first 200 random numbers the MT algorithm should 
produce for seed number = 1.  The inventors of the Mersenne Twister 
random number generator provided two different sets of the first 1000 
numbers produced by a correctly coded 32-bit implementation of the MT 
algorithm when initializing it with a particular array of integers at: 
www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.out. 
[There is a link to this output at: 
www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html.]

My statistics program obtains exactly those 200 numbers from the first 
site mentioned in the previous paragraph and also obtains those same 
numbers from the second website (though I didn't check all 2000 values).
Assuming that the MT code within R uses the 32-bit MT algorithm, I
suspect that the current version of R can't do that.  If you (i.e.,
anyone who might knowledgeably respond to this report) are able to
duplicate those reference test-values, then please send me the R code to 
initialize the MT code within R to successfully do that, and I apologize 
for having wasted your time. If you (collectively) can't do that, then R 
is very likely using incorrectly implemented MT code.  And if this 
latter possibility is true, it seems to me that this is something that 
should be fixed.

Mark Roberts, Ph.D.


Re: [Rd] A bug in the R Mersenne Twister (RNG) code?

2016-08-30 Thread William Dunlap via R-devel
Try comparing the streams when the 625-integer versions of the seeds
are identical.  (R's seed is 626 integers: omit the first value, which
indicates which random number generator the seed is for.)  I find that the
MKL Mersenne Twister results match R's (with occasional differences in the
last bit) when the 625-integer seeds are the same.

I believe R fiddles with the single-integer seed to spread it out a bit.
S's seed was taken modulo 1024, so old users tended not to use single seeds
bigger than 1023.
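
A short sketch of the seed layout described above, assuming the
Mersenne-Twister generator is selected. The variable names are arbitrary.
It splits off the first value and shows that restoring the full seed
replays the same stream.

```r
RNGkind("Mersenne-Twister")
set.seed(42)

seed = get(".Random.seed", envir = globalenv())
n_total = length(seed)       # full seed, including the generator code
kind_code = seed[1] %% 100L  # lowest two digits identify the generator
state = seed[-1]             # position counter plus the MT state words

a = runif(5)
assign(".Random.seed", seed, envir = globalenv())  # restore saved seed
b = runif(5)

print(c(n_total, length(state), kind_code))
print(identical(a, b))
```

Comparing two implementations then comes down to comparing `state` on
both sides rather than the single integer passed to set.seed().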


Bill Dunlap
TIBCO Software
wdunlap tibco.com



Re: [Rd] A bug in the R Mersenne Twister (RNG) code?

2016-08-30 Thread Duncan Murdoch
I don't see evidence of a bug.  There have been several versions of the 
MT; we may be using a different version than you are.  Ours is the 
1999/10/28 version; the web page you cite uses one from 2002.


Perhaps the newer version fixes some problems, and then it would be 
worth considering a change.  But changing the default RNG definitely 
introduces problems in reproducibility, so it's not obvious that we 
would do it.
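
For anyone who wants to check the 2002 reference values from within R, one
possible approach is to build the state with the 2002-style init_genrand()
seeding and write it into .Random.seed directly. This is only a sketch
under stated assumptions: mt2002_state() is a made-up helper, not part of
R; .Random.seed is assumed to hold the generator code, then a position
counter (624 meaning "regenerate before the next draw"), then the 624
state words as signed integers; and runif() is assumed to return the raw
32-bit output scaled by 2^-32.

```r
# Hypothetical helper: 2002-style init_genrand(seed), as signed integers.
mt2002_state = function(seed) {
  mt = numeric(624)
  mt[1] = seed %% 2^32
  for (i in 1:623) {
    prev = mt[i]
    # prev XOR (prev >> 30): the shifted value is 0..3, so only the
    # two lowest bits of prev can change.
    low = bitwXor(as.integer(prev %% 4), as.integer(prev %/% 2^30))
    x = prev - prev %% 4 + low
    # 32-bit multiply by 1812433253, split into 16-bit halves so that
    # every intermediate value stays exact in double precision.
    hi = x %/% 65536
    lo = x %% 65536
    mt[i + 1] = (((1812433253 * hi) %% 65536) * 65536 +
                 1812433253 * lo + i) %% 2^32
  }
  # .Random.seed stores the words as signed 32-bit integers.
  as.integer(ifelse(mt >= 2^31, mt - 2^32, mt))
}

RNGkind("Mersenne-Twister")
set.seed(1)                                   # make sure a seed exists
kind = get(".Random.seed", envir = globalenv())[1]
assign(".Random.seed", c(kind, 624L, mt2002_state(1)),
       envir = globalenv())

first = runif(3) * 2^32                       # undo the scaling
print(first, digits = 10)
```

If those assumptions hold, the printed values can be compared against the
seed = 1 listing mentioned in the original report. Only the seeding is
replaced here; the generation step is left to R's own code.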


Duncan Murdoch


