Re: [Rd] Parallel R CMD check?

2012-02-17 Thread Prof Brian Ripley

On 17/02/2012 06:50, Martin Morgan wrote:

Running R CMD check on a package can take quite a lot of time. Checks
seem like they could be run in parallel (separate processes for, e.g.,
codoc, examples, tests, ...). Is there a way to do this? My current
usage is typically

R CMD build 
R CMD check pkg_x.y.z.tar.gz

Thanks for any hints,


Not at present.  It would need a lot of re-organization of the check.R 
code to collect output and present it in a reasonable order.  I rather 
doubt it is worth the effort: most of us with machines with large numbers 
of cores are not just checking one package at a time, and for many 
packages with long check times it is one aspect of the check which takes 
most of the time.


We have considered running separate tests and vignette R code in 
parallel: it was one of the motivations of having package 'parallel'. 
But even then people complained when we batched up all the test output 
and only reported it when all the tests had been run.




Martin



--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



[Rd] building r-devel

2012-02-17 Thread Ben Bolker

  I'm sure I'm being an absolute bonehead (again), but can someone
suggest what I might be doing wrong?

 Trying to build the latest r-devel on Ubuntu 10.04; as recommended in
various places, I'm trying to build it in a separate build directory.
(Since starting to compose this e-mail I've found that it works if I
build in the original source directory [which is advised against], so
this has become a less urgent question, but I'm still curious what I'm
doing wrong.)  I configure, make (see details below), and eventually get
"/usr/bin/install: cannot create regular file
`../../include/Rinternals.h': No such file or directory
make[2]: *** [Rinternals.ts] Error 1".


  Any thoughts about what's going on here or what I can do to further
diagnose the problem?

  thanks
Ben Bolker

==

bolker@ubuntu-10:~/R/r-devel$ svn update
At revision 58379.

cd ../r-build
make distclean
../r-devel/configure


R is now configured for i686-pc-linux-gnu

  Source directory:  ../r-devel
  Installation directory:/usr/local

  C compiler:gcc -std=gnu99  -g -O2
  Fortran 77 compiler:   gfortran  -g -O2

  C++ compiler:  g++  -g -O2
  Fortran 90/95 compiler:gfortran -g -O2
  Obj-C compiler:   

  Interfaces supported:  X11, tcltk
  External libraries:readline
  Additional capabilities:   PNG, JPEG, NLS, cairo
  Options enabled:   shared BLAS, R profiling, Java

  Recommended packages:  yes

make

bolker@ubuntu-10:~/R/r-build$ make
make[1]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/m4'
make[1]: Nothing to be done for `R'.
make[1]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/m4'
make[1]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/tools'
make[1]: Nothing to be done for `R'.
make[1]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/tools'
make[1]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/doc'
make[2]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/doc/html'
make[3]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/doc/html'
make[3]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/doc/html'
make[2]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/doc/html'
make[2]: Entering directory
`/mnt/hgfs/bolker/Documents/R/r-build/doc/manual'
make[2]: Nothing to be done for `R'.
make[2]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/doc/manual'
make[2]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/doc'
make[2]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/doc'
make[1]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/doc'
make[1]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/etc'
make[1]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/etc'
make[1]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/share'
make[2]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/share'
mkdir -p -- ../share/R
mkdir -p -- ../share/encodings
mkdir -p -- ../share/java
mkdir -p -- ../share/licenses
mkdir -p -- ../share/make
mkdir -p -- ../share/sh
mkdir -p -- ../share/texmf
mkdir -p -- ../share/texmf/bibtex/bib
mkdir -p -- ../share/texmf/bibtex/bst
mkdir -p -- ../share/texmf/tex/latex
make[2]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/share'
make[1]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/share'
make[1]: Entering directory `/mnt/hgfs/bolker/Documents/R/r-build/src'
make[2]: Entering directory
`/mnt/hgfs/bolker/Documents/R/r-build/src/scripts'
creating src/scripts/R.fe
make[3]: Entering directory
`/mnt/hgfs/bolker/Documents/R/r-build/src/scripts'
mkdir -p -- ../../bin
make[3]: Leaving directory
`/mnt/hgfs/bolker/Documents/R/r-build/src/scripts'
make[2]: Leaving directory
`/mnt/hgfs/bolker/Documents/R/r-build/src/scripts'
make[2]: Entering directory
`/mnt/hgfs/bolker/Documents/R/r-build/src/include'
/usr/bin/install: cannot create regular file
`../../include/Rinternals.h': No such file or directory
make[2]: *** [Rinternals.ts] Error 1
make[2]: Leaving directory
`/mnt/hgfs/bolker/Documents/R/r-build/src/include'
make[1]: *** [R] Error 1
make[1]: Leaving directory `/mnt/hgfs/bolker/Documents/R/r-build/src'
make: *** [R] Error 1



Re: [Rd] Parallel R CMD check?

2012-02-17 Thread Martin Morgan

On 02/17/2012 01:42 AM, Prof Brian Ripley wrote:

On 17/02/2012 06:50, Martin Morgan wrote:

Running R CMD check on a package can take quite a lot of time. Checks
seem like they could be run in parallel (separate processes for, e.g.,
codoc, examples, tests, ...). Is there a way to do this? My current
usage is typically

R CMD build 
R CMD check pkg_x.y.z.tar.gz

Thanks for any hints,


Not at present. It would need a lot of re-organization of the check.R
code to collect output and present it in a reasonable order. I rather
doubt it is worth the effort: most of us with machines with large numbers
of cores are not just checking one package at a time, and for many
packages with long check times it is one aspect of the check which takes
most of the time.


OK thank you. My own 'issue' is that the package takes ~ 20s to load 
(because of dependencies). The check process seems to load the package 
at least 10 times, so 200s in package loading. I'm thinking of this from 
a package developer perspective, rather than checking many packages. 
Obviously from my end I can work to reduce the dependencies (and their 
loading times).
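For reference, a quick way to see where that load time goes is to time each
dependency in a fresh session; a small sketch, with placeholder package names:

deps <- c("dep1", "dep2", "yourPkg")   # placeholders for the real dependencies
## time how long each namespace takes to load (run in a fresh R session)
sapply(deps, function(p) system.time(loadNamespace(p))[["elapsed"]])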


Martin




We have considered running separate tests and vignette R code in
parallel: it was one of the motivations of having package 'parallel'.
But even then people complained when we batched up all the test output
and only reported it when all the tests had been run.



Martin






--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



Re: [Rd] Parallel R CMD check?

2012-02-17 Thread Paul Gilbert



On 12-02-17 01:19 PM, Martin Morgan wrote:

On 02/17/2012 01:42 AM, Prof Brian Ripley wrote:

On 17/02/2012 06:50, Martin Morgan wrote:

Running R CMD check on a package can take quite a lot of time. Checks
seem like they could be run in parallel (separate processes for, e.g.,
codoc, examples, tests, ...). Is there a way to do this? My current
usage is typically

R CMD build 
R CMD check pkg_x.y.z.tar.gz

Thanks for any hints,


Not at present. It would need a lot of re-organization of the check.R
code to collect output and present it in a reasonable order. I rather
doubt it is worth the effort: most of us with machines with large numbers
of cores are not just checking one package at a time, and for many
packages with long check times it is one aspect of the check which takes
most of the time.


OK thank you. My own 'issue' is that the package takes ~ 20s to load
(because of dependencies). The check process seems to load the package
at least 10 times, so 200s in package loading. I'm thinking of this from
a package developer perspective, rather than checking many packages.
Obviously from my end I can work to reduce the dependencies (and their
loading times).


For my own package development testing I use a make target that does 
something like

R CMD BATCH --vanilla tests/$(notdir $@) $@.tmp

and then make -j does all the tests in parallel. Of course, when the 
package is actually working this means that I run the tests twice: once 
individually with this target, and again when I do R CMD check (then 
again on R-forge, and then on CRAN, where Kurt catches something I 
missed).  So, when you are really in development mode the above target 
is useful, because you get to the failure point quickly, but if 
everything is working then it is slower overall.


Happy to share more details off-line if you want.

Paul


Martin




We have considered running separate tests and vignette R code in
parallel: it was one of the motivations of having package 'parallel'.
But even then people complained when we batched up all the test output
and only reported it when all the tests had been run.



Martin










Re: [Rd] building r-devel

2012-02-17 Thread Dirk Eddelbuettel

Ben,

On 17 February 2012 at 11:31, Ben Bolker wrote:
| 
|   I'm sure I'm being an absolute bonehead (again), but can someone
| suggest what I might be doing wrong?
| 
|  Trying to build latest r-devel on Ubuntu 10.04; as recommended in

Happy to take this off-list --- I do build every now and then from R-devel's
SVN because R CMD check really wants to be done against both R-release and
R-devel, as that is what Vienna does...

I just invoke one simple shell script (which regroups the various configure
options) and have --prefix set to /usr/local/lib/R-devel, and all this works
pretty reliably.  Like you, I do build from the SVN dir. No issues AFAICT.

Dirk

-- 
"Outside of a dog, a book is a man's best friend. Inside of a dog, it is too
dark to read." -- Groucho Marx



[Rd] portable parallel seeds project: request for critiques

2012-02-17 Thread Paul Johnson
I've got another edition of my simulation replication framework.  I'm
attaching 2 R files and pasting in the readme.

I would especially like to know if I'm doing anything that breaks
.Random.seed or other things that R's parallel uses in the
environment.

In case you don't want to wrestle with attachments, the same files are
online in our SVN

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/


## Paul E. Johnson CRMDA 
## Portable Parallel Seeds Project.
## 2012-02-18

Portable Parallel Seeds Project

This is how I'm going to recommend we work with random number seeds in
simulations. It enhances work that requires runs with random numbers,
whether the runs are in a cluster computing environment or on a single
workstation.

It is a solution for two separate problems.

Problem 1. I scripted up 1000 R runs and need high quality,
unique, replicable random streams for each one. Each simulation
runs separately, but I need to be confident their streams are
not correlated or overlapping. For replication, I need to be able to
select any run, say 667, and restart it exactly as it was.

Problem 2. I've written a parallel MPI (Message Passing Interface)
routine that launches 1000 runs and I need to assure each has
a unique, replicable, random stream. I need to be able to
select any run, say 667, and restart it exactly as it was.

This project develops one approach to create replicable simulations.
It blends ideas about seed management from John M. Chambers
Software for Data Analysis (2008) with ideas from the snowFT
package by Hana Sevcikova and Tony R. Rossini.


Here's my proposal.

1. Run a preliminary program to generate an array of seeds

run1:     seed1.1     seed1.2     seed1.3
run2:     seed2.1     seed2.2     seed2.3
run3:     seed3.1     seed3.2     seed3.3
...       ...         ...         ...
run1000:  seed1000.1  seed1000.2  seed1000.3

This example provides 3 separate streams of random numbers within each
run. Because we will use the L'Ecuyer "many separate streams"
approach, we are confident that there is no correlation or overlap
between any of the runs.

The projSeeds object has to have one row per project, but it is not a huge
file. I created seeds for 2000 runs of a project that requires 2 seeds
per run.  The saved size of the file is 104443kb, which is very small. By
comparison, a 1400x1050 jpg image would usually be twice that size.
If you save 10,000 runs' worth of seeds, the size rises to 521,993kb,
still pretty small.

Because the seeds are saved in a file, we are sure each
run can be replicated. We just have to teach each program
how to use the seeds. That is step two.
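As a rough illustration of what such a seed-table generator could look like
using the stream functions in package 'parallel' (a sketch of the idea only,
not the actual seedCreator.R; the counts and file name just mirror the
description above):

library(parallel)

RNGkind("L'Ecuyer-CMRG")
set.seed(234234)

nRuns <- 2000          # illustrative counts, mirroring the description above
streamsPerRun <- 3

projSeeds <- vector("list", nRuns)
s <- .Random.seed
for (run in seq_len(nRuns)) {
  row <- vector("list", streamsPerRun)
  for (j in seq_len(streamsPerRun)) {
    s <- nextRNGStream(s)   # advance to the next independent L'Ecuyer stream
    row[[j]] <- s
  }
  projSeeds[[run]] <- row
}
save(projSeeds, file = "projSeeds.rda")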


2. Inside each run, an initialization function runs that loads the
seeds file and takes the row of seeds that it needs.  As the
simulation progresses, the user can ask for random numbers from the
separate streams. When we need random draws from a particular stream,
we set the variable "currentStream" with the function useStream().

The function initSeedStreams creates several objects in
the global environment. It sets the integer currentStream,
as well as two list objects, startSeeds and currentSeeds.
At the outset of the run, startSeeds and currentSeeds
are the same thing. When we change the currentStream
to a different stream, the currentSeeds vector is
updated to remember where that stream was when we stopped
drawing numbers from it.
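Read literally, the two helpers might behave something like the sketch
below; this is only a guess at the mechanics just described, not the code
in the attached files:

initSeedStreams <- function(run) {
  currentStream <<- 1L
  startSeeds    <<- projSeeds[[run]]   # one pre-generated row of seeds
  currentSeeds  <<- projSeeds[[run]]
  assign(".Random.seed", startSeeds[[1L]], envir = .GlobalEnv)
}

useStream <- function(n, origin = FALSE) {
  ## remember where the stream we are leaving stopped
  currentSeeds[[currentStream]] <<- get(".Random.seed", envir = .GlobalEnv)
  currentStream <<- n
  newSeed <- if (origin) startSeeds[[n]] else currentSeeds[[n]]
  assign(".Random.seed", newSeed, envir = .GlobalEnv)
}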


Now, for the proof of concept. A working example.

Step 1. Create the Seeds. Review the R program

seedCreator.R

That creates the file "projSeeds.rda".


Step 2. Use one row of seeds per run.

Please review "controlledSeeds.R" to see an example usage
that I've tested on a cluster.

"controlledSeeds.R" can also be run on a single workstation for
testing purposes.  There is a variable "runningInMPI" which determines
whether the code is supposed to run on the RMPI cluster or just in a
single workstation.


The code for each run of the model begins by loading the
required libraries and loading the seed file, if it exists, or
generating a new "projSeed" object if it is not found.

library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)
if (file.exists("projSeeds.rda")) {
  load("projSeeds.rda")
} else {
  source("seedCreator.R")
}

## Suppose the "run" number is:
run <- 232
initSeedStreams(run)

After that, R's random generator functions will draw values
from the first random stream that was initialized
in projSeeds. When each repetition (run) occurs,
R looks up the right seed for that run, and uses it.

If the user wants to begin drawing observations from the
second random stream, this command is used:

useStream(2)

If the user has drawn values from stream 1 already, but
wishes to begin again at the initial point in that stream,
use this command

useStream(1, origin = TRUE)


Question: Why is this approach better for parallel runs?

Answer: After a batch of simulations, we can re-start any
one of them and repeat it exactly. This builds on the idea
of the snowFT package, by Hana Sevcikova and A.J. Rossini.
Re: [Rd] executable files R package

2012-02-17 Thread sahir bhatnagar
thanks,
I will not submit to CRAN.

I am having trouble with how to include the .exe files in my package.
From the readings I see that the .exe files must be placed in a 'src'
folder, but I don't see how I can access those files from R without having
to specify their path in the R command 'system'. I would like the user to
only have to input a data file, which is then used by the .exe file.

My problem is the following:
Create a function which has two user inputs, i.e. datasets D1.txt and D2.txt.
I have two '.exe' files, i.e. E1.exe and E2.exe.

E1 takes in D1 and then outputs a text file, say "text";

then E2 takes in D2 and "text", and outputs the result.

Can this be done (even if it means having two functions) and then assembled
in a package, without the user having to specify the path of the .exe
files, or of the "text" file that is output from running E1?

Any direction as to how I can go about creating a package that would
include these '.exe' files? I have only found documentation on calling .C
code from R.

Any help is much appreciated

On Wed, Feb 15, 2012 at 10:14 AM, Duncan Murdoch
wrote:

> On 13/02/2012 2:36 PM, sahir bhatnagar wrote:
>
>> I am in the process of creating a package in R which calls
>> pre-compiled C code i.e. '.exe' files in Windows.
>>
>> Since CRAN will not accept packages with binary code files, what are
>> my options to meet the requirements while still including the
>> executable file?
>>
>
> I think you should ask the CRAN administrators that, but my understanding
> is that they are unlikely to accept your package as described.  CRAN is
> interested in platform-neutral packages, and if you have an .exe, you're
> going to be Windows-only.
>
> If you include the source code for that .exe and put together the Makefile
> to compile it, then they'd be more receptive, and someone might offer help
> to get it to run on other platforms if it doesn't on your first attempt.
>
> If you don't want to include the .exe source (or can't), I think you
> should just publish it on your own web page.
>
> Duncan Murdoch
>
>  I read section 1.5.2 of the manual which mentions three options two of
>> which involve negotiating with CRAN administrators. The third
>> references the package Cairo which arranges to download additional
>> software, but I don't see how this will allow my package to get
>> accepted.
>>
>> It would seem that I need to ensure that my package works under both
>> architectures (32 and 64 bit).
>>
>> 1) Would this be sufficient to get it accepted?
>> 2) If so, does anyone have any documentation in performing this task,
>> or can someone point me in the right direction?
>>
>> I was told that 'arulesSequences' is an example of a CRAN package
>> which compiles executables. Was this package accepted because it
>> worked under both architectures? Or are there other reasons?
>>
>> thanks
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>
>
>




Re: [Rd] executable files R package

2012-02-17 Thread William Dunlap
If you put your prebuilt.exe into a directory under the
'source' package's inst directory, say yourPkg/inst/executables/win32,
then the installed package will have it in
yourPkg/executables/win32, and the user (via code you
write, presumably) can get the full path to the executable
in the installed package with
  system.file(package="yourPkg", "executables", "win32", "prebuilt.exe").
Paste the output of that into the command given to system().
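For instance, a wrapper along these lines (the file names and the arguments
passed to the executable are placeholders, not anything prescribed by R):

## hypothetical wrapper: locate the installed executable and run it on a
## data file, writing its output to a temporary file whose path is returned
runPrebuilt <- function(datafile) {
  exe <- system.file("executables", "win32", "prebuilt.exe",
                     package = "yourPkg", mustWork = TRUE)
  out <- tempfile(fileext = ".txt")
  system(paste(shQuote(exe), shQuote(datafile), shQuote(out)))
  out
}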

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com 

> -Original Message-
> From: r-devel-boun...@r-project.org [mailto:r-devel-boun...@r-project.org] On 
> Behalf Of sahir
> bhatnagar
> Sent: Friday, February 17, 2012 12:59 PM
> To: Duncan Murdoch
> Cc: r-devel@r-project.org
> Subject: Re: [Rd] executable files R package
> 
> thanks,
> I will not submit to CRAN.
> 
> I am having trouble going about including the .exe files in my package.
> >From the readings I see that the .exe files must be placed in a 'src'
> folder. But I don't see how I can access those files in R, without having
> to specify its path in the R command 'system'. I would like for the user to
> only have to input a data file, which is then used in the .exe file.
> 
> My problem is the following:
> Create a function which has two user inputs i.e. datasets: D1.txt and D2.txt
> I have two '.exe' files i.e. E1.exe and E2.exe
> 
> E1 takes in D1 and then outputs a text file say "text"
> 
> then E2 takes in D2 and the "text", which outputs the result.
> 
> Can this be done (even if it means having two functions) and then assembled
> in a package, without the user having to specify the path of the .exe
> files, as well as the "text" that is outputted from running E1?
> 
> Any direction as to how I can go about creating a package that would
> include these '.exe. files? I have only found documentation on calling .C
> code in R.
> 
> Any help is much appreciated
> 
> On Wed, Feb 15, 2012 at 10:14 AM, Duncan Murdoch
> wrote:
> 
> > On 13/02/2012 2:36 PM, sahir bhatnagar wrote:
> >
> >> I am in the process of creating a package in R which calls
> >> pre-compiled C code i.e. '.exe' files in Windows.
> >>
> >> Since CRAN will not accept packages with binary code files, what are
> >> my options to meet the requirements while still including the
> >> executable file?
> >>
> >
> > I think you should ask the CRAN administrators that, but my understanding
> > is that they are unlikely to accept your package as described.  CRAN is
> > interested in platform-neutral packages, and if you have an .exe, you're
> > going to be Windows-only.
> >
> > If you include the source code for that .exe and put together the Makefile
> > to compile it, then they'd be more receptive, and someone might offer help
> > to get it to run on other platforms if it doesn't on your first attempt.
> >
> > If you don't want to include the .exe source (or can't), I think you
> > should just publish it on your own web page.
> >
> > Duncan Murdoch
> >
> >  I read section 1.5.2 of the manual which mentions three options two of
> >> which involve negotiating with CRAN administrators. The third
> >> references the package Cairo which arranges to download additional
> >> software, but I don't see how this will allow my package to get
> >> accepted.
> >>
> >> It would seem that I need to ensure that my package works under both
> >> architectures (32 and 64 bit).
> >>
> >> 1) Would this be sufficient to get it accepted?
> >> 2) If so, does anyone have any documentation in performing this task,
> >> or can someone point me in the right direction?
> >>
> >> I was told that 'arulesSequences' is an example of a CRAN package
> >> while compiles executables. Was this package accepted because it
> >> worked under both architectures? or are there other reasons.
> >>
> >> thanks
> >>
> >> __
> >> R-devel@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>
> >
> >
> 
>   [[alternative HTML version deleted]]
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] portable parallel seeds project: request for critiques

2012-02-17 Thread Paul Gilbert

Paul

I think (perhaps incorrectly) of the general problem as being that one 
wants to run a random experiment on a single node, or two nodes, or ten 
nodes, or any number of nodes, and reliably be able to reproduce the 
experiment without concern about how many nodes it ran on when you 
re-run it.


From your description I don't have the impression your solution would 
do that. Am I misunderstanding?


A second problem is that you want to use a proven algorithm for 
generating the numbers. This is implicitly solved by the above, because 
you always get the same result as you do on one node with a well proven 
RNG. If you generate a string of seeds and then numbers from those, do 
you have a proven RNG?


Paul

On 12-02-17 03:57 PM, Paul Johnson wrote:

I've got another edition of my simulation replication framework.  I'm
attaching 2 R files and pasting in the readme.

I would especially like to know if I'm doing anything that breaks
.Random.seed or other things that R's parallel uses in the
environment.

In case you don't want to wrestle with attachments, the same files are
online in our SVN

http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/


## Paul E. Johnson CRMDA
## Portable Parallel Seeds Project.
## 2012-02-18

Portable Parallel Seeds Project

This is how I'm going to recommend we work with random number seeds in
simulations. It enhances work that requires runs with random numbers,
whether runs are in a cluster computing environment or in a single
workstation.

It is a solution for two separate problems.

Problem 1. I scripted up 1000 R runs and need high quality,
unique, replicable random streams for each one. Each simulation
runs separately, but I need to be confident their streams are
not correlated or overlapping. For replication, I need to be able to
select any run, say 667, and restart it exactly as it was.

Problem 2. I've written a Parallel MPI (Message Passing Interface)
routine that launches 1000 runs and I need to assure each has
a unique, replicatable, random stream. I need to be able to
select any run, say 667, and restart it exactly as it was.

This project develops one approach to create replicable simulations.
It blends ideas about seed management from John M. Chambers
Software for Data Analysis (2008) with ideas from the snowFT
package by Hana Sevcikova and Tony R. Rossini.


Here's my proposal.

1. Run a preliminary program to generate an array of seeds

run1:   seed1.1   seed1.2   seed1.3
run2:   seed2.1   seed2.2   seed2.3
run3:   seed3.1   seed3.2   seed3.3
...  ...   ...
run1000   seed1000.1  seed1000.2   seed1000.3

This example provides 3 separate streams of random numbers within each
run. Because we will use the L'Ecuyer "many separate streams"
approach, we are confident that there is no correlation or overlap
between any of the runs.

The projSeeds has to have one row per project, but it is not a huge
file. I created seeds for 2000 runs of a project that requires 2 seeds
per run.  The saved size of the file 104443kb, which is very small. By
comparison, a 1400x1050 jpg image would usually be twice that size.
If you save 10,000 runs-worth of seeds, the size rises to 521,993kb,
still pretty small.

Because the seeds are saved in a file, we are sure each
run can be replicated. We just have to teach each program
how to use the seeds. That is step two.


2. Inside each run, an initialization function runs that loads the
seeds file and takes the row of seeds that it needs.  As the
simulation progresses, the user can ask for random numbers from the
separate streams. When we need random draws from a particular stream,
we set the variable "currentStream" with the function useStream().

The function initSeedStreams creates several objects in
the global environment. It sets the integer currentStream,
as well as two list objects, startSeeds and currentSeeds.
At the outset of the run, startSeeds and currentSeeds
are the same thing. When we change the currentStream
to a different stream, the currentSeeds vector is
updated to remember where that stream was when we stopped
drawing numbers from it.


Now, for the proof of concept. A working example.

Step 1. Create the Seeds. Review the R program

seedCreator.R

That creates the file "projSeeds.rda".


Step 2. Use one row of seeds per run.

Please review "controlledSeeds.R" to see an example usage
that I've tested on a cluster.

"controlledSeeds.R" can also be run on a single workstation for
testing purposes.  There is a variable "runningInMPI" which determines
whether the code is supposed to run on the RMPI cluster or just in a
single workstation.


The code for each run of the model begins by loading the
required libraries and loading the seed file, if it exists, or
generating a new "projSeed" object if it is not found.

library(parallel)
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)
if (file.exists("projSeeds.rda")) {
   load("projSeeds.rda")
} else {
   source("seedCreator.R")
}

## Suppose the "run" number

Re: [Rd] portable parallel seeds project: request for critiques

2012-02-17 Thread Paul Johnson
On Fri, Feb 17, 2012 at 3:23 PM, Paul Gilbert  wrote:
> Paul
>
> I think (perhaps incorrectly) of the general problem being that one wants to
> run a random experiment, on a single node, or two nodes, or ten nodes, or
> any number of nodes, and reliably be able to reproduce the experiment
> without concern about how many nodes it runs on when you re-run it.
>
> From your description I don't have the impression your solution would do
> that. Am I misunderstanding?
>

Well, I think my approach does that!  Each time a function runs, it grabs
a pre-specified set of seed values and initializes the R .Random.seed
appropriately.

Since I take the pre-specified seeds from the L'Ecuyer et al approach
(cite below), I believe that
means each separate stream is dependably uncorrelated and non-overlapping, both
within a particular run and across runs.

> A second problem is that you want to use a proven algorithm for generating
> the numbers. This is implicitly solved by the above, because you always get
> the same result as you do on one node with a well proven RNG. If you
> generate a string of seed and then numbers from those, do you have a proven
> RNG?
>
Luckily, I think that part was solved by people other than me:

L'Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002) An
object-oriented random-number package with many long streams and
substreams. Operations Research 50, 1073-1075.
http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf



> Paul
>
>
> On 12-02-17 03:57 PM, Paul Johnson wrote:
>>
>> I've got another edition of my simulation replication framework.  I'm
>> attaching 2 R files and pasting in the readme.
>>
>> I would especially like to know if I'm doing anything that breaks
>> .Random.seed or other things that R's parallel uses in the
>> environment.
>>
>> In case you don't want to wrestle with attachments, the same files are
>> online in our SVN
>>
>>
>> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/
>>
>>
>> ## Paul E. Johnson CRMDA
>> ## Portable Parallel Seeds Project.
>> ## 2012-02-18
>>
>> Portable Parallel Seeds Project
>>
>> This is how I'm going to recommend we work with random number seeds in
>> simulations. It enhances work that requires runs with random numbers,
>> whether runs are in a cluster computing environment or in a single
>> workstation.
>>
>> It is a solution for two separate problems.
>>
>> Problem 1. I scripted up 1000 R runs and need high quality,
>> unique, replicable random streams for each one. Each simulation
>> runs separately, but I need to be confident their streams are
>> not correlated or overlapping. For replication, I need to be able to
>> select any run, say 667, and restart it exactly as it was.
>>
>> Problem 2. I've written a Parallel MPI (Message Passing Interface)
>> routine that launches 1000 runs and I need to assure each has
>> a unique, replicatable, random stream. I need to be able to
>> select any run, say 667, and restart it exactly as it was.
>>
>> This project develops one approach to create replicable simulations.
>> It blends ideas about seed management from John M. Chambers
>> Software for Data Analysis (2008) with ideas from the snowFT
>> package by Hana Sevcikova and Tony R. Rossini.
>>
>>
>> Here's my proposal.
>>
>> 1. Run a preliminary program to generate an array of seeds
>>
>> run1:   seed1.1   seed1.2   seed1.3
>> run2:   seed2.1   seed2.2   seed2.3
>> run3:   seed3.1   seed3.2   seed3.3
>> ...      ...       ...
>> run1000   seed1000.1  seed1000.2   seed1000.3
>>
>> This example provides 3 separate streams of random numbers within each
>> run. Because we will use the L'Ecuyer "many separate streams"
>> approach, we are confident that there is no correlation or overlap
>> between any of the runs.
>>
>> The projSeeds has to have one row per project, but it is not a huge
>> file. I created seeds for 2000 runs of a project that requires 2 seeds
>> per run.  The saved size of the file 104443kb, which is very small. By
>> comparison, a 1400x1050 jpg image would usually be twice that size.
>> If you save 10,000 runs-worth of seeds, the size rises to 521,993kb,
>> still pretty small.
>>
>> Because the seeds are saved in a file, we are sure each
>> run can be replicated. We just have to teach each program
>> how to use the seeds. That is step two.
>>
>>
>> 2. Inside each run, an initialization function runs that loads the
>> seeds file and takes the row of seeds that it needs.  As the
>> simulation progresses, the user can ask for random numbers from the
>> separate streams. When we need random draws from a particular stream,
>> we set the variable "currentStream" with the function useStream().
>>
>> The function initSeedStreams creates several objects in
>> the global environment. It sets the integer currentStream,
>> as well as two list objects, startSeeds and currentSeeds.
>> At the outset of the run, startSeeds and currentSeeds
>> are the same thing. When we change the currentStream
>> to a different 

Re: [Rd] portable parallel seeds project: request for critiques

2012-02-17 Thread Paul Gilbert

Ok, I guess I need to look more carefully.

Thanks,
Paul

On 12-02-17 04:44 PM, Paul Johnson wrote:

On Fri, Feb 17, 2012 at 3:23 PM, Paul Gilbert  wrote:

Paul

I think (perhaps incorrectly) of the general problem being that one wants to
run a random experiment, on a single node, or two nodes, or ten nodes, or
any number of nodes, and reliably be able to reproduce the experiment
without concern about how many nodes it runs on when you re-run it.

 From your description I don't have the impression your solution would do
that. Am I misunderstanding?



Well, I think my approach does that!  Each time a function runs, it grabs
a pre-specified set of seed values and initializes the R .Random.seed
appropriately.

Since I take the pre-specified seeds from the L'Ecuyer et al approach
(cite below), I believe that
means each separate stream is dependably uncorrelated and non-overlapping, both
within a particular run and across runs.


A second problem is that you want to use a proven algorithm for generating
the numbers. This is implicitly solved by the above, because you always get
the same result as you do on one node with a well proven RNG. If you
generate a string of seed and then numbers from those, do you have a proven
RNG?


Luckily, I think that part was solved by people other than me:

L'Ecuyer, P., Simard, R., Chen, E. J. and Kelton, W. D. (2002) An
object-oriented random-number package with many long streams and
substreams. Operations Research 50 1073–5.
http://www.iro.umontreal.ca/~lecuyer/myftp/papers/streams00.pdf




Paul


On 12-02-17 03:57 PM, Paul Johnson wrote:


I've got another edition of my simulation replication framework.  I'm
attaching 2 R files and pasting in the readme.

I would especially like to know if I'm doing anything that breaks
.Random.seed or other things that R's parallel uses in the
environment.

In case you don't want to wrestle with attachments, the same files are
online in our SVN


http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/


## Paul E. Johnson CRMDA
## Portable Parallel Seeds Project.
## 2012-02-18

Portable Parallel Seeds Project

This is how I'm going to recommend we work with random number seeds in
simulations. It enhances work that requires runs with random numbers,
whether runs are in a cluster computing environment or in a single
workstation.

It is a solution for two separate problems.

Problem 1. I scripted up 1000 R runs and need high quality,
unique, replicable random streams for each one. Each simulation
runs separately, but I need to be confident their streams are
not correlated or overlapping. For replication, I need to be able to
select any run, say 667, and restart it exactly as it was.

Problem 2. I've written a Parallel MPI (Message Passing Interface)
routine that launches 1000 runs and I need to assure each has
a unique, replicatable, random stream. I need to be able to
select any run, say 667, and restart it exactly as it was.

This project develops one approach to create replicable simulations.
It blends ideas about seed management from John M. Chambers
Software for Data Analysis (2008) with ideas from the snowFT
package by Hana Sevcikova and Tony R. Rossini.


Here's my proposal.

1. Run a preliminary program to generate an array of seeds

run1:   seed1.1   seed1.2   seed1.3
run2:   seed2.1   seed2.2   seed2.3
run3:   seed3.1   seed3.2   seed3.3
...  ...   ...
run1000   seed1000.1  seed1000.2   seed1000.3

This example provides 3 separate streams of random numbers within each
run. Because we will use the L'Ecuyer "many separate streams"
approach, we are confident that there is no correlation or overlap
between any of the runs.

The projSeeds has to have one row per project, but it is not a huge
file. I created seeds for 2000 runs of a project that requires 2 seeds
per run.  The saved size of the file 104443kb, which is very small. By
comparison, a 1400x1050 jpg image would usually be twice that size.
If you save 10,000 runs-worth of seeds, the size rises to 521,993kb,
still pretty small.

Because the seeds are saved in a file, we are sure each
run can be replicated. We just have to teach each program
how to use the seeds. That is step two.


2. Inside each run, an initialization function runs that loads the
seeds file and takes the row of seeds that it needs.  As the
simulation progresses, the user can ask for random numbers from the
separate streams. When we need random draws from a particular stream,
we set the variable "currentStream" with the function useStream().

The function initSeedStreams creates several objects in
the global environment. It sets the integer currentStream,
as well as two list objects, startSeeds and currentSeeds.
At the outset of the run, startSeeds and currentSeeds
are the same thing. When we change the currentStream
to a different stream, the currentSeeds vector is
updated to remember where that stream was when we stopped
drawing numbers from it.


Now, for the proof of concept. A w

Re: [Rd] portable parallel seeds project: request for critiques

2012-02-17 Thread Petr Savicky
On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
> I've got another edition of my simulation replication framework.  I'm
> attaching 2 R files and pasting in the readme.
> 
> I would especially like to know if I'm doing anything that breaks
> .Random.seed or other things that R's parallel uses in the
> environment.
> 
> In case you don't want to wrestle with attachments, the same files are
> online in our SVN
> 
> http://winstat.quant.ku.edu/svn/hpcexample/trunk/Ex66-ParallelSeedPrototype/
> 
> 
> ## Paul E. Johnson CRMDA 
> ## Portable Parallel Seeds Project.
> ## 2012-02-18
> 
> Portable Parallel Seeds Project
> 
> This is how I'm going to recommend we work with random number seeds in
> simulations. It enhances work that requires runs with random numbers,
> whether runs are in a cluster computing environment or in a single
> workstation.
> 
> It is a solution for two separate problems.
> 
> Problem 1. I scripted up 1000 R runs and need high quality,
> unique, replicable random streams for each one. Each simulation
> runs separately, but I need to be confident their streams are
> not correlated or overlapping. For replication, I need to be able to
> select any run, say 667, and restart it exactly as it was.
> 
> Problem 2. I've written a Parallel MPI (Message Passing Interface)
> routine that launches 1000 runs and I need to assure each has
> a unique, replicatable, random stream. I need to be able to
> select any run, say 667, and restart it exactly as it was.
> 
> This project develops one approach to create replicable simulations.
> It blends ideas about seed management from John M. Chambers
> Software for Data Analysis (2008) with ideas from the snowFT
> package by Hana Sevcikova and Tony R. Rossini.
> 
> 
> Here's my proposal.
> 
> 1. Run a preliminary program to generate an array of seeds
> 
> run1:   seed1.1   seed1.2   seed1.3
> run2:   seed2.1   seed2.2   seed2.3
> run3:   seed3.1   seed3.2   seed3.3
> ...  ...   ...
> run1000   seed1000.1  seed1000.2   seed1000.3
> 
> This example provides 3 separate streams of random numbers within each
> run. Because we will use the L'Ecuyer "many separate streams"
> approach, we are confident that there is no correlation or overlap
> between any of the runs.
> 
> The projSeeds has to have one row per project, but it is not a huge
> file. I created seeds for 2000 runs of a project that requires 2 seeds
> per run.  The saved size of the file 104443kb, which is very small. By
> comparison, a 1400x1050 jpg image would usually be twice that size.
> If you save 10,000 runs-worth of seeds, the size rises to 521,993kb,
> still pretty small.
> 
> Because the seeds are saved in a file, we are sure each
> run can be replicated. We just have to teach each program
> how to use the seeds. That is step two.

Hi.

Some of the random number generators allow a vector as a seed,
not only a single number. This can simplify generating the seeds.
There can be one seed for each of the 1000 runs and then
the rows of the seed matrix can be

  c(seed1, 1), c(seed1, 2), ...
  c(seed2, 1), c(seed2, 2), ...
  c(seed3, 1), c(seed3, 2), ...
  ...

There could be even only one seed and the matrix can be generated as

  c(seed, 1, 1), c(seed, 1, 2), ...
  c(seed, 2, 1), c(seed, 2, 2), ...
  c(seed, 3, 1), c(seed, 3, 2), ...

If the initialization using the vector c(seed, i, j) is done
with a good quality hash function, the runs will be independent.

What is your opinion on this?

An advantage of seeding with a vector is also that there can
be significantly more initial states of the generator to select
from via the seed than 2^32, which is the maximum
for a single integer seed.
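As a rough illustration of the hashing step only (it still collapses to a
single integer for set.seed(), so it does not by itself give the larger
state space; the 'digest' package and this particular reduction are just
one way to do it):

library(digest)

## map a seed vector such as c(seed, i, j) to a single integer seed by
## hashing it; an illustration of the idea, not a vetted construction
seedFromVector <- function(v) {
  h <- digest(v, algo = "sha1")
  strtoi(substr(h, 1, 7), base = 16L)   # 28 bits, fits in an R integer
}

set.seed(seedFromVector(c(98765, 3, 1)))  # e.g. run 3, stream 1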

Petr Savicky.



Re: [Rd] portable parallel seeds project: request for critiques

2012-02-17 Thread Paul Johnson
On Fri, Feb 17, 2012 at 5:06 PM, Petr Savicky  wrote:
> On Fri, Feb 17, 2012 at 02:57:26PM -0600, Paul Johnson wrote:
> Hi.
>
> Some of the random number generators allow as a seed a vector,
> not only a single number. This can simplify generating the seeds.
> There can be one seed for each of the 1000 runs and then,
> the rows of the seed matrix can be
>
>  c(seed1, 1), c(seed1, 2), ...
>  c(seed2, 1), c(seed2, 2), ...
>  c(seed3, 1), c(seed3, 2), ...
>  ...
>
Yes, I understand.

The seeds I'm using are the 6 integer values from the L'Ecuyer generator.
If you run the example script, the verbose option causes some to print
out.  The first 3 seeds in a saved project seeds file look like:

> projSeeds[[1]]
[[1]]
[1] 407   376488316  1939487821  1433925148 -1040698333   579503880
[7]  -624878918

[[2]]
[1] 407 -1107332181   854177397  1773099324  1774170776  -266687360
[7]   816955059

[[3]]
[1] 407   936506900 -1924631332 -1380363206  2109234517  1585239833
[7] -1559304513

The 407 in the first position is an integer R uses to note the type of
stream for which the seed is intended, in this case L'Ecuyer-CMRG.
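The same structure can be inspected directly in an R session (the seed
value here is arbitrary):

## the first element of .Random.seed encodes the RNG kind in use; with
## RNGkind("L'Ecuyer-CMRG") it is 407, followed by the six CMRG state integers
RNGkind("L'Ecuyer-CMRG")
set.seed(234234)                        # arbitrary seed, for illustration
.Random.seed
parallel::nextRNGStream(.Random.seed)   # seed vector for the next independent stream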





> There could be even only one seed and the matrix can be generated as
>
>  c(seed, 1, 1), c(seed, 1, 2), ...
>  c(seed, 2, 1), c(seed, 2, 2), ...
>  c(seed, 3, 1), c(seed, 3, 2), ...
>
> If the initialization using the vector c(seed, i, j) is done
> with a good quality hash function, the runs will be independent.
>
I don't have any formal proof that a "good quality hash function"
would truly create seeds from which independent streams will be drawn.

There is, however, the proof in the L'Ecuyer paper that one can take
the long stream and divide it into sections.  That's the approach I'm
taking here. It's the same approach that the parallel package in R
follows, and that parallel frameworks like snow follow.

The difference in my approach is that I'm saving one row of seeds
per simulation "run", so each run can be replicated exactly.

I hope.

pj


> What is your opinion on this?
>
> An advantage of seeding with a vector is also that there can
> be significantly more initial states of the generator among
> which we select by the seed than 2^32, which is the maximum
> for a single integer seed.
>
> Petr Savicky.
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel



-- 
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
