In pvec(list(1, 2), FUN, mc.cores=2), FUN sees integer() arguments, whereas in pvec(list(1, 2, 3), FUN, mc.cores=2), FUN sees list() arguments; the latter seems consistent with pvec's description. The difference arises because, when length(v) <= mc.cores, the current code passes the elements of as.list(v) to FUN one at a time rather than as chunks of v.
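
One way to observe the difference -- a minimal sketch, assuming a fork-capable platform so that mc.cores = 2 actually forks; FUN here simply reports the class of what it receives:

    library(parallel)
    showClass <- function(x) class(x)
    pvec(list(1, 2), showClass, mc.cores = 2)     # elements passed one at a time
    pvec(list(1, 2, 3), showClass, mc.cores = 2)  # chunks passed as list()s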

This came up in a complicated Bioconductor thread about generics and parallel evaluation:

  https://stat.ethz.ch/pipermail/bioc-devel/2012-October/003745.html

One relevant point is that a lightweight rewrite of parallel/R/unix/pvec.R (below) appears to remove the requirement that an object v satisfy

  is.vector
  as.list
  [

and instead only requires

  [

This can be important when as.list() forces an inefficient representation of an object that can nonetheless be subset with [. The code below also seems more space-efficient, since v is not split into a second, chunked copy in the parent but only subset in the children.
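
For reference, splitIndices() (apparently unexported from parallel, hence the ::: here) computes the same fair chunking that the removed hand-rolled scheduling produced; each child then evaluates FUN on v[i] for its own index block:

    parallel:::splitIndices(10, 3)
    ## e.g. list(1:4, 5:7, 8:10); the exact split is an implementation detail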

Index: pvec.R
===================================================================
--- pvec.R      (revision 61012)
+++ pvec.R      (working copy)
@@ -21,8 +21,6 @@
  pvec <- function(v, FUN, ..., mc.set.seed = TRUE, mc.silent = FALSE,
                   mc.cores = getOption("mc.cores", 2L), mc.cleanup = TRUE)
  {
-    if (!is.vector(v)) stop("'v' must be a vector")
-
      env <- parent.frame()
      cores <- as.integer(mc.cores)
      if(cores < 1L) stop("'mc.cores' must be >= 1")
@@ -31,16 +29,7 @@
      if(mc.set.seed) mc.reset.stream()

      n <- length(v)
-    l <- if (n <= cores) as.list(v) else {
-        ## compute the scheduling, making it as fair as possible
-        il <- as.integer(n / cores)
-        xc <- n - il * cores
-        sl <- rep(il, cores)
-        if (xc) sl[1:xc] <- il + 1L
-        si <- cumsum(c(1L, sl))
-        se <- si + c(sl, 0L) - 1L
-        lapply(seq_len(cores), function(ix) v[si[ix]:se[ix]])
-    }
+    si <- splitIndices(n, cores)
      jobs <- NULL
      cleanup <- function() {
          ## kill children if cleanup is requested
@@ -59,8 +48,8 @@
      on.exit(cleanup())
      FUN <- match.fun(FUN)
      ## may have more cores than tasks ....
-    jobs <- lapply(seq_len(min(n, cores)),
-                   function(i) mcparallel(FUN(l[[i]], ...), name = i,
+    jobs <- lapply(si,
+                   function(i) mcparallel(FUN(v[i], ...),
                                            mc.set.seed = mc.set.seed,
                                            silent = mc.silent))
      res <- mccollect(jobs)


On 10/24/2012 05:07 PM, Cook, Malcolm wrote:
On 10/24/12 12:44 AM, "Michael Lawrence" <lawrence.mich...@gene.com> wrote:

I agree that it would be fruitful to have parLapply in BiocGenerics. It looks to be a flexible abstraction, and its presence in the parallel package makes it ubiquitous. If it hasn't been done already, mclapply (and mcmapply) would be good candidates as well. The fork-based parallelism is substantively different, in terms of the API, from the more general parallelism of parLapply.

Someone was working on some more robust and convenient wrappers around
mclapply. Did that ever see the light of day?


If you are referring to
http://thread.gmane.org/gmane.science.biology.informatics.conductor/43660

in which I had offered some small changes to parallel::pvec

        https://gist.github.com/3757873/

and after which Martin had provided me with a route I have not (yet?)
followed toward submitting a patch to R for consideration by R-devel /
Simon Urbanek in

http://grokbase.com/t/r/bioc-devel/129rbmxp5b/applying-over-granges-and-other-vectors-of-ranges#201209248dcn0tpwt7k7g9zsjr4dha6f1c




On Tue, Oct 23, 2012 at 12:13 PM, Steve Lianoglou <mailinglist.honey...@gmail.com> wrote:

In response to a question from yesterday, I pointed someone to the ShortRead `srapply` function and wondered to myself why it necessarily had to be "buried" in the ShortRead package (aside from it having an `sr` prefix).


I don't know that srapply necessarily 'got it right'...


One thing I like about srapply is its support for a reduce argument.
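
For readers who haven't used srapply: the general shape of a reduce argument is to fold chunk results together instead of returning an ever-growing list. A minimal sketch of the idea (illustrative only, not srapply's actual signature):

    applyReduce <- function(X, FUN, ..., reduce = c) {
        res <- lapply(X, FUN, ...)   # apply over chunks
        Reduce(reduce, res)          # fold results with the supplied function
    }
    applyReduce(1:4, function(i) i^2, reduce = `+`)  # 30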

I had thought it might be a good idea to move that (or something like
that) to BiocGenerics (unless implementations aren't allowed there)
but also realized that it would add more dependencies where someone
might not necessarily need them.



But, almost surely, a large majority of people will be happy to do some form of ||-ization, so in my mind it's not such an onerous thing to add -- on the other hand, this majority is probably enriched for people doing NGS analysis, in which case keeping it in ShortRead can make some sense.

I remain confused about the need for putting any of this into BiocGenerics at all. It seems to me that properly construed parallelization primitives ought to 'just work' with any object that supports indexing and length.

I would appreciate hearing arguments to the contrary.

Florian, in a similar vein, could we not seek to make parallel::makeCluster extensible, so that it could, say, support an SGE cluster? This seems like the 'right thing to do'. ???


Regardless, I think we have raised some considerations that might inform improvements to parallel, including points about error handling, reducing results, block-level parallelization over List/Vector (in addition to vector), etc.

I think perhaps having a Google doc that we can collectively edit, containing the requirements we are trying to achieve, might move us forward effectively. Would this help? Or perhaps a page under http://wiki.fhcrc.org/bioc/DeveloperPage/#discussions ?


Taking one step back, I recall some chatter last week (or two) about some better ||-ization "primitives" -- something about a pvec doo-dad, and there being ideas to wrap different types of ||-ization behind an easy-to-use interface (I think this was the convo). Then I took a further step back and often wonder why we don't just bite the bullet and take advantage of the `foreach` infrastructure that is already out there -- in which case, I could imagine a "doSGE" package that might handle the particulars of what Florian is referring to. You could then configure it externally via some `registerDoSGE(some.config.object)` and just have the package code happily run it through `foreach(...) %dopar%` and be done with it.
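
For concreteness, the pattern would look something like the sketch below; doSGE and registerDoSGE are hypothetical, but the shape mirrors existing backends such as doParallel:

    library(foreach)
    ## library(doSGE)                     # hypothetical backend package
    ## registerDoSGE(some.config.object)  # one-time, external configuration
    res <- foreach(i = 1:10, .combine = c) %dopar% sqrt(i)

Without a registered backend, %dopar% falls back to sequential evaluation with a warning, which is part of the appeal: the package code need not know which backend is in use.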


IMHO it is relevant. I have not looked for other abstractions, and this one seems to work. Florian's objectives might be a good test case for adequacy.


The registerDoDah pattern does seem to be a useful abstraction.

Is this not more-or-less the intention of parallel::setDefaultCluster?
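
If I read the docs right, setDefaultCluster registers a cluster that the snow-derived functions use when no cl is supplied, along these lines:

    library(parallel)
    cl <- makePSOCKcluster(2L)
    setDefaultCluster(cl)
    parLapply(X = 1:4, fun = sqrt)  # cl omitted: uses the registered default
    stopCluster(cl)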

--Malcolm




I think there's a lot of work to do for the sort of coordinated parallelization that putting parLapply into BiocGenerics might encourage; nothing good happens when everyone in a call stack tries to parallelize independently. But I'm in favor of parLapply in BiocGenerics, at least for the moment.

Martin




... at least, I thought this was what was being talked about here (and popped up a week or two ago) -- sorry if I completely missed the mark ...

-steve


On Tue, Oct 23, 2012 at 10:38 AM, Hahne, Florian
<florian.ha...@novartis.com> wrote:

Hi Martin,

I could define the generics in my own package, but that would mean that they will only be available there, or in the global environment assuming that I also export them, or in all additional packages that explicitly import them from my name space. Now there already are a whole bunch of packages around that all allow for parallelization via a cluster object. Obviously those all import the parLapply function from the parallel package. That means that I can't simply supply my own modified cluster object, because the code that calls parLapply will not know about the generic in my package, even if it is attached. Ideally, parLapply would already be a generic function in the parallel package. Not sure who needs to be convinced in order for this to happen, but my gut feeling was that it could be easier to have the generic in BiocGenerics.

Maybe I am missing something obvious here, but IMO there is no way to override parLapply globally for my own class unless the generic is imported by everyone who wants to make use of the special method.

Florian
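
To make the dispatch problem concrete, here is a minimal sketch of the kind of generic being described; the SGEcluster class and its method body are hypothetical:

    library(methods)
    ## promote parLapply to an S4 generic, keeping the parallel version
    ## as the default method
    setGeneric("parLapply",
               function(cl, X, fun, ...) standardGeneric("parLapply"))
    setMethod("parLapply", "ANY",
              function(cl, X, fun, ...) parallel::parLapply(cl, X, fun, ...))

    ## a hypothetical cluster-like class that dispatches to SGE instead
    setClass("SGEcluster", representation(queue = "character"))
    setMethod("parLapply", "SGEcluster",
              function(cl, X, fun, ...) {
                  ## submit to SGE (e.g. via BatchJobs), collect and
                  ## reduce results -- omitted in this sketch
                  stop("SGE submission not implemented in this sketch")
              })

Code that merely calls parLapply(cl, ...) would then dispatch on the class of cl -- provided it sees this generic rather than the plain function, which is exactly the name space problem described above.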
--






On 10/23/12 2:20 PM, "Martin Morgan" <mtmor...@fhcrc.org> wrote:

  On 10/17/2012 05:45 AM, Hahne, Florian wrote:

Hi all,

I was wondering whether it would be possible to have proper generics for some of the functions in the parallel package, e.g. parLapply and clusterCall. The reason I am asking is that I want to build an S4 class that essentially looks like an S3 cluster object but knows how to deal with the SGE. That way I can abstract away all the overhead regarding job submission, job status and reducing the results in the parLapply method of that class, and would be able to supply this new cluster object to all of my existing functions that can be processed in parallel using a cluster object as input. I have played around with the BatchJobs package as an abstraction layer to SGE, and that works nicely. As a test case I created the necessary generics myself in order to supply my own SGEcluster object to a function that normally deals with the "regular" parallel package S3 cluster objects, and everything just worked out of the box, but obviously this fails once I am in a name space and my generic is not found anymore. Of course what we would really want is some proper abstraction of parallelization in R, but for now this seems to be at least a cheap compromise. Any thoughts on this?



Hi Florian -- we talked about this locally, but I guess we didn't actually send any email!

Is there an obstacle to promoting these to generics in your own package? The usual motivation for inclusion in BiocGenerics has been to avoid conflicts between packages, but I'm not sure whether this is the case (yet)? This would also add a dependency fairly deep in the hierarchy.

What do you think?

Martin

  Florian









--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   | Memorial Sloan-Kettering Cancer Center
   | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact














--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
