I think I found a solution. I do not like using global variables, for fear of 
unpredictable side effects, but I think that in this case I don't have much 
choice.

Here is a mock function that pushes the content of a variable evaluated within 
a function to the nodes of the cluster, does some computation on the nodes using 
that variable, and then returns the result after cleaning up the newly created 
global variable.

Let me know what you people think:

aTest <- function(x,n.nodes=2){
  library(snow)

  #initialize a cluster and keep a handle to it in cl
  cl <- makeCluster(rep('localhost',n.nodes),type='SOCK')

  #create a global variable
  y <<- x

  #export the variable to the cluster
  clusterExport(cl,'y')

  #do some computation on the cluster
  #(result stored in res rather than c, to avoid masking base::c)
  res <- clusterEvalQ(cl,y+2)

  #remove the variable from the global environment
  rm(y, envir=.GlobalEnv)

  #stop the cluster
  stopCluster(cl)

  #exit and return the computation
  return(res)
}
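
A possible variation that avoids the temporary global variable on the master is sketched below. It assumes the `parallel` package that ships with later versions of R, whose clusterExport() takes an `envir` argument naming the environment in which to look up the variable. Whether a given snow installation accepts the same argument should be checked; the name aTestLocal is made up for this example.

```r
library(parallel)  # assumption: parallel is used in place of snow

# Hypothetical variant of aTest() that never writes to the master's global env
aTestLocal <- function(x, n.nodes = 2) {
  cl <- makeCluster(n.nodes)             # socket cluster on the local machine
  on.exit(stopCluster(cl), add = TRUE)   # stop the cluster even if an error occurs

  y <- x                                 # ordinary local variable
  # look up 'y' in this function's frame instead of .GlobalEnv
  clusterExport(cl, "y", envir = environment())

  # each node evaluates y + 2 using its own copy of y
  clusterEvalQ(cl, y + 2)
}

res <- aTestLocal(40)
```

The value still ends up in each worker's global environment, but nothing needs to be created or removed on the master.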


On 11/29/08 6:59 PM, "Marco Blanchette" <[EMAIL PROTECTED]> wrote:

Dear R gurus,

I have an embarrassingly parallel job that I am trying to speed up 
with snow on our local cluster. Basically, I am doing ~50,000 t-tests for a 
series of microarray experiments, one gene at a time. Thus, I can easily 
spread the load across multiple processors and nodes.

So, I have a master list object that tells me which rows to pick for each 
gene to do the t-test, from a series of microarray experiments containing 
~500,000 rows and x columns per experiment.

While trying to optimize my function using parLapply(), I quickly realized that 
I was not gaining any speed because every time a test was done on one of the 
items in the list, the 500,000-row by x-column matrix had to be shipped along 
with the item, and the transfer time was actually longer than the 
computing time.
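
A rough way to see that cost is to compare how many bytes have to be serialized when a list item travels alone and when it travels together with the matrix. This is only a sketch with toy stand-ins: bigMatrix and oneItem are made-up names, and the matrix is far smaller than the real 500,000-row data.

```r
# Toy stand-ins (assumptions): a 5,000 x 4 matrix and one entry of the map list
bigMatrix <- matrix(rnorm(5000 * 4), nrow = 5000)
oneItem <- list(A = 1:3, B = 4:6)

# serialize() produces the raw bytes that would be written to a connection
bytesItemOnly <- length(serialize(oneItem, NULL))
bytesItemPlusMatrix <- length(serialize(list(oneItem, bigMatrix), NULL))

c(item = bytesItemOnly, item_plus_matrix = bytesItemPlusMatrix)
```

If the matrix has to accompany each of the ~50,000 items, the larger figure is paid again for every item.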

However, I can first export the 500,000-row object to the spawned processes, as 
in this mock script

cl <- makeCluster(nnodes,method)
mArrayData <- getData(experiments)
clusterExport(cl, 'mArrayData')

Results <- parLapply(cl, theMapList, function(x) t.testFnc(x))

along with a function that defines mArrayData as the default value of a parameter, as in

t.testFnc <- function(probeList, array=mArrayData){
    x <- array[probeList$A,]
    y <- array[probeList$B,]
    res <- doSomeTest(x,y)
    return(res)
}

Using this strategy, I was able to take full advantage of my cluster and reduce 
the analysis time by a factor equal to the number of nodes in our cluster. The 
large data matrix was resident in each process and didn't have to travel over 
the network every time an item from the list was passed to the function t.testFnc().

However, I quickly realized that this (the call to clusterExport()) works only 
when I run the script one line at a time. When the process is enclosed in a 
function, the object mArrayData is not exported, presumably because it's not a 
global object in the master process.
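
The scoping part of that guess can be checked without any cluster. The sketch below (wrapper and localData are made-up names) shows that a variable created inside a function is visible in that function's frame but not in the global environment. The global environment is where clusterExport() looks by default in the `parallel` package, and it is assumed here that snow behaves the same way, as the observation above suggests.

```r
# Hypothetical stand-in for a function that builds the data and then exports it
wrapper <- function() {
  localData <- matrix(0, 2, 2)   # lives only in this call's frame
  c(local  = exists("localData", envir = environment(), inherits = FALSE),
    global = exists("localData", envir = globalenv(), inherits = FALSE))
}

found <- wrapper()
found
```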

So, what is the alternative for pushing the content of an object to the slaves? The 
documentation in the snow package is a bit light and I couldn't find good 
examples out there. I don't want to have the function getData() evaluated on 
each node because the arguments to that function are humongous and that would 
cause way too much traffic on the network. I want the result of the function 
getData(), the object mArrayData, propagated to the cluster only once and made 
available to downstream functions.
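
One way this could be done from inside a function is sketched below. It is a minimal illustration, not a tested recipe for the real data: it uses the `parallel` package, on the assumption that its clusterCall() and parLapply() match the snow functions of the same names. storeOnWorker, tTestOne, runAll and the toy data are hypothetical, and stats::t.test stands in for doSomeTest(). The helpers are defined at top level on purpose, so that sending them to the workers does not drag the calling function's frame, and the matrix in it, along with them.

```r
library(parallel)  # assumption: parallel is used in place of snow

# Hypothetical helper: store `value` under `name` in the worker's global env
storeOnWorker <- function(name, value) {
  assign(name, value, envir = .GlobalEnv)
  invisible(NULL)
}

# Same shape as t.testFnc(): the default `array = mArrayData` is resolved
# on the worker, where mArrayData was stored beforehand
tTestOne <- function(probeList, array = mArrayData) {
  x <- as.numeric(array[probeList$A, ])
  y <- as.numeric(array[probeList$B, ])
  stats::t.test(x, y)$p.value
}

runAll <- function(data, mapList, n.nodes = 2) {
  cl <- makeCluster(n.nodes)
  on.exit(stopCluster(cl), add = TRUE)
  # the matrix is sent once per node, not once per list item
  clusterCall(cl, storeOnWorker, "mArrayData", data)
  parLapply(cl, mapList, tTestOne)
}

# Toy data (assumption): 20 probes x 10 arrays, two entries in the map list
set.seed(1)
toyData <- matrix(rnorm(200), nrow = 20)
toyMap <- list(list(A = 1:3, B = 4:6), list(A = 7:9, B = 10:12))
pvals <- runAll(toyData, toyMap)
```

Only the small list items and the function tTestOne travel with each parLapply() task; the matrix is already resident on every worker.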

Hope this is clear and that a solution will be possible.

Many thanks

Marco

--
Marco Blanchette, Ph.D.
Assistant Investigator
Stowers Institute for Medical Research
1000 East 50th St.

Kansas City, MO 64110

Tel: 816-926-4071
Cell: 816-726-8419
Fax: 816-926-2018

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

