[R] help with execution of 'embarrassingly parallel' problem using foreach, doParallel on a windows system

Benjamin Caldwell Tue, 23 Apr 2013 17:02:20 -0700

Dear R helpers,

I have what another member on this forum described as
an embarrassingly parallel problem. I am trying to fit models on subsets of
some data based on unique combinations of two id factors in the dataset.
Total number of combinations is 30^5, and this takes a long time. So, I
would like fit models for each of the datasets produced by subsetting on
the unique combinations, splitting up the matrix of combinations into
pieces based on how many cores I have available to me (e.g. eight cores,
eight chunks).


I spent much of the day today trying to wrap myself around an expedient way
to do this on a windows system in R (seems like the 'best' way to do this
on systems that support forking is much more settled), and picked foreach
and doParallel based on the package documentation for foreach; I'm open to
other suggestions.

Having tried it out though, it doesn't seem to be doing anything to speed
up the task - indeed, it takes more time! Also, compiling the functions
results in the 'winner' in terms of the serial and (I
hope) parallel implementations being reversed.

Most importantly, what have I done wrong here and how can I make this work?

What's the interaction of compile and foreach I'm seeing?

How can I tell whether multi-threading is occurring?

Suggestions for another approach?

Example:

###############################################################################################
require(foreach)
require(doParallel)
require(compile)
require(compiler)
cl <- makeCluster(detectCores()-1)
registerDoParallel(cl)
# foreach(i=1:100) %dopar% sqrt(i)


d <- sort(rep(letters[1:24], 5))
e <- rep(1:24, 5)
rand.int <- rnorm(n=length(e),mean=e, sd=4)
f <- 3+ (e*rand.int)^2
g <- sort(rep(1:6, 20))

one <- data.frame(d,e,f,g)
names(one) <- c('block1', 'ind', 'res','block2')

one[1:50,]

two <- expand.grid(one[,1], one[,4]) #actually is
expand.grid(unique(one[,1]), unique(one[,4]))

str(two)
names(two) <- c('block1', 'block2')

fitting <- function(ndx.grd=two,dt.grd=one,ind.vr='ind',rsp.vr='res') {
ind.start<-10^8
item.out <- matrix(NA, ncol=3)
for(i in 1:length(ndx.grd[,1])){
 tmp1 <- as.character(ndx.grd[i,1])
tmp2 <- as.character(ndx.grd[i,2])
wk.grd <- dt.grd[as.character(dt.grd[,1])==tmp1 &
as.character(dt.grd[,4])==tmp2,]
# browser()
try(ind.out <- summary(lm(ind~res, data=wk.grd))$sigma, silent=TRUE)
 if(ind.out < ind.start) {
item.out[,1] <- tmp1
item.out[,2] <- tmp2
item.out[,3] <- ind.out
}
}
return(item.out)
}




fitting.c<-cmpfun(fitting)

#compiled
Rprof('myFunction.out', memory.profiling=T)
y <- fitting.c()
Rprof(NULL)
summaryRprof('myFunction.out', memory='both')
system.time(fitting.c())

Rprof('myFunction.out', memory.profiling=T)
y <- foreach(icount(length(two))) %dopar% fitting.c()
Rprof(NULL)
summaryRprof('myFunction.out', memory='both')
system.time(foreach(icount(length(two))) %dopar% fitting.c())


#uncompiled
Rprof('myFunction.out', memory.profiling=T)
y <- fitting()
Rprof(NULL)
summaryRprof('myFunction.out', memory='both')
system.time(fitting())

Rprof('myFunction.out', memory.profiling=T)
y <- foreach(icount(length(two))) %dopar% fitting()
Rprof(NULL)
summaryRprof('myFunction.out', memory='both')
system.time(foreach(icount(length(two))) %dopar% fitting())
###############################################################################################


*
*
*Ben Caldwell*

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] help with execution of 'embarrassingly parallel' problem using foreach, doParallel on a windows system

Reply via email to