[R] How to optimize this codes ? Daren Tan daren76 at hotmail.com Thu Dec 4 17:02:49 CET 2008
How to optimize the for-loop to be reasonably fast for sample.size=100000000 ? You may want to change sample.size=1000 to have an idea what I am achieving. set.seed(143) A <- matrix(sample(0:1, sample.size, TRUE), ncol=10, dimnames=list(NULL, LETTERS[1:10])) B <- list() for(i in 1:10) { B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) { sum(apply(data.frame(A[,x]), 1, all)) }) } Instead computing all(A[row,x]) each row of the matrix by looping over the rows you could loop over the columns. That is generally quicker if there are many more rows than columns. S+ and R have functions called pmin and pmax that do this for min and max, but no pany or pall functions. In your 0/1 case all is the same as min so you can replace apply(data.frame(A[,x]), 1, all) with sum(do.call("pmin", unname(A[,x,drop=FALSE]))) after first converting A to a data.frame before starting to compute B. (The do.call/unname combo is a hack to pass all the columns of a data.frame as separate arguments to a function. If pmin took a list that wouldn't be necessary.) When you do timings, looking at one size of problem often isn't helpful, as it doesn't tell you how the time depends on the size of the input. The relationship often is not linear. E.g., I put your method in a function called computeB0 and my modification of it into computeB1 and wrote a makeA that makes an A matrix with a given sample.size. Here are the times, it looks like 1000 is below the linear range for this example: sample.size time:computeB0 time:computeB1 1000 5.36 0.98 10000 36.82 2.49 100000 381.34 18.97 and here are the functions used makeA <- function(sample.size){ set.seed(143); A<-matrix(sample(0:1,size=sample.size,replace=TRUE), ncol=10, dimnames=list(NULL, LETTERS[1:10])); A } computeB0 <- function(A){B<-list() for(i in 1:10) { B[[i]] <- apply(combn(LETTERS[1:10], i), 2, function(x) { sum(apply(data.frame(A[,x]), 1, all)) }) } B } computeB1 <- function(A){ A<-as.data.frame(A) B<-list() for(i in 1:10) { B[[i]] <- apply(combn(LETTERS[1:10], i), 2, FUN=function(x) { sum(do.call("pmin", unname(A[,x,drop=FALSE]))) } ) } B } You could probably same more time by restructuring this as a recursive function, still operating on columns, that traversed the binary tree of column inclusion/exclusion. I just wanted to point out the advantage of looping over columns instead of looping over rows. sample.size=1e8 is pretty big for a 32-bit machine. You may have to do things like make A a data.frame (with integer columns) from the start to keep the memory requirements down. Putting the call to data.frame() inside a nested set of apply loop is a waste of time. Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.