I'm trying to do data grouping like you suggested. I will look into the data.table package, and I will also consider using a matrix instead of a data frame.
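For what it's worth, a rough base-R sketch of that matrix idea (assuming the same D/Fid/A columns used in the examples quoted below) could be as simple as:

    ## Rough sketch only: tapply() returns a plain matrix of group sums,
    ## indexed by the two grouping columns, so there is no per-group
    ## data-frame subsetting at all. Cells with no rows come back as NA.
    dat <- data.frame(D   = sample(32000:33000, 666000, TRUE),
                      Fid = sample(1:10, 666000, TRUE),
                      A   = sample(1:5, 666000, TRUE))
    sums <- tapply(dat$A, list(dat$D, dat$Fid), sum, na.rm = TRUE)
    dim(sums)        # a D-by-Fid matrix of group sums
    sums[1:3, 1:3]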
Thank you for your responses.

Thanks,
Rob

On Fri, Feb 26, 2010 at 3:21 PM, Tom Short <tshort.rli...@gmail.com> wrote:
> I'm sorry, Rob, but that code is dense enough and formatted badly
> enough that it's hard to dig through.
>
> You may want to try the data.table package. The development version on
> R-forge is pretty fast for grouping operations like this. I'm not sure
> if this is what you're really after. It's hard to tell from your
> example.
>
> Compare some speeds:
>
> > dat <- data.frame(D=sample(32000:33000, 666000, T),
> +                   Fid=sample(1:10, 666000, T),
> +                   A=sample(1:5, 666000, T))
>
> ####### one of your examples
> > system.time(ret <- fedb.ddplyWrapper2(dat, c("D", "Fid"),
> +     function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))))
>    user  system elapsed
>   21.78   14.42   36.35
>
> ####### data.table
> > install.packages("data.table", repos="http://R-Forge.R-project.org")
> > library(data.table)
> > dt <- as.data.table(dat)
> > system.time(ret2 <- dt[, sum(A, na.rm=T), by = "D,Fid"])
>    user  system elapsed
>    0.27    0.00    0.28
>
> ####### plyr for comparison, too
> > library(plyr)
> > system.time(ret3 <- ddply(dat, .(D, Fid), function(x) sum(x$A, na.rm=T)))
>    user  system elapsed
>   28.94   12.16   41.23
>
> > head(ret)
>   [,1] [,2]
> 1  175  175
> 2  222  222
> 3  221  221
> 4  134  134
> 5  253  253
> 6  194  194
>
> > head(ret2)
>          D Fid  V1
> [1,] 32000   1 228
> [2,] 32000   2 209
> [3,] 32000   3 182
> [4,] 32000   4 180
> [5,] 32000   5 181
> [6,] 32000   6 222
>
> > head(ret3)
>       D Fid  V1
> 1 32000   1 175
> 2 32000   2 222
> 3 32000   3 221
> 4 32000   4 134
> 5 32000   5 253
> 6 32000   6 194
>
> - Tom
>
> On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rfor...@uchicago.edu> wrote:
> > So I have a function that does lapply's for me based on dimension. It
> > currently only works for length(pivotColumns) == 2 because I haven't
> > fixed the rbinds. I have two versions, and one runs WAY faster than
> > the other; I'm not sure why.
> >
> > Fast version:
> >
> > fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, ...){
> >     lapplyFunctionRecurse <- function(cdata, level=1, ...){
> >         if (level == 1) {
> >             return(lapply(split(seq(nrow(cdata)), cdata[, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         } else if (level == length(pivotColumns)) {
> >             # return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >             #               function(x, ...) listNameFunctions(data[x,], ...)))
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x, ...) c(data[cdata[1], pivotColumns[2]],
> >                                              data[cdata[1], pivotColumns[1]],
> >                                              sum(data[cdata, "A"], na.rm=T),
> >                                              sum(data[cdata, "A"], na.rm=T))))
> >         } else {
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         }
> >     }
> >     result <- lapplyFunctionRecurse(data, ...)
> >     matrix2 <- do.call('rbind', lapply(result, function(x) do.call('rbind', x)))
> >     return(matrix2)
> > }
> >
> > dat <- data.frame(D=sample(32000:33000, 666000, T),
> >                   Fid=sample(1:10, 666000, T),
> >                   A=sample(1:5, 666000, T))
> > > temp = proc.time(); ret = fedb.ddplyWrapper2Fast(dat, c("D", "Fid"),
> >     function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); proc.time()-temp
> >    user  system elapsed
> >   4.616   0.006   4.630
> > # Note that in this case the anonymous function I pass in isn't used,
> > # because I hardcode the function into the lapply.
> >
> > Approximately 4 seconds. This runs very fast.
> > This runs very slow:
> >
> > fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){
> >     lapplyFunctionRecurse <- function(cdata, level=1, ...){
> >         if (level == 1) {
> >             return(lapply(split(seq(nrow(cdata)), cdata[, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         } else if (level == length(pivotColumns)) {
> >             # This line is different: it actually calls the function you pass in.
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x, ...) listNameFunctions(data[x,], ...)))
> >         } else {
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         }
> >     }
> >     result <- lapplyFunctionRecurse(data, ...)
> >     matrix2 <- do.call('rbind', lapply(result, function(x) do.call('rbind', x)))
> >     return(matrix2)
> > }
> >
> > dat <- data.frame(D=sample(32000:33000, 666000, T),
> >                   Fid=sample(1:10, 666000, T),
> >                   A=sample(1:5, 666000, T))
> > > temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> >     function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); proc.time()-temp
> >    user  system elapsed
> >  16.346  65.059  81.680
> >
> > Can anyone explain to me why there is a 4x time difference? I don't want
> > to have to hardcode the function into the recursion, but if I have to I will.
> >
> > Thanks,
> > Rob
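On the question of why the two versions differ so much: a likely contributor is that the slow version hands listNameFunctions() the full data-frame subset data[x, ], which allocates a new three-column data frame for every group, while the fast version only ever indexes the numeric A column. A small, hypothetical micro-benchmark (the names idx, A_col, r1, r2 are illustrative, not from the code above) sketches the difference:

    ## Build the same example data and one index vector per (D, Fid) group.
    dat <- data.frame(D   = sample(32000:33000, 666000, TRUE),
                      Fid = sample(1:10, 666000, TRUE),
                      A   = sample(1:5, 666000, TRUE))
    idx <- split(seq_len(nrow(dat)), list(dat$D, dat$Fid), drop = TRUE)

    ## Builds a new 3-column data frame for each of the roughly 10,000 groups.
    system.time(r1 <- sapply(idx, function(i) sum(dat[i, ]$A, na.rm = TRUE)))

    ## Indexes a plain numeric vector instead.
    A_col <- dat$A
    system.time(r2 <- sapply(idx, function(i) sum(A_col[i], na.rm = TRUE)))

    identical(r1, r2)   # same answers, very different timings

If the timings split the way the ones above do, hardcoding shouldn't be necessary: passing the user function a precomputed column (or vector of indices) rather than data[x, ] should close most of the gap.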