I'm trying to do data grouping like you suggested. I will look into the data.table package, and I will also consider using a matrix instead of a data frame.
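For what it's worth, a rough base-R sketch of that matrix idea (assuming the same D/Fid/A columns used in the examples quoted below) could be as simple as:

    ## Rough sketch only: tapply() returns a plain matrix of group sums,
    ## indexed by the two grouping columns, so there is no per-group
    ## data-frame subsetting at all. Cells with no rows come back as NA.
    dat <- data.frame(D   = sample(32000:33000, 666000, TRUE),
                      Fid = sample(1:10, 666000, TRUE),
                      A   = sample(1:5, 666000, TRUE))
    sums <- tapply(dat$A, list(dat$D, dat$Fid), sum, na.rm = TRUE)
    dim(sums)        # a D-by-Fid matrix of group sums
    sums[1:3, 1:3]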
Thank you for your responses.

Thanks,
Rob

On Fri, Feb 26, 2010 at 3:21 PM, Tom Short <tshort.rli...@gmail.com> wrote:
> I'm sorry, Rob, but that code is dense enough and formatted badly
> enough that it's hard to dig through.
>
> You may want to try the data.table package. The development version on
> R-forge is pretty fast for grouping operations like this. I'm not sure
> if this is what you're really after. It's hard to tell from your
> example.
>
> Compare some speeds:
>
> > dat <- data.frame(D=sample(32000:33000, 666000, T),
> +                   Fid=sample(1:10, 666000, T),
> +                   A=sample(1:5, 666000, T))
>
> ####### one of your examples
> > system.time(ret <- fedb.ddplyWrapper2(dat, c("D", "Fid"),
> +     function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))))
>    user  system elapsed
>   21.78   14.42   36.35
>
> ####### data.table
> > install.packages("data.table", repos="http://R-Forge.R-project.org")
> > library(data.table)
> > dt <- as.data.table(dat)
> > system.time(ret2 <- dt[, sum(A, na.rm=T), by = "D,Fid"])
>    user  system elapsed
>    0.27    0.00    0.28
>
> ####### plyr for comparison, too
> > library(plyr)
> > system.time(ret3 <- ddply(dat, .(D, Fid), function(x) sum(x$A, na.rm=T)))
>    user  system elapsed
>   28.94   12.16   41.23
>
> > head(ret)
>   [,1] [,2]
> 1  175  175
> 2  222  222
> 3  221  221
> 4  134  134
> 5  253  253
> 6  194  194
>
> > head(ret2)
>          D Fid  V1
> [1,] 32000   1 228
> [2,] 32000   2 209
> [3,] 32000   3 182
> [4,] 32000   4 180
> [5,] 32000   5 181
> [6,] 32000   6 222
>
> > head(ret3)
>       D Fid  V1
> 1 32000   1 175
> 2 32000   2 222
> 3 32000   3 221
> 4 32000   4 134
> 5 32000   5 253
> 6 32000   6 194
>
> - Tom
>
> On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rfor...@uchicago.edu> wrote:
> > So I have a function that does lapply's for me based on dimension. It
> > currently only works for length(pivotColumns) == 2 because I haven't
> > fixed the rbinds. I have two versions, and one runs WAY faster than
> > the other; I'm not sure why.
> >
> > Fast version:
> >
> > fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions, ...){
> >     lapplyFunctionRecurse <- function(cdata, level=1, ...){
> >         if (level == 1) {
> >             return(lapply(split(seq(nrow(cdata)), cdata[, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         } else if (level == length(pivotColumns)) {
> >             # return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >             #               function(x, ...) listNameFunctions(data[x,], ...)))
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x, ...) c(data[cdata[1], pivotColumns[2]],
> >                                              data[cdata[1], pivotColumns[1]],
> >                                              sum(data[cdata, "A"], na.rm=T),
> >                                              sum(data[cdata, "A"], na.rm=T))))
> >         } else {
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         }
> >     }
> >     result <- lapplyFunctionRecurse(data, ...)
> >     matrix2 <- do.call('rbind', lapply(result, function(x) do.call('rbind', x)))
> >     return(matrix2)
> > }
> >
> > dat <- data.frame(D=sample(32000:33000, 666000, T),
> >                   Fid=sample(1:10, 666000, T),
> >                   A=sample(1:5, 666000, T))
> > > temp = proc.time(); ret = fedb.ddplyWrapper2Fast(dat, c("D", "Fid"),
> >     function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); proc.time()-temp
> >    user  system elapsed
> >   4.616   0.006   4.630
> > # Note that in this case the anonymous function I pass in isn't used,
> > # because I hardcode the function into the lapply.
> >
> > Approximately 4 seconds. This runs very fast.
> > This runs very slow:
> >
> > fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){
> >     lapplyFunctionRecurse <- function(cdata, level=1, ...){
> >         if (level == 1) {
> >             return(lapply(split(seq(nrow(cdata)), cdata[, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         } else if (level == length(pivotColumns)) {
> >             # This line is different: it actually calls the function you pass in.
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x, ...) listNameFunctions(data[x,], ...)))
> >         } else {
> >             return(lapply(split(cdata, data[cdata, pivotColumns[level]], drop=T),
> >                           function(x) lapplyFunctionRecurse(x, level+1, ...)))
> >         }
> >     }
> >     result <- lapplyFunctionRecurse(data, ...)
> >     matrix2 <- do.call('rbind', lapply(result, function(x) do.call('rbind', x)))
> >     return(matrix2)
> > }
> >
> > dat <- data.frame(D=sample(32000:33000, 666000, T),
> >                   Fid=sample(1:10, 666000, T),
> >                   A=sample(1:5, 666000, T))
> > > temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> >     function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T))); proc.time()-temp
> >    user  system elapsed
> >  16.346  65.059  81.680
> >
> > Can anyone explain to me why there is a 4x time difference? I don't want
> > to have to hardcode the function into the recursion, but if I have to I will.
> >
> > Thanks,
> > Rob
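On the question of why the two versions differ so much: a likely contributor is that the slow version hands listNameFunctions() the full data-frame subset data[x, ], which allocates a new three-column data frame for every group, while the fast version only ever indexes the numeric A column. A small, hypothetical micro-benchmark (the names idx, A_col, r1, r2 are illustrative, not from the code above) sketches the difference:

    ## Build the same example data and one index vector per (D, Fid) group.
    dat <- data.frame(D   = sample(32000:33000, 666000, TRUE),
                      Fid = sample(1:10, 666000, TRUE),
                      A   = sample(1:5, 666000, TRUE))
    idx <- split(seq_len(nrow(dat)), list(dat$D, dat$Fid), drop = TRUE)

    ## Builds a new 3-column data frame for each of the roughly 10,000 groups.
    system.time(r1 <- sapply(idx, function(i) sum(dat[i, ]$A, na.rm = TRUE)))

    ## Indexes a plain numeric vector instead.
    A_col <- dat$A
    system.time(r2 <- sapply(idx, function(i) sum(A_col[i], na.rm = TRUE)))

    identical(r1, r2)   # same answers, very different timings

If the timings split the way the ones above do, hardcoding shouldn't be necessary: passing the user function a precomputed column (or vector of indices) rather than data[x, ] should close most of the gap.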