Hi,

Jim's method was found to be faster than data.table():
n <- 10000
nLevels <- 10
nRows <- 120
Cols <- list(rep(list(sample(nRows)), n))
df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
colnames(df)[-1] <- paste0('col', 1:n)

# convert to matrix for faster processing
df.m <- as.matrix(df[, -1])  # remove levels column

system.time({
  # split the indices of rows for each level
  x <- split(seq(nrow(df)), df$levels)
  result <- sapply(x, function(a) colMeans(df.m[a, ]))
})
#  user  system elapsed
# 0.056   0.000   0.056

library(data.table)
df.dt <- data.table(df)
setkey(df.dt, levels)
system.time({
  result1 <- df.dt[, lapply(.SD, mean), by = levels]
})
#  user  system elapsed
# 7.756   0.000   7.771

system.time({
  result2 <- df.dt[, list(Mean = colMeans(.SD)), by = levels]
})
#  user  system elapsed
# 2.188   0.000   2.193

A.K.

----- Original Message -----
From: jim holtman <jholt...@gmail.com>
To: Martin Batholdy <batho...@googlemail.com>
Cc: "r-help@r-project.org" <r-help@r-project.org>
Sent: Tuesday, December 25, 2012 1:20 PM
Subject: Re: [R] aggregate / collapse big data frame efficiently

According to the way that you have used 'aggregate', you are taking the column means. A couple of suggestions for faster processing:

1. Use matrices instead of data.frames (I converted your example just before using it).
2. Use 'colMeans'.

I created a 120 x 100000 matrix with 10 levels and it does the computation in less than 2 seconds:

> n <- 100000
> nLevels <- 10
> nRows <- 120
> Cols <- list(rep(list(sample(nRows)), n))
> df <- data.frame(levels = sample(nLevels, nRows, TRUE), Cols)
> colnames(df)[-1] <- paste0('col', 1:n)
>
> # convert to matrix for faster processing
> df.m <- as.matrix(df[, -1])  # remove levels column
> str(df.m)
 int [1:120, 1:100000] 111 13 106 61 16 39 25 94 53 38 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
> system.time({
+   # split the indices of rows for each level
+   x <- split(seq(nrow(df)), df$levels)
+   result <- sapply(x, function(a) colMeans(df.m[a, ]))
+ })
   user  system elapsed
   1.33    0.00    1.35
> str(result)
 num [1:100000, 1:10] 57 57 57 57 57 57 57 57 57 57 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:100000] "col1" "col2" "col3" "col4" ...
  ..$ : chr [1:10] "1" "2" "3" "4" ...

On Tue, Dec 25, 2012 at 11:34 AM, Martin Batholdy <batho...@googlemail.com> wrote:
> Hi,
>
> I need to aggregate rows of a data.frame by computing the mean for rows
> with the same level on one factor variable; here is the sample code:
>
> x <- data.frame(rep(letters, 2), rnorm(52), rnorm(52), rnorm(52))
> aggregate(x, list(x[,1]), mean)
>
> Now my problem is that the actual data set is much bigger (120 rows and
> approximately 100,000 columns), and it takes very, very long (at some
> point I just stopped it).
>
> Is there anything that can be done to make the aggregate routine more
> efficient? Or is there a different approach that would work faster?
>
> Thanks for any suggestions!
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.
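[Editor's note] For readers who want to confirm that the split/colMeans trick really computes the same numbers as aggregate(), here is a small self-contained sketch; the helper name groupColMeans is mine, not from the thread:

```r
# Hypothetical helper wrapping the split + colMeans idea from the thread
groupColMeans <- function(m, g) {
  idx <- split(seq_len(nrow(m)), g)            # row indices per level
  sapply(idx, function(i) colMeans(m[i, , drop = FALSE]))
}

set.seed(1)
m <- matrix(rnorm(20), nrow = 10, ncol = 2,
            dimnames = list(NULL, c("col1", "col2")))
g <- rep(1:2, each = 5)

fast <- groupColMeans(m, g)                    # columns = levels, rows = variables
slow <- aggregate(as.data.frame(m), list(g), mean)

# same numbers, different layout: t(fast) vs slow without the grouping column
all.equal(unname(t(fast)), unname(as.matrix(slow[, -1])))
```

Note that the fast version returns variables in rows and levels in columns, the transpose of aggregate()'s layout.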
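[Editor's note] One more base-R option worth benchmarking, not raised in the thread: rowsum() computes grouped column sums in C, so group means fall out by dividing by the group sizes. A sketch under the thread's setup (smaller dimensions here for illustration):

```r
# Sketch: grouped column means via rowsum() (an alternative, not from the thread)
set.seed(42)
df.m   <- matrix(sample(120), nrow = 12, ncol = 10)
levels <- sample(3, 12, replace = TRUE)

sums   <- rowsum(df.m, levels)                 # per-level column sums, done in C
counts <- as.vector(table(levels))             # rows per level, same level order
means  <- sums / counts                        # counts recycle down each column

# agrees with the split/colMeans approach (transposed layout)
ref <- sapply(split(seq_len(nrow(df.m)), levels),
              function(i) colMeans(df.m[i, , drop = FALSE]))
all.equal(unname(means), unname(t(ref)))
```

Both rowsum() and table() order groups by sorted level, so the row-wise division lines up; this avoids materialising the per-level submatrices that split()/sapply() creates.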