Here's how I would have done the data.table method. It's a bit faster than the ave approach on my machine:
> # install.packages("data.table",repos="http://R-Forge.R-project.org")
> library(data.table)
>
> f3 <- function(frame) {
+   frame <- as.data.table(frame)
+   frame[, lapply(.SD[,2:ncol(.SD), with = FALSE],
+                  function(x) x / mean(x, na.rm = TRUE)),
+         by = "group"]
+ }
>
> system.time(new.frame2 <- f2(frame)) # ave
   user  system elapsed
   0.50    0.08    1.24
> system.time(new.frame3 <- f3(frame)) # data.table
   user  system elapsed
   0.25    0.01    0.30

- Tom

Tom Short


On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski <ld7...@gmail.com> wrote:
> I would like to thank once more everyone who helped me with this question.
> I compared the speed for different approaches. Below are the results
> of my comparisons - in case anyone is interested:
>
> ### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs:
> N<-100000
> set.seed(1234)
> frame<-data.frame(group=rep(paste("group",1:10),N/10),a=rnorm(1:N),b=rnorm(1:N),c=rnorm(1:N),d=rnorm(1:N),e=rnorm(1:N),f=rnorm(1:N),g=rnorm(1:N))
> frame<-frame[order(frame$group),]
>
> ## Introducing 60% NAs:
> names.used<-names(frame)[2:length(frame)]
> set.seed(1234)
> for(i in names.used){
>   i.for.NA<-sample(1:N,round((N*.6),0))
>   frame[[i]][i.for.NA]<-NA
> }
> lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked
> ORIGframe<-frame ## placeholder for the unchanged original frame
>
> ####### Objective of the code - divide each value by its group mean ####
>
> ### METHOD 1 - the FASTEST - using ave():##############################
> frame<-ORIGframe
> f2 <- function(frame) {
>   for(i in 2:ncol(frame)) {
>     frame[,i] <- ave(frame[,i], frame[,1],
>                      FUN=function(x)x/mean(x,na.rm=TRUE))
>   }
>   frame
> }
> system.time({new.frame<-f2(frame)})
> # Took me 0.23-0.27 sec
> #######################################
>
> ### METHOD 2 - fast, just a bit slower - using data.table:##############################
>
> # If you don't have it - install the package - NOT from CRAN:
> install.packages("data.table",repos="http://R-Forge.R-project.org")
> library(data.table)
> frame<-ORIGframe
> system.time({
>   table<-data.table(frame)
>   colMeanFunction<-function(data,key){
>     data[[key]]=NULL
>     ret=as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T)
>     return(ret)
>   }
>   groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"]
>   names.to.use<-names(groupedMeans)
>   for(i in 1:length(groupedMeans)){groupedMeans[[i]]<-as.data.frame(groupedMeans[[i]])}
>   groupedMeans<-do.call(cbind, groupedMeans)
>   names(groupedMeans)<-names.to.use
> })
> # Took me 0.37-.45 sec
> #######################################
>
> ### METHOD 3 - fast, a tad slower (using model.matrix & matrix multiplication):##############################
> frame<-ORIGframe
> system.time({
>   mat <- as.matrix(frame[,-1])
>   mm <- model.matrix(~0+group,frame)
>   col.grp.N <- crossprod( !is.na(mat), mm ) # Use this line if don't want to use NAs for mean calculations
>   # col.grp.N <- crossprod( mat != 0 , mm ) # Use this line if don't want to use zeros for mean calculations
>   mat[is.na(mat)] <- 0.0
>   col.grp.sum <- crossprod( mat, mm )
>   mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
>   is.na(mat) <- is.na(frame[,-1])
>   mat<-as.data.frame(mat)
> })
> # Took me 0.44-0.50 sec
> #######################################
>
> ### METHOD 5 - much slower - it's the one I started with:##############################
> frame<-ORIGframe
> system.time({
>   frame <- do.call(cbind, lapply(names.used, function(x){
>     unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
>   }))
> })
> # Took me 1.25-1.32 min
> #######################################
>
> ### METHOD 6 - the slowest; using "plyr" and "ddply":##############################
> frame<-ORIGframe
> library(plyr)
> function3 <- function(x) x / mean(x, na.rm = TRUE)
> system.time({
>   grouping.factor<-"group"
>   myvariables<-names(frame)[2:8]
>   frame3<-ddply(frame, grouping.factor, colwise(function3, myvariables))
> })
> # Took me 1.36-1.47 min
> #######################################
>
>
> Thanks again!
> Dimitri
>
>
> On Wed, Mar 31, 2010 at 8:29 PM, William Dunlap <wdun...@tibco.com> wrote:
>> Dimitri,
>>
>> You might try applying ave() to each column. E.g., use
>>
>> f2 <- function(frame) {
>>   for(i in 2:ncol(frame)) {
>>     frame[,i] <- ave(frame[,i], frame[,1],
>>                      FUN=function(x)x/mean(x,na.rm=TRUE))
>>   }
>>   frame
>> }
>>
>> Note that this returns a data.frame and retains the
>> grouping column (the first) while your original
>> code returns a matrix without the grouping column.
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: r-help-boun...@r-project.org
>>> [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter
>>> Sent: Tuesday, March 30, 2010 10:52 AM
>>> To: 'Dimitri Liakhovitski'; 'r-help'
>>> Subject: Re: [R] Code is too slow: mean-centering variables
>>> in a data frame by subgroup
>>>
>>> ?scale
>>>
>>> Bert Gunter
>>> Genentech Nonclinical Biostatistics
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: r-help-boun...@r-project.org
>>> [mailto:r-help-boun...@r-project.org] On
>>> Behalf Of Dimitri Liakhovitski
>>> Sent: Tuesday, March 30, 2010 8:05 AM
>>> To: r-help
>>> Subject: [R] Code is too slow: mean-centering variables in a
>>> data frame by subgroup
>>>
>>> Dear R-ers,
>>>
>>> I have a large data frame (several thousands of rows and about 2.5
>>> thousand columns). One variable ("group") is a grouping variable with
>>> over 30 levels. And I have a lot of NAs.
>>> For each variable, I need to divide each value by variable mean - by
>>> subgroup. I have the code but it's way too slow - takes me about 1.5
>>> hours.
>>> Below is a data example and my code that is too slow. Is there a
>>> different, faster way of doing the same thing?
>>> Thanks a lot for your advice!
>>>
>>> Dimitri
>>>
>>>
>>> # Building an example frame - with groups and a lot of NAs:
>>> set.seed(1234)
>>> frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
>>> frame<-frame[order(frame$group),]
>>> names.used<-names(frame)[2:length(frame)]
>>> set.seed(1234)
>>> for(i in names.used){
>>>   i.for.NA<-sample(1:100,60)
>>>   frame[[i]][i.for.NA]<-NA
>>> }
>>> frame
>>>
>>> ### Code that does what's needed but is too slow:
>>> Start<-Sys.time()
>>> frame <- do.call(cbind, lapply(names.used, function(x){
>>>   unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
>>> }))
>>> Finish<-Sys.time()
>>> print(Finish-Start) # Takes too long
>>>
>>> --
>>> Dimitri Liakhovitski
>>> Ninah.com
>>> dimitri.liakhovit...@ninah.com
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
>
>
> --
> Dimitri Liakhovitski
> Ninah.com
> dimitri.liakhovit...@ninah.com
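For anyone who wants to convince themselves that the ave() approach really does "divide each value by its group mean" while leaving the NAs alone, here is a minimal base-R sketch on a tiny frame. The names toy and scale_by_group_mean are made up for illustration; the loop body is the same idea as f2() quoted above, not a new method. After scaling, each group's mean (ignoring NAs) should come out as 1.

```r
# Toy frame: grouping column first, numeric columns after it, a few NAs.
# (toy and scale_by_group_mean are hypothetical names for this sketch.)
toy <- data.frame(group = rep(c("g1", "g2"), each = 3),
                  a = c(1, 2, 3, 10, NA, 30),
                  b = c(2, NA, 4, 5, 5, 5))

# Same idea as f2() in the thread: ave() applies FUN within each group
# and returns a vector aligned with the original row order.
scale_by_group_mean <- function(frame) {
  for (i in 2:ncol(frame)) {
    frame[, i] <- ave(frame[, i], frame[, 1],
                      FUN = function(x) x / mean(x, na.rm = TRUE))
  }
  frame
}

scaled <- scale_by_group_mean(toy)

# Sanity check: per-group means of the scaled columns, ignoring NAs.
group_means <- aggregate(scaled[, -1], by = list(group = scaled$group),
                         FUN = mean, na.rm = TRUE)
print(scaled)
print(group_means)
```

The same check (aggregate() on the result, compare against 1 with all.equal()) can be run on the output of any of the methods timed above to confirm they agree before comparing speeds.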