Thanks a lot, Tom. In fact, I am likely to have no NAs at all - only zeros! Dimitri
On Wed, Apr 7, 2010 at 4:54 PM, Tom Short <tshort.rli...@gmail.com> wrote: > Another way that Matthew Dowle showed me for this type of problem is > to reshape frame to a long format. It makes it easier to manipulate > and can be faster. > >> longdt <- with(frame, data.table(group = unlist(rep(group, each=7)), x = >> c(a,b,c,d,e,f,g))) >> >> system.time(new.frame4 <- longdt[, x/mean(x, na.rm = TRUE), by = "group"]) > user system elapsed > 0.54 0.04 0.61 >> >> # Or, remove the NAs ahead of time for more speed: >> >> longdt2 <- longdt[!is.na(longdt$x),] >> system.time(new.frame4 <- longdt2[, x/mean(x), by = "group"]) > user system elapsed > 0.17 0.00 0.17 > > - Tom > > On Wed, Apr 7, 2010 at 3:46 PM, Tom Short <tshort.rli...@gmail.com> wrote: >> Here's how I would have done the data.table method. It's a bit faster >> than the ave approach on my machine: >> >>> # install.packages("data.table",repos="http://R-Forge.R-project.org") >>> library(data.table) >>> >>> f3 <- function(frame) { >> + frame <- as.data.table(frame) >> + frame[, lapply(.SD[,2:ncol(.SD), with = FALSE], >> + function(x) x / mean(x, na.rm = TRUE)), >> + by = "group"] >> + } >>> >>> system.time(new.frame2 <- f2(frame)) # ave >> user system elapsed >> 0.50 0.08 1.24 >>> system.time(new.frame3 <- f3(frame)) # data.table >> user system elapsed >> 0.25 0.01 0.30 >> >> - Tom >> >> Tom Short >> >> >> On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski <ld7...@gmail.com> >> wrote: >>> I would like to thank once more everyone who helped me with this question. >>> I compared the speed for different approaches. Below are the results >>> of my comparisons - in case anyone is interested: >>> >>> ### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs: >>> N<-100000 >>> set.seed(1234) >>> frame<-data.frame(group=rep(paste("group",1:10),N/10),a=rnorm(1:N),b=rnorm(1:N),c=rnorm(1:N),d=rnorm(1:N),e=rnorm(1:N),f=rnorm(1:N),g=rnorm(1:N)) >>> frame<-frame[order(frame$group),] >>> >>> ## Introducing 60% NAs: >>> names.used<-names(frame)[2:length(frame)] >>> set.seed(1234) >>> for(i in names.used){ >>> i.for.NA<-sample(1:N,round((N*.6),0)) >>> frame[[i]][i.for.NA]<-NA >>> } >>> lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it >>> worked >>> ORIGframe<-frame ## placeholder for the unchanged original frame >>> >>> ####### Objective of the code - divide each value by its group mean #### >>> >>> ### METHOD 1 - the FASTEST - using ave():############################## >>> frame<-ORIGframe >>> f2 <- function(frame) { >>> for(i in 2:ncol(frame)) { >>> frame[,i] <- ave(frame[,i], frame[,1], >>> FUN=function(x)x/mean(x,na.rm=TRUE)) >>> } >>> frame >>> } >>> system.time({new.frame<-f2(frame)}) >>> # Took me 0.23-0.27 sec >>> ####################################### >>> >>> ### METHOD 2 - fast, just a bit slower - using data.table: >>> ############################## >>> >>> # If you don't have it - install the package - NOT from CRAN: >>> install.packages("data.table",repos="http://R-Forge.R-project.org") >>> library(data.table) >>> frame<-ORIGframe >>> system.time({ >>> table<-data.table(frame) >>> colMeanFunction<-function(data,key){ >>> data[[key]]=NULL >>> ret=as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T) >>> return(ret) >>> } >>> groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"] >>> names.to.use<-names(groupedMeans) >>> for(i in >>> 1:length(groupedMeans)){groupedMeans[[i]]<-as.data.frame(groupedMeans[[i]])} >>> groupedMeans<-do.call(cbind, groupedMeans) >>> names(groupedMeans)<-names.to.use >>> }) >>> # Took me 0.37-.45 sec >>> ####################################### >>> >>> ### METHOD 3 - fast, a tad slower (using model.matrix & matrix >>> multiplication):############################## >>> frame<-ORIGframe >>> system.time({ >>> mat <- as.matrix(frame[,-1]) >>> mm <- model.matrix(~0+group,frame) >>> col.grp.N <- crossprod( !is.na(mat), mm ) # Use this line if don't >>> want to use NAs for mean calculations >>> # col.grp.N <- crossprod( mat != 0 , mm ) # Use this line if don't >>> want to use zeros for mean calculations >>> mat[is.na(mat)] <- 0.0 >>> col.grp.sum <- crossprod( mat, mm ) >>> mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] ) >>> is.na(mat) <- is.na(frame[,-1]) >>> mat<-as.data.frame(mat) >>> }) >>> # Took me 0.44-0.50 sec >>> ####################################### >>> >>> ### METHOD 5- much slower - it's the one I started >>> with:############################## >>> frame<-ORIGframe >>> system.time({ >>> frame <- do.call(cbind, lapply(names.used, function(x){ >>> unlist(by(frame, frame$group, function(y) y[,x] / >>> mean(y[,x],na.rm=T))) >>> })) >>> }) >>> # Took me 1.25-1.32 min >>> ####################################### >>> >>> ### METHOD 6 - the slowest; using "plyr" and >>> "ddply":############################## >>> frame<-ORIGframe >>> library(plyr) >>> function3 <- function(x) x / mean(x, na.rm = TRUE) >>> system.time({ >>> grouping.factor<-"group" >>> myvariables<-names(frame)[2:8] >>> frame3<-ddply(frame, grouping.factor, colwise(function3, myvariables)) >>> }) >>> # Took me 1.36-1.47 min >>> ####################################### >>> >>> >>> Thanks again! >>> Dimitri >>> >>> >>> On Wed, Mar 31, 2010 at 8:29 PM, William Dunlap <wdun...@tibco.com> wrote: >>>> Dimitri, >>>> >>>> You might try applying ave() to each column. E.g., use >>>> >>>> f2 <- function(frame) { >>>> for(i in 2:ncol(frame)) { >>>> frame[,i] <- ave(frame[,i], frame[,1], >>>> FUN=function(x)x/mean(x,na.rm=TRUE)) >>>> } >>>> frame >>>> } >>>> >>>> Note that this returns a data.frame and retains the >>>> grouping column (the first) while your original >>>> code returns a matrix without the grouping column. >>>> >>>> Bill Dunlap >>>> Spotfire, TIBCO Software >>>> wdunlap tibco.com >>>> >>>>> -----Original Message----- >>>>> From: r-help-boun...@r-project.org >>>>> [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter >>>>> Sent: Tuesday, March 30, 2010 10:52 AM >>>>> To: 'Dimitri Liakhovitski'; 'r-help' >>>>> Subject: Re: [R] Code is too slow: mean-centering variables >>>>> in a data framebysubgroup >>>>> >>>>> ?scale >>>>> >>>>> Bert Gunter >>>>> Genentech Nonclinical Biostatistics >>>>> >>>>> >>>>> >>>>> -----Original Message----- >>>>> From: r-help-boun...@r-project.org >>>>> [mailto:r-help-boun...@r-project.org] On >>>>> Behalf Of Dimitri Liakhovitski >>>>> Sent: Tuesday, March 30, 2010 8:05 AM >>>>> To: r-help >>>>> Subject: [R] Code is too slow: mean-centering variables in a >>>>> data frame >>>>> bysubgroup >>>>> >>>>> Dear R-ers, >>>>> >>>>> I have a large data frame (several thousands of rows and about 2.5 >>>>> thousand columns). One variable ("group") is a grouping variable with >>>>> over 30 levels. And I have a lot of NAs. >>>>> For each variable, I need to divide each value by variable mean - by >>>>> subgroup. I have the code but it's way too slow - takes me about 1.5 >>>>> hours. >>>>> Below is a data example and my code that is too slow. Is there a >>>>> different, faster way of doing the same thing? >>>>> Thanks a lot for your advice! >>>>> >>>>> Dimitri >>>>> >>>>> >>>>> # Building an example frame - with groups and a lot of NAs: >>>>> set.seed(1234) >>>>> frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1: >>>> 100),b=rnorm(1 >>>>> :100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:1 >>>>> 00),g=rnorm(1: >>>>> 100)) >>>>> frame<-frame[order(frame$group),] >>>>> names.used<-names(frame)[2:length(frame)] >>>>> set.seed(1234) >>>>> for(i in names.used){ >>>>> i.for.NA<-sample(1:100,60) >>>>> frame[[i]][i.for.NA]<-NA >>>>> } >>>>> frame >>>>> >>>>> ### Code that does what's needed but is too slow: >>>>> Start<-Sys.time() >>>>> frame <- do.call(cbind, lapply(names.used, function(x){ >>>>> unlist(by(frame, frame$group, function(y) y[,x] / >>>>> mean(y[,x],na.rm=T))) >>>>> })) >>>>> Finish<-Sys.time() >>>>> print(Finish-Start) # Takes too long >>>>> >>>>> -- >>>>> Dimitri Liakhovitski >>>>> Ninah.com >>>>> dimitri.liakhovit...@ninah.com >>>>> >>>>> ______________________________________________ >>>>> R-help@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>> PLEASE do read the posting guide >>>>> http://www.R-project.org/posting-guide.html >>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>>>> ______________________________________________ >>>>> R-help@r-project.org mailing list >>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>> PLEASE do read the posting guide >>>>> http://www.R-project.org/posting-guide.html >>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>>> >>> >>> >>> >>> -- >>> Dimitri Liakhovitski >>> Ninah.com >>> dimitri.liakhovit...@ninah.com >>> >>> ______________________________________________ >>> R-help@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-help >>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >>> and provide commented, minimal, self-contained, reproducible code. >>> >> > -- Dimitri Liakhovitski Ninah.com dimitri.liakhovit...@ninah.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.