Re: [R] Code is too slow: mean-centering variables in a data framebysubgroup

Charles C. Berry Thu, 08 Apr 2010 14:36:20 -0700

On Thu, 8 Apr 2010, Dimitri Liakhovitski wrote:

Dear everyone,
I was not sure if I should start a new topic - but the task is the
same, so I am staying within the original one.
Originally, I stated that my data frame has a lot of NAs. Now I am
discovering - it's having a lot of zeros (rather than NAs) - and they
should be ignored when the subgroup means are built.
Therefore, I have to first translate all zeros into NAs, then run my
mean-centering code, then translate all NAs back into zeros. Because I
am dealing with thousands or rows and columns it annihilates all the
speed advantages of the fast code that uses ave().
I was wondering if it's possible to modify the mean-centering code
that works for a frame with NAs to the situation when there are no NAs
but there are a lot of zeros. I am not sure how to make it ignore the
zeros. The codes are below:
Thank you very much!
Dimitri


# Building an example frame - with groups and a lot of zeros (rather than NAs):
set.seed(1234)
frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:100),b=rnorm(1:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:100),g=rnorm(1:100))
frame<-frame[order(frame$group),]
names.used<-names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
     i.for.zeros<-sample(1:100,60)
     frame[[i]][i.for.zeros]<-0
}
frame

# Mean Centering code - writte for a situation when frame has NAs:
f2 <- function(frame) {
 for(i in 2:ncol(frame)) {
    frame[,i] <- ave(frame[,i], frame[,1], FUN=function(x)x/mean(x,na.rm=TRUE))
 }
 frame
}


Rather

f2 <- function(frame) {
  for(i in 2:ncol(frame)) {
     frame[,i] <- ave(frame[,i], frame[,1],
     FUN=function(x) x/(mean(x)/mean(x!=0)))
  }
  frame
}

HTH,

Chuck

new.frame<-f2(frame)




On Wed, Apr 7, 2010 at 4:54 PM, Tom Short <tshort.rli...@gmail.com> wrote:

Another way that Matthew Dowle showed me for this type of problem is
to reshape frame to a long format. It makes it easier to manipulate
and can be faster.

longdt <- with(frame, data.table(group = unlist(rep(group, each=7)), x = 
c(a,b,c,d,e,f,g)))

system.time(new.frame4 <- longdt[, x/mean(x, na.rm = TRUE), by = "group"])

  user  system elapsed
  0.54    0.04    0.61


# Or, remove the NAs ahead of time for more speed:

longdt2 <- longdt[!is.na(longdt$x),]
system.time(new.frame4 <- longdt2[, x/mean(x), by = "group"])

  user  system elapsed
  0.17    0.00    0.17

- Tom

On Wed, Apr 7, 2010 at 3:46 PM, Tom Short <tshort.rli...@gmail.com> wrote:

Here's how I would have done the data.table method. It's a bit faster
than the ave approach on my machine:

# install.packages("data.table",repos="http://R-Forge.R-project.org";)
library(data.table)

f3 <- function(frame) {

+   frame <- as.data.table(frame)
+   frame[, lapply(.SD[,2:ncol(.SD), with = FALSE],
+                  function(x) x / mean(x, na.rm = TRUE)),
+         by = "group"]
+ }


system.time(new.frame2 <- f2(frame)) # ave

  user  system elapsed
  0.50    0.08    1.24

system.time(new.frame3 <- f3(frame)) # data.table

  user  system elapsed
  0.25    0.01    0.30

- Tom

Tom Short


On Wed, Apr 7, 2010 at 12:46 PM, Dimitri Liakhovitski <ld7...@gmail.com> wrote:

I would like to thank once more everyone who helped me with this question.
I compared the speed for different approaches. Below are the results
of my comparisons - in case anyone is interested:

### Building an EXAMPLE FRAME with N rows - with groups and a lot of NAs:
N<-100000
set.seed(1234)
frame<-data.frame(group=rep(paste("group",1:10),N/10),a=rnorm(1:N),b=rnorm(1:N),c=rnorm(1:N),d=rnorm(1:N),e=rnorm(1:N),f=rnorm(1:N),g=rnorm(1:N))
frame<-frame[order(frame$group),]

## Introducing 60% NAs:
names.used<-names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
     i.for.NA<-sample(1:N,round((N*.6),0))
     frame[[i]][i.for.NA]<-NA
}
lapply(frame[2:8], function(x) length(x[is.na(x)])) # Checking that it worked
ORIGframe<-frame ## placeholder for the unchanged original frame

####### Objective of the code - divide each value by its group mean ####

### METHOD 1 - the FASTEST - using ave():##############################
frame<-ORIGframe
f2 <- function(frame) {
 for(i in 2:ncol(frame)) {
    frame[,i] <- ave(frame[,i], frame[,1], FUN=function(x)x/mean(x,na.rm=TRUE))
 }
 frame
}
system.time({new.frame<-f2(frame)})
# Took me 0.23-0.27 sec
#######################################

### METHOD 2 - fast, just a bit slower - using data.table:
##############################

# If you don't have it - install the package - NOT from CRAN:
install.packages("data.table",repos="http://R-Forge.R-project.org";)
library(data.table)
frame<-ORIGframe
system.time({
table<-data.table(frame)
colMeanFunction<-function(data,key){
 data[[key]]=NULL
 ret=as.matrix(data)/matrix(rep(as.numeric(colMeans(as.data.frame(data),na.rm=T)),nrow(data)),nrow=nrow(data),ncol=ncol(data),byrow=T)
 return(ret)
}
groupedMeans = table[,colMeanFunction(.SD, "group"), by="group"]
names.to.use<-names(groupedMeans)
for(i in 
1:length(groupedMeans)){groupedMeans[[i]]<-as.data.frame(groupedMeans[[i]])}
groupedMeans<-do.call(cbind, groupedMeans)
names(groupedMeans)<-names.to.use
})
# Took me 0.37-.45 sec
#######################################

### METHOD 3 - fast, a tad slower (using model.matrix & matrix
multiplication):##############################
frame<-ORIGframe
system.time({
mat <- as.matrix(frame[,-1])
mm <- model.matrix(~0+group,frame)
col.grp.N <- crossprod( !is.na(mat), mm ) # Use this line if don't
want to use NAs for mean calculations
# col.grp.N <- crossprod( mat != 0 , mm ) # Use this line if don't
want to use zeros for mean calculations
mat[is.na(mat)] <- 0.0
col.grp.sum <- crossprod( mat, mm )
mat <- mat / ( t(col.grp.sum/col.grp.N)[ frame$group,] )
is.na(mat) <- is.na(frame[,-1])
mat<-as.data.frame(mat)
})
# Took me 0.44-0.50 sec
#######################################

### METHOD 5-  much slower - it's the one I started
with:##############################
frame<-ORIGframe
system.time({
frame <- do.call(cbind, lapply(names.used, function(x){
       unlist(by(frame, frame$group, function(y) y[,x] / mean(y[,x],na.rm=T)))
       }))
})
# Took me 1.25-1.32 min
#######################################

### METHOD 6 -  the slowest; using "plyr" and
"ddply":##############################
frame<-ORIGframe
library(plyr)
function3 <- function(x) x / mean(x, na.rm = TRUE)
system.time({
grouping.factor<-"group"
myvariables<-names(frame)[2:8]
frame3<-ddply(frame, grouping.factor, colwise(function3, myvariables))
})
# Took me 1.36-1.47 min
#######################################


Thanks again!
Dimitri


On Wed, Mar 31, 2010 at 8:29 PM, William Dunlap <wdun...@tibco.com> wrote:

Dimitri,

You might try applying ave() to each column.  E.g., use

f2 <- function(frame) {
  for(i in 2:ncol(frame)) {
     frame[,i] <- ave(frame[,i], frame[,1],
FUN=function(x)x/mean(x,na.rm=TRUE))
  }
  frame
}

Note that this returns a data.frame and retains the
grouping column (the first) while your original
code returns a matrix without the grouping column.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter
Sent: Tuesday, March 30, 2010 10:52 AM
To: 'Dimitri Liakhovitski'; 'r-help'
Subject: Re: [R] Code is too slow: mean-centering variables
in a data framebysubgroup

?scale

Bert Gunter
Genentech Nonclinical Biostatistics

-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On
Behalf Of Dimitri Liakhovitski
Sent: Tuesday, March 30, 2010 8:05 AM
To: r-help
Subject: [R] Code is too slow: mean-centering variables in a
data frame
bysubgroup

Dear R-ers,

I have  a large data frame (several thousands of rows and about 2.5
thousand columns). One variable ("group") is a grouping variable with
over 30 levels. And I have a lot of NAs.
For each variable, I need to divide each value by variable mean - by
subgroup. I have the code but it's way too slow - takes me about 1.5
hours.
Below is a data example and my code that is too slow. Is there a
different, faster way of doing the same thing?
Thanks a lot for your advice!

Dimitri

# Building an example frame - with groups and a lot of NAs:
set.seed(1234)
frame<-data.frame(group=rep(paste("group",1:10),10),a=rnorm(1:

100),b=rnorm(1

:100),c=rnorm(1:100),d=rnorm(1:100),e=rnorm(1:100),f=rnorm(1:1
00),g=rnorm(1:
100))
frame<-frame[order(frame$group),]
names.used<-names(frame)[2:length(frame)]
set.seed(1234)
for(i in names.used){
       i.for.NA<-sample(1:100,60)
       frame[[i]][i.for.NA]<-NA
}
frame

### Code that does what's needed but is too slow:
Start<-Sys.time()
frame <- do.call(cbind, lapply(names.used, function(x){
  unlist(by(frame, frame$group, function(y) y[,x] /
mean(y[,x],na.rm=T)))
}))
Finish<-Sys.time()
print(Finish-Start) # Takes too long

--
Dimitri Liakhovitski
Ninah.com
dimitri.liakhovit...@ninah.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Dimitri Liakhovitski
Ninah.com
dimitri.liakhovit...@ninah.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.




--
Dimitri Liakhovitski
Ninah.com
dimitri.liakhovit...@ninah.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Code is too slow: mean-centering variables in a data framebysubgroup

Reply via email to