Hi, May be you should try ?data.table().
Please use ?dput(). dat1<- read.table(text=" Area Sex Year y Bob F 2011 1 Bob F 2011 2 Bob F 2012 3 Bob M 2012 3 Bob M 2012 2 Fred F 2011 1 Fred F 2011 1 Fred F 2012 2 Fred M 2012 3 Fred M 2012 1 ",sep="",header=TRUE,stringsAsFactors=FALSE) library(data.table) dt2<-dt1[,sum(y),by=list(Area,Sex,Year)] dt2 # Area Sex Year V1 #1: Bob F 2011 3 #2: Bob F 2012 3 #3: Bob M 2012 5 #4: Fred F 2011 2 #5: Fred F 2012 2 #6: Fred M 2012 4 #Speed set.seed(28) dat2<- data.frame(Area=sample(LETTERS,1e7,replace=TRUE),Sex=sample(c("F","M"),1e7,replace=TRUE),Year=sample(2005:2012,1e7,replace=TRUE),y=sample(1:10,1e7,replace=TRUE)) system.time(datTest<- aggregate(y~.,data=dat2,sum)) # user system elapsed # 18.056 1.336 19.424 datTest2<- datTest[order(datTest$Area,datTest$Sex,datTest$Year),] row.names(datTest2)<- 1:nrow(datTest2) dtTest<- data.table(dat2) system.time({ setkey(dtTest,Area,Sex,Year) dtTest2<- dtTest[,sum(y),by=list(Area,Sex,Year)]}) # user system elapsed # 1.232 0.184 1.418 setnames(dtTest2,"V1","y") identical(datTest2,as.data.frame(dtTest2)) #[1] TRUE A.K. ----- Original Message ----- From: Michael Liaw <michael.l...@hotmail.com> To: r-help@r-project.org Cc: Sent: Saturday, August 3, 2013 8:11 PM Subject: [R] Group by a data frame with multiple columns Hi I'm trying to manipulate a data frame (that has about 10 million rows) rows by "grouping" it with multiple columns. For example, say the data set looks like: Area Sex Year y Bob F 2011 1 Bob F 2011 2 Bob F 2012 3 Bob M 2012 3 Bob M 2012 2 Fred F 2011 1 Fred F 2011 1 Fred F 2012 2 Fred M 2012 3 Fred M 2012 1 And I want it to look like Area Sex Year Sum of y Bob F 2011 3 Bob F 2012 3 Bob M 2012 5 Fred F 2011 2 Fred F 2012 2 Fred M 2012 4 I think I can use something like: tmp <- aggregate (y ~ ., sum) But due to the size it's really taking a strain on the computer (even with 64-bit R on a, yes unfortunately Windows, machine with 16GB RAM :(). The reason for me wanting the data set to get into this form is I want to then apply the population information and get the "rate" on the "sum of y" column then fit a Poisson regression model. I'm wondering (and would appreciate comments) whether there is a more efficient way to the process I described? Cheers Michael [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.