On Fri, 2005-09-30 at 13:22 -0400, Duncan Murdoch wrote: > I want to calculate a statistic on a number of subgroups of a dataframe, > then put the results into a dataframe. (What SAS PROC MEANS does, I > think, though it's been years since I used it.) > > This is possible using by(), but it seems cumbersome and fragile. Is > there a more straightforward way than this? > > Here's a simple example showing my current strategy: > > > dataset <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, > c(2,2,2,2)), value = rnorm(8)) > > dataset > gp1 gp2 value > 1 1 1 0.9493232 > 2 1 1 -0.0474712 > 3 1 2 -0.6808021 > 4 1 2 1.9894999 > 5 2 3 2.0154786 > 6 2 3 0.4333056 > 7 2 4 -0.4746228 > 8 2 4 0.6017522 > > > > handleonegroup <- function(subset) data.frame(gp1 = subset$gp1[1], > + gp2 = subset$gp2[1], statistic = mean(subset$value)) > > > > bylist <- by(dataset, list(dataset$gp1, dataset$gp2), handleonegroup) > > > > result <- do.call('rbind', bylist) > > result > gp1 gp2 statistic > 1 1 1 0.45092598 > 11 1 2 0.65434890 > 12 2 3 1.22439210 > 13 2 4 0.06356469 > > tapply() is inappropriate because I don't have all possible combinations > of gp1 and gp2 values, only some of them: > > > tapply(dataset$value, list(dataset$gp1, dataset$gp2), mean) > 1 2 3 4 > 1 0.450926 0.6543489 NA NA > 2 NA NA 1.224392 0.06356469 > > > > In the real case, I only have a very sparse subset of all the > combinations, and tapply() and by() both die for lack of memory. > > Any suggestions on how to do what I want, without using SAS? > > Duncan Murdoch
Duncan, Does this do what you want? > set.seed(1) > df <- data.frame(gp1 = rep(1:2, c(4,4)), gp2 = rep(1:4, c(2,2,2,2)), value = rnorm(8)) > df gp1 gp2 value 1 1 1 -0.6264538 2 1 1 0.1836433 3 1 2 -0.8356286 4 1 2 1.5952808 5 2 3 0.3295078 6 2 3 -0.8204684 7 2 4 0.4874291 8 2 4 0.7383247 > means <- aggregate(df$value, list(gp1 = df$gp1, gp2 = df$gp2), mean) > means gp1 gp2 x 1 1 1 -0.2214052 2 1 2 0.3798261 3 2 3 -0.2454803 4 2 4 0.6128769 > merge(df, means, by = c("gp1", "gp2")) gp1 gp2 value x 1 1 1 -0.6264538 -0.2214052 2 1 1 0.1836433 -0.2214052 3 1 2 -0.8356286 0.3798261 4 1 2 1.5952808 0.3798261 5 2 3 0.3295078 -0.2454803 6 2 3 -0.8204684 -0.2454803 7 2 4 0.4874291 0.6128769 8 2 4 0.7383247 0.6128769 HTH, Marc Schwartz ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel