Kevin, starting with your idea of sorting first, you can get some speedups just using R. Start by comparing the base case that Bill used:
> x <- runif(2e6)
> i <- rep(1:1e6, 2)
> unix.time(res0 <- unlist(lapply(split(x, i), sum)))
[1] 11.00  0.16 11.28    NA    NA

Now try sorting and using a loop. (Note that f1 allocates its result vector itself, rather than relying on a `res` in the global environment.)

> idx <- order(i)
> xs <- x[idx]
> is <- i[idx]
> idx <- which(diff(is) > 0)
> startidx <- c(1, idx + 1)
> endidx <- c(idx, length(xs))
> f1 <- function(x, startidx, endidx, FUN = sum) {
+     res <- numeric(length(startidx))
+     for (j in 1:length(res)) {
+         res[j] <- FUN(x[startidx[j]:endidx[j]])
+     }
+     res
+ }
> unix.time(res1 <- f1(xs, startidx, endidx))
[1] 6.86 0.00 7.04   NA   NA

For the case of sums (or averages), you can vectorize this using cumsum, as follows. This won't work for order statistics like median or max:

> f2 <- function(x, startidx, endidx) {
+     cum <- cumsum(x)
+     res <- cum[endidx]
+     res[2:length(res)] <- res[2:length(res)] - cum[endidx[1:(length(res) - 1)]]
+     res
+ }
> unix.time(res2 <- f2(xs, startidx, endidx))
[1] 0.20 0.00 0.21   NA   NA

You can also use Luke Tierney's byte compiler (http://www.stat.uiowa.edu/~luke/R/compiler/) to speed up the loop for functions that you can't vectorize:

> library(compiler)
> f3 <- cmpfun(f1)
Note: local functions used: FUN
> unix.time(res3 <- f3(xs, startidx, endidx))
[1] 3.84 0.00 3.91   NA   NA

- Tom

--
View this message in context: http://www.nabble.com/Any-interest-in-%22merge%22-and-%22by%22-implementations-specifically-for-sorted-data--tf2009595.html#a5562816
Sent from the R devel forum at Nabble.com.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
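
P.S. In case the cumsum trick above is not obvious: because the groups are contiguous after sorting, endidx[j-1] is the position just before group j starts, so the sum of group j is cum[endidx[j]] - cum[endidx[j-1]] (with cum[endidx[0]] taken as 0). A toy check, with data made up purely for illustration:

> x <- c(1, 2, 3, 4, 5, 6)              # two groups: positions 1:3 and 4:6
> endidx <- c(3, 6)
> cum <- cumsum(x)                       # 1 3 6 10 15 21
> cum[endidx] - c(0, cum[endidx[1]])     # prefix sums at group ends, differenced
[1]  6 15

The two results match sum(1:3) == 6 and sum(4:6) == 15, which is exactly what f2 computes, just written with the first group handled by subtracting 0.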