Thank you all - I appreciate your suggestions. My post was mainly to check whether a fast implementation already exists out there. I've ended up adopting Martin's first proposal. Something like this:
#include <Rdefines.h>
#include <R.h>
#include <R_ext/Error.h>

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {
  int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1;
  double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);
  SEXP ans = PROTECT(NEW_NUMERIC(nb));
  double *ansp = REAL(ans);
  int retcount = LOGICAL(retCount)[0];
  SEXP count = NULL;
  int *countp = NULL;
  int i = 0, j = 0, n = 0, iStart = 0;
  double sum = 0.0;

  // Assert same lengths of 'x' and 'y'
  if (ny != nx) {
    error("Arguments 'y' and 'x' are of different lengths: %d != %d", ny, nx);
  }

  if (retcount) {
    count = PROTECT(NEW_INTEGER(nb));
    countp = INTEGER(count);
  }

  // Skip to the first bin
  while (xp[iStart] < bxp[0]) {
    ++iStart;
  }

  // For each x...
  for (i = iStart; i < nx; ++i) {
    // Skip to a new bin?
    while (xp[i] >= bxp[j+1]) {
      // Update statistic of current bin?
      if (retcount) {
        countp[j] = n;
      }
      ansp[j] = n > 0 ? sum / n : 0;
      sum = 0.0;
      n = 0;
      // ...and move to next
      ++j;
    }
    sum += yp[i];
    n += 1;
  }

  // Update statistic of last bin
  ansp[j] = n > 0 ? sum / n : 0;
  if (retcount) {
    countp[j] = n;
    setAttrib(ans, install("count"), count);
    UNPROTECT(1); // 'count'
  }

  UNPROTECT(1); // 'ans'

  return ans;
} // binMeans()
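For completeness, a minimal sketch of how this might be compiled and called from R; the file name binMeans.c, the example data, and the bin grid below are illustrative assumptions, not from the thread:

## In a shell: R CMD SHLIB binMeans.c   (assumed file name)
dyn.load("binMeans.so")    # ".dll" rather than ".so" on Windows

## Illustrative inputs: sorted 'x', values 'y', nb+1 ordered bin boundaries
x <- sort(runif(1e6))
y <- rnorm(1e6)
bx <- seq(0, 1, length.out = 1001)    # 1000 bins

z <- .Call("binMeans", y, x, bx, TRUE)
head(z)                    # binned means of 'y'
head(attr(z, "count"))     # per-bin counts, attached since retCount = TRUE

Note that the arguments must already have the types the C code expects (doubles for 'y', 'x', and 'bx'; a logical for 'retCount'), since .Call() does no coercion.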
BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6) and not a scalar (10^10), but the misinterpretation got Martin to point out some of the exciting work extending R core to support "very large" integers. And, as Herve pointed out, I forgot to say that the number of bins could be in that range as well, or maybe an order of magnitude less (say ~1-100 data points per bin).

/Henrik

On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
> On 10/3/2012 6:47 AM, Martin Morgan wrote:
>>
>> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
>>>
>>> Hi,
>>>
>>> I'm looking for a super-duper fast mean/sum binning implementation
>>> available in R, and before implementing z = binnedMeans(x, y) in native
>>> code myself, does anyone know of an existing function/package for
>>> this? I'm sure it already exists. So, given data (x,y) and B bins
>>> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
>>> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
>>> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), ..., z[B]. Let's
>>> assume there are no missing values and that 'x' and 'bx' are already
>>> ordered. The length of 'x' is on the order of 10,000-millions. The
>>> number of elements in each bin varies.
>>
>> since x and bx are ordered (sorting x would be expensive), the C code
>> seems pretty straightforward and memory-efficient -- create a result
>> vector as long as bx, then iterate through x, accumulating n and its
>> sum until x[i] >= bx[j]. (I think R's implementation of mean() actually
>> pays more attention to numerical error, but with this much data...)
>>
>> library(inline)
>> binmean <- cfunction(signature(x="numeric", bx="numeric"), "
>>     int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
>
> I'll take my solution back. The problem specification says that x has
> 10,000-millions of elements, so we need to use R-devel and
>
>     R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;
>
> but there are two further issues. The first is that on my system
>
>     p$ gcc --version
>     gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
>
> I have __SIZEOF_SIZE_T__ 8 but
>
> (a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I
> end up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of
> using the inline package, to avoid that level of uncertainty) and then
>
>>     SEXP ans = PROTECT(NEW_NUMERIC(nb));
>>     double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
>>     sum = j = n = 0;
>>     for (i = 0; i < nx; ++i) {
>
> (b) because nx is a signed type, and since nx > .Machine$integer.max is
> represented as a negative number, I don't ever iterate this loop. So I'd
> have to be more clever if I wanted this to work on all platforms.
>
> For what it's worth, Herve's solution is also problematic:
>
>> xx = findInterval(bx, x)
> Error in findInterval(bx, x) : long vector 'vec' is not supported
>
> A different strategy for the problem at hand would seem to involve
> iteration over sequences of x, collecting sufficient statistics (n, sum)
> for each iteration, and calculating the mean at the end of the day. This
> might also result in better memory use and allow parallel processing.
>
> Martin
>
>>         while (xp[i] >= bxp[j]) {
>>             ansp[j++] = n > 0 ? sum / n : 0;
>>             sum = n = 0;
>>         }
>>         n += 1;
>>         sum += xp[i];
>>     }
>>     ansp[j] = n > 0 ? sum / n : 0;
>>     UNPROTECT(1);
>>     return ans;
>> ")
>>
>> with a test case
>>
>> nx <- 4e7
>> nb <- 1e3
>> x <- sort(runif(nx))
>> bx <- do.call(seq, c(as.list(range(x)), length.out=nb))
>>
>> leading to
>>
>> > bx1 <- c(bx[-1], bx[nb] + 1)
>> > system.time(res <- binmean(x, bx1))
>>    user  system elapsed
>>   0.052   0.000   0.050
>>
>> Martin
>>
>>> Thanks,
>>>
>>> Henrik
>
> --
> Dr. Martin Morgan, PhD
>
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
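As an illustration of the chunked strategy Martin describes above, a minimal plain-R sketch; the function name, chunk size, and per-chunk use of findInterval() are illustrative assumptions, not from the thread:

## Sketch: binned means by iterating over chunks of 'x'/'y',
## accumulating per-bin counts and sums, and dividing at the end.
## Assumes 'x' and 'bx' are sorted; names and chunk size are illustrative.
binMeansChunked <- function(y, x, bx, chunkSize = 1e6) {
  nb <- length(bx) - 1L
  cnt <- numeric(nb)   # per-bin counts (double, safe for huge totals)
  sms <- numeric(nb)   # per-bin sums
  nx <- length(x)
  from <- 1
  while (from <= nx) {
    to <- min(from + chunkSize - 1, nx)
    ## Bin index j such that bx[j] <= x < bx[j+1]; 0 or nb+1 if outside
    j <- findInterval(x[from:to], bx)
    keep <- j >= 1L & j <= nb
    jk <- j[keep]
    yk <- y[from:to][keep]
    cnt <- cnt + tabulate(jk, nbins = nb)
    if (length(jk) > 0) {
      sm <- rowsum(yk, jk)              # chunk sums for the bins that occur
      b <- as.integer(rownames(sm))
      sms[b] <- sms[b] + sm[, 1]
    }
    from <- to + 1
  }
  ifelse(cnt > 0, sms / cnt, 0)         # empty bins yield 0, as in the C code
}

Because each chunk is short and the short 'bx' is passed as the 'vec' argument, this sidesteps the long-vector limitation that findInterval(bx, x) runs into, and the same per-bin (n, sum) accumulation would carry over directly to a compiled or parallel version.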