Re: [Rd] Fastest non-overlapping binning mean function out there?
Thank you all - I appreciate all your suggestions. My post was mainly to check if there's already an existing and fast implementation out there. I've ended up adopting Martin's first proposal. Something like this: #include #include #include SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) { int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1; double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx); SEXP ans = PROTECT(NEW_NUMERIC(nb)); double *ansp = REAL(ans); int retcount = LOGICAL(retCount)[0]; SEXP count = NULL; int *countp = NULL; int i = 0, j = 0, n = 0, iStart=0; double sum = 0.0; // Assert same lengths of 'x' and 'y' if (ny != nx) { error("Argument 'y' and 'x' are of different lengths: %d != %d", ny, nx); } if (retcount) { count = PROTECT(NEW_INTEGER(nb)); countp = INTEGER(count); } // Skip to the first bin while (xp[iStart] < bxp[0]) { ++iStart; } // For each x... for (i = iStart; i < nx; ++i) { // Skip to a new bin? while (xp[i] >= bxp[j+1]) { // Update statistic of current bin? if (retcount) { countp[j] = n; } ansp[j] = n > 0 ? sum / n : 0; sum = 0.0; n = 0; // ...and move to next ++j; } sum += yp[i]; n += 1; } // Update statistic of last bin ansp[j] = n > 0 ? sum / n : 0; if (retcount) { countp[j] = n; setAttrib(ans, install("count"), count); UNPROTECT(1); // 'count' } UNPROTECT(1); // 'ans' return ans; } // binMeans() BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6) and not a scalar (10^10), but the misinterpretation got Martin to point out some of the exciting work extending R core to support "very large" integers. And, as Herve pointed out, I forgot to say that the number of bins could be of that range as well, or maybe an order of magnitude less (say ~1-100 data points per bin). /Henrik On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan wrote: > On 10/3/2012 6:47 AM, Martin Morgan wrote: >> >> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote: >>> >>> Hi, >>> >>> I'm looking for a super-duper fast mean/sum binning implementation >>> available in R, and before implementing z = binnedMeans(x y) in native >>> code myself, does any one know of an existing function/package for >>> this? I'm sure it already exists. So, given data (x,y) and B bins >>> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the >>> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x < >>> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]), z[B]. Let's >>> assume there are no missing values and 'x' and 'bx' is already >>> ordered. The length of 'x' is in the order of 10,000-millions. The >>> number of elements in each bin vary. >> >> >> since x and bx are ordered (sorting x would be expensive), the C code >> seems >> pretty straight-forward and memory-efficient -- create a result vector as >> long >> as bx, then iterate through x accumulating n and it's sum until x[i] >= >> bx[i]. >> (I think R's implementation of mean() actually pays more attention to >> numerical >> error, but with this much data...) >> >> library(inline) >> binmean <- cfunction(signature(x="numeric", bx="numeric"), >> " int nx = Rf_length(x), nb = Rf_length(bx), i, j, n; > > > I'll take my solution back. The problem specification says that x has > 10,000-millions of elements, so we need to use R-devel and > > R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n; > > but there are two further issues. The first is that on my system > > p$ gcc --version > gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3 > > I have __SIZEOF_SIZE_T__ 8 but > > (a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I > end up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using > the inline package, to avoid that level of uncertainty) and then > > >> SEXP ans = PROTECT(NEW_NUMERIC(nb)); >> double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans); >> sum = j = n = 0; >> for (i = 0; i < nx; ++i) { > > > (b) because nx is a signed type, and since nx > .Machine$integer.max is > represented as a negative number, I don't ever iterate this loop. So I'd > have to be more clever if I wanted this to work on all platforms. > > For what it's worth, Herve's solution is also problematic > >> xx = findInterval(bx, x) > Error in findInterval(bx, x) : long vector 'vec' is not supported > > A different strategy for the problem at hand would seem to involve iteration > over sequences of x, collecting sufficient statistics (n, sum) for each > iteration, and calculating the mean at the end of the day. This might also > result in better memory use and allow parallel processing. > > Martin > > >> while (xp[i] >= bxp[j]) { >> ansp[j++] = n > 0 ? sum / n : 0; >> sum = n = 0; >> } >> n += 1; >> sum += xp[i]; >> } >> ansp[j] = n > 0 ? sum / n : 0; >>
[Rd] identical() fails to compare isS4() to TRUE
> setClass("A", "integer") > isS4(new("A")) [1] TRUE > identical(isS4(new("A")), TRUE) [1] FALSE > sessionInfo() R Under development (unstable) (2012-10-04 r60876) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical() fails to compare isS4() to TRUE
On 12-10-04 4:57 PM, Martin Morgan wrote: > setClass("A", "integer") > isS4(new("A")) [1] TRUE > identical(isS4(new("A")), TRUE) [1] FALSE > sessionInfo() R Under development (unstable) (2012-10-04 r60876) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base I can confirm this in R-devel, but it doesn't happen in R-patched. In R-devel, isS4 is returning a vector marked as LGLSXP, but containing the value 16, not the usual 1 for TRUE. > .Internal(inspect(isS4(new("A" @4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16 I'm not going to have time to track this down and fix it. Duncan Murdoch __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical() fails to compare isS4() to TRUE
On 10/04/2012 02:23 PM, Duncan Murdoch wrote: On 12-10-04 4:57 PM, Martin Morgan wrote: > setClass("A", "integer") > isS4(new("A")) [1] TRUE > identical(isS4(new("A")), TRUE) [1] FALSE > sessionInfo() R Under development (unstable) (2012-10-04 r60876) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base I can confirm this in R-devel, but it doesn't happen in R-patched. In R-devel, isS4 is returning a vector marked as LGLSXP, but containing the value 16, not the usual 1 for TRUE. > .Internal(inspect(isS4(new("A" @4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16 I'm not going to have time to track this down and fix it. The bit is set at src/include/Rinternals.h:278 39103jmc #define IS_S4_OBJECT(x) ((x)->sxpinfo.gp & S4_OBJECT_MASK) but it's identical that has changed; it looks as though there are other similar operations in that header and probably elsewhere. Martin Duncan Murdoch -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to build a list with missing values? What is missing, anyway?
On Wed, Oct 3, 2012 at 11:21 PM, Josh O'Brien wrote: > >>Say I have argnames <- c("a", "b", "c"). > >From that I want to construct the equivalent of alist(a=, b=, c=). > > Here's a one liner that'll do that for you: > > argnames <- letters[1:3] > setNames(rep(list(bquote()), length(argnames)), argnames) Thanks. Just so I have my mental model correct, I'm gathering that missing/`` is a symbol that the interpreter has a special rule for -- evaluating it raises an error, as opposed to objects that evaluate to themselves or variable names that evaluate to objects. Does the same sort of thing explain the behavior of `...`? When the interpreter comes across `...` in the arguments during evaluation of a call, it trips a special argument-interpolating behavior? Peter __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical() fails to compare isS4() to TRUE
On 12-10-04 5:50 PM, Martin Morgan wrote: On 10/04/2012 02:23 PM, Duncan Murdoch wrote: On 12-10-04 4:57 PM, Martin Morgan wrote: > setClass("A", "integer") > isS4(new("A")) [1] TRUE > identical(isS4(new("A")), TRUE) [1] FALSE > sessionInfo() R Under development (unstable) (2012-10-04 r60876) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base I can confirm this in R-devel, but it doesn't happen in R-patched. In R-devel, isS4 is returning a vector marked as LGLSXP, but containing the value 16, not the usual 1 for TRUE. > .Internal(inspect(isS4(new("A" @4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16 I'm not going to have time to track this down and fix it. The bit is set at src/include/Rinternals.h:278 39103jmc #define IS_S4_OBJECT(x) ((x)->sxpinfo.gp & S4_OBJECT_MASK) but it's identical that has changed; it looks as though there are other similar operations in that header and probably elsewhere. In R-patched, it returned the value 1 for TRUE, so I'm not sure identical() has changed. Duncan Murdoch Martin Duncan Murdoch __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] identical() fails to compare isS4() to TRUE
On 10/04/2012 04:05 PM, Duncan Murdoch wrote: On 12-10-04 5:50 PM, Martin Morgan wrote: On 10/04/2012 02:23 PM, Duncan Murdoch wrote: On 12-10-04 4:57 PM, Martin Morgan wrote: > setClass("A", "integer") > isS4(new("A")) [1] TRUE > identical(isS4(new("A")), TRUE) [1] FALSE > sessionInfo() R Under development (unstable) (2012-10-04 r60876) Platform: x86_64-unknown-linux-gnu (64-bit) locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8 [7] LC_PAPER=C LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base I can confirm this in R-devel, but it doesn't happen in R-patched. In R-devel, isS4 is returning a vector marked as LGLSXP, but containing the value 16, not the usual 1 for TRUE. > .Internal(inspect(isS4(new("A" @4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16 I'm not going to have time to track this down and fix it. The bit is set at src/include/Rinternals.h:278 39103jmc #define IS_S4_OBJECT(x) ((x)->sxpinfo.gp & S4_OBJECT_MASK) but it's identical that has changed; it looks as though there are other similar operations in that header and probably elsewhere. In R-patched, it returned the value 1 for TRUE, so I'm not sure identical() has changed. R version 2.15.1 Patched (2012-09-24 r60798) has > isS4 function (object) .Call("R_isS4Object", object, PACKAGE = "base") with objects.c:1531 SEXP R_isS4Object(SEXP object) { /* wanted: return isS4(object) ? mkTrue() : mkFalse(); */ return IS_S4_OBJECT(object) ? mkTrue() : mkFalse(); ; } where R-devel has > isS4 function (object) .Primitive("isS4") with coerce.c:1843 case 51:/* isS4 */ LOGICAL(ans)[0] = IS_S4_OBJECT(CAR(args)); break; made in r60395 Martin Duncan Murdoch Martin Duncan Murdoch -- Computational Biology / Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109 Location: Arnold Building M1 B861 Phone: (206) 667-2793 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to build a list with missing values? What is missing, anyway?
On Thu, Oct 4, 2012 at 6:12 PM, Bert Gunter wrote: > The R Language definition manual explains all of this. Read it. I always reread that before I post to this list. The only relevant mention of "missing" in the R Language Definition that I could find were in section 4.1.2 on page 25, and that section did not answer anything about "missing" as a special symbol (e.g. that my be put in a formal argument list.) There is another mention of `...` in section 4.3.2 and it does not explain some things, such as under what circumstances one would get a "`...` used in incorrect context" error. Peter >> >> Thanks. >> >> Just so I have my mental model correct, I'm gathering that missing/`` >> is a symbol that the interpreter has a special rule for -- evaluating >> it raises an error, as opposed to objects that evaluate to themselves >> or variable names that evaluate to objects. >> >> Does the same sort of thing explain the behavior of `...`? When the >> interpreter comes across `...` in the arguments during evaluation of a >> call, it trips a special argument-interpolating behavior? >> >> Peter >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel > > > > -- > > Bert Gunter > Genentech Nonclinical Biostatistics > > Internal Contact Info: > Phone: 467-7374 > Website: > http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] How to build a list with missing values? What is missing, anyway?
On Oct 5, 2012, at 03:31 , Peter Meilstrup wrote: > On Thu, Oct 4, 2012 at 6:12 PM, Bert Gunter wrote: >> The R Language definition manual explains all of this. Read it. > > I always reread that before I post to this list. > > The only relevant mention of "missing" in the R Language Definition > that I could find were in section 4.1.2 on page 25, and that section > did not answer anything about "missing" as a special symbol (e.g. that > my be put in a formal argument list.) It's not defined as an R object. We did consider doing so briefly, but the semantics would be weird, as you have experienced: An R object that can't be evaluated without triggering an error. It is something that is used by the internals, so that missing values propagate with lazy evaluation semantics. It happens to be implemented as the equivalent of as.name(""), but that's not to be relied upon. It leaks out into user space because arguments can have missing defaults, and there needs to be a way to manipulate argument lists. It also slips out in substitute() with no argument, and I suppose that too is at least semi-intentional. It is not completely obvious that the two are necessarily identical. It's one of those things where you shouldn't rely on behavior outside of well-establihed idioms. If you do, and it breaks, you get to keep both pieces... I suppose we could try documenting it better, but it would be at the risk of authorizing semantics that we might want to change (e.g., it is far from clear that we should be allowing direct assignment of missings to simple symbols). ("..." and friends is a somewhat different kettle of fish which I don't want to go into just now.) -pd > There is another mention of `...` in section 4.3.2 and it does not > explain some things, such as under what circumstances one would get a > "`...` used in incorrect context" error. > > Peter > >>> >>> Thanks. >>> >>> Just so I have my mental model correct, I'm gathering that missing/`` >>> is a symbol that the interpreter has a special rule for -- evaluating >>> it raises an error, as opposed to objects that evaluate to themselves >>> or variable names that evaluate to objects. >>> >>> Does the same sort of thing explain the behavior of `...`? When the >>> interpreter comes across `...` in the arguments during evaluation of a >>> call, it trips a special argument-interpolating behavior? >>> -- Peter Dalgaard, Professor, Center for Statistics, Copenhagen Business School Solbjerg Plads 3, 2000 Frederiksberg, Denmark Phone: (+45)38153501 Email: pd@cbs.dk Priv: pda...@gmail.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel