Re: [Rd] Fastest non-overlapping binning mean function out there?

2012-10-04 Thread Henrik Bengtsson
Thank you all - I appreciate all your suggestions.  My post was mainly
to check if there's already an existing and fast implementation out
there.  I've ended up adopting Martin's first proposal.  Something
like this:

#include 
#include 
#include 

SEXP binMeans(SEXP y, SEXP x, SEXP bx, SEXP retCount) {
  int ny = Rf_length(y), nx = Rf_length(x), nb = Rf_length(bx)-1;
  double *yp = REAL(y), *xp = REAL(x), *bxp = REAL(bx);
  SEXP ans = PROTECT(NEW_NUMERIC(nb));
  double *ansp = REAL(ans);
  int retcount = LOGICAL(retCount)[0];
  SEXP count = NULL;
  int *countp = NULL;
  int i = 0, j = 0, n = 0, iStart=0;
  double sum = 0.0;

  // Assert same lengths of 'x' and 'y'
  if (ny != nx) {
error("Argument 'y' and 'x' are of different lengths: %d != %d", ny, nx);
  }

  if (retcount) {
count = PROTECT(NEW_INTEGER(nb));
countp = INTEGER(count);
  }

  // Skip to the first bin
  while (xp[iStart] < bxp[0]) {
++iStart;
  }

  // For each x...
  for (i = iStart; i < nx; ++i) {
// Skip to a new bin?
while (xp[i] >= bxp[j+1]) {
  // Update statistic of current bin?
  if (retcount) { countp[j] = n; }
  ansp[j] = n > 0 ? sum / n : 0;
  sum = 0.0;
  n = 0;
  // ...and move to next
  ++j;
}
sum += yp[i];
n += 1;
  }

  // Update statistic of last bin
  ansp[j] = n > 0 ? sum / n : 0;
  if (retcount) {
countp[j] = n;
setAttrib(ans, install("count"), count);
UNPROTECT(1); // 'count'
  }
  UNPROTECT(1); // 'ans'

  return ans;
} // binMeans()

BTW, "10,000-millions" was supposed to read as a range (10^4 to 10^6)
and not a scalar (10^10), but the misinterpretation got Martin to
point out some of the exciting work extending R core to support "very
large" integers.  And, as Herve pointed out, I forgot to say that the
number of bins could be of that range as well, or maybe an order of
magnitude less (say ~1-100 data points per bin).

/Henrik

On Wed, Oct 3, 2012 at 10:50 AM, Martin Morgan  wrote:
> On 10/3/2012 6:47 AM, Martin Morgan wrote:
>>
>> On 10/02/2012 05:19 PM, Henrik Bengtsson wrote:
>>>
>>> Hi,
>>>
>>> I'm looking for a super-duper fast mean/sum binning implementation
>>> available in R, and before implementing z = binnedMeans(x y) in native
>>> code myself, does any one know of an existing function/package for
>>> this?  I'm sure it already exists.  So, given data (x,y) and B bins
>>> bx[1] < bx[2] < ... < bx[B] < bx[B+1], I'd like to calculate the
>>> binned means (or sums) 'z' such that z[1] = mean(x[bx[1] <= x & x <
>>> bx[2]]), z[2] = mean(x[bx[2] <= x & x < bx[3]]),  z[B].  Let's
>>> assume there are no missing values and 'x' and 'bx' is already
>>> ordered.  The length of 'x' is in the order of 10,000-millions.  The
>>> number of elements in each bin vary.
>>
>>
>> since x and bx are ordered (sorting x would be expensive), the C code
>> seems
>> pretty straight-forward and memory-efficient -- create a result vector as
>> long
>> as bx, then iterate through x accumulating n and it's sum until x[i] >=
>> bx[i].
>> (I think R's implementation of mean() actually pays more attention to
>> numerical
>> error, but with this much data...)
>>
>> library(inline)
>> binmean <- cfunction(signature(x="numeric", bx="numeric"),
>> "   int nx = Rf_length(x), nb = Rf_length(bx), i, j, n;
>
>
> I'll take my solution back. The problem specification says that x has
> 10,000-millions of elements, so we need to use R-devel and
>
> R_xlen_t nx = Rf_xlength(x), nb = Rf_xlength(bx), i, j, n;
>
> but there are two further issues. The first is that on my system
>
> p$ gcc --version
> gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
>
> I have __SIZEOF_SIZE_T__ 8 but
>
> (a) the test in Rinternals.h:52 is of SIZEOF_SIZE_T, which is undefined. I
> end up with typedef int R_xlen_t (e.g., after R CMD SHLIB, instead of using
> the inline package, to avoid that level of uncertainty) and then
>
>
>>  SEXP ans = PROTECT(NEW_NUMERIC(nb));
>>  double sum, *xp = REAL(x), *bxp = REAL(bx), *ansp = REAL(ans);
>>  sum = j = n = 0;
>>  for (i = 0; i < nx; ++i) {
>
>
> (b) because nx is a signed type, and since nx > .Machine$integer.max is
> represented as a negative number, I don't ever iterate this loop. So I'd
> have to be more clever if I wanted this to work on all platforms.
>
> For what it's worth, Herve's solution is also problematic
>
>> xx = findInterval(bx, x)
> Error in findInterval(bx, x) : long vector 'vec' is not supported
>
> A different strategy for the problem at hand would seem to involve iteration
> over sequences of x, collecting sufficient statistics (n, sum) for each
> iteration, and calculating the mean at the end of the day. This might also
> result in better memory use and allow parallel processing.
>
> Martin
>
>
>>  while (xp[i] >= bxp[j]) {
>>   ansp[j++] = n > 0 ? sum / n : 0;
>>   sum = n = 0;
>>  }
>>  n += 1;
>>  sum += xp[i];
>>  }
>>  ansp[j] = n > 0 ? sum / n : 0;
>>   

[Rd] identical() fails to compare isS4() to TRUE

2012-10-04 Thread Martin Morgan


> setClass("A", "integer")
> isS4(new("A"))
[1] TRUE
> identical(isS4(new("A")), TRUE)
[1] FALSE

> sessionInfo()
R Under development (unstable) (2012-10-04 r60876)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical() fails to compare isS4() to TRUE

2012-10-04 Thread Duncan Murdoch

On 12-10-04 4:57 PM, Martin Morgan wrote:


  > setClass("A", "integer")
  > isS4(new("A"))
[1] TRUE
  > identical(isS4(new("A")), TRUE)
[1] FALSE

  > sessionInfo()
R Under development (unstable) (2012-10-04 r60876)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
   [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
   [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
   [7] LC_PAPER=C LC_NAME=C
   [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base



I can confirm this in R-devel, but it doesn't happen in R-patched.  In 
R-devel, isS4 is returning a vector marked as LGLSXP, but containing the 
value 16, not the usual 1 for TRUE.


> .Internal(inspect(isS4(new("A"
@4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16

I'm not going to have time to track this down and fix it.

Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical() fails to compare isS4() to TRUE

2012-10-04 Thread Martin Morgan

On 10/04/2012 02:23 PM, Duncan Murdoch wrote:

On 12-10-04 4:57 PM, Martin Morgan wrote:


  > setClass("A", "integer")
  > isS4(new("A"))
[1] TRUE
  > identical(isS4(new("A")), TRUE)
[1] FALSE

  > sessionInfo()
R Under development (unstable) (2012-10-04 r60876)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
   [1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
   [3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
   [5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
   [7] LC_PAPER=C LC_NAME=C
   [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base



I can confirm this in R-devel, but it doesn't happen in R-patched.  In R-devel,
isS4 is returning a vector marked as LGLSXP, but containing the value 16, not
the usual 1 for TRUE.

 > .Internal(inspect(isS4(new("A"
@4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16

I'm not going to have time to track this down and fix it.


The bit is set at src/include/Rinternals.h:278

 39103jmc #define IS_S4_OBJECT(x) ((x)->sxpinfo.gp & S4_OBJECT_MASK)

but it's identical that has changed; it looks as though there are other similar 
operations in that header and probably elsewhere.


Martin



Duncan Murdoch



--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] How to build a list with missing values? What is missing, anyway?

2012-10-04 Thread Peter Meilstrup
On Wed, Oct 3, 2012 at 11:21 PM, Josh O'Brien  wrote:
>
>>Say I have argnames <- c("a", "b", "c").
> >From that I want to construct the equivalent of alist(a=, b=, c=).
>
> Here's a one liner that'll do that for you:
>
> argnames <- letters[1:3]
> setNames(rep(list(bquote()), length(argnames)), argnames)

Thanks.

Just so I have my mental model correct, I'm gathering that missing/``
is a symbol that the interpreter has a special rule for -- evaluating
it raises an error, as opposed to objects that evaluate to themselves
or variable names that evaluate to objects.

Does the same sort of thing explain the behavior of `...`? When the
interpreter comes across `...` in the arguments during evaluation of a
call, it trips a special argument-interpolating behavior?

Peter

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical() fails to compare isS4() to TRUE

2012-10-04 Thread Duncan Murdoch

On 12-10-04 5:50 PM, Martin Morgan wrote:

On 10/04/2012 02:23 PM, Duncan Murdoch wrote:

On 12-10-04 4:57 PM, Martin Morgan wrote:


   > setClass("A", "integer")
   > isS4(new("A"))
[1] TRUE
   > identical(isS4(new("A")), TRUE)
[1] FALSE

   > sessionInfo()
R Under development (unstable) (2012-10-04 r60876)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base



I can confirm this in R-devel, but it doesn't happen in R-patched.  In R-devel,
isS4 is returning a vector marked as LGLSXP, but containing the value 16, not
the usual 1 for TRUE.

  > .Internal(inspect(isS4(new("A"
@4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16

I'm not going to have time to track this down and fix it.


The bit is set at src/include/Rinternals.h:278

   39103jmc #define IS_S4_OBJECT(x) ((x)->sxpinfo.gp & S4_OBJECT_MASK)

but it's identical that has changed; it looks as though there are other similar
operations in that header and probably elsewhere.


In R-patched, it returned the value 1 for TRUE, so I'm not sure 
identical() has changed.


Duncan Murdoch



Martin



Duncan Murdoch





__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] identical() fails to compare isS4() to TRUE

2012-10-04 Thread Martin Morgan

On 10/04/2012 04:05 PM, Duncan Murdoch wrote:

On 12-10-04 5:50 PM, Martin Morgan wrote:

On 10/04/2012 02:23 PM, Duncan Murdoch wrote:

On 12-10-04 4:57 PM, Martin Morgan wrote:


   > setClass("A", "integer")
   > isS4(new("A"))
[1] TRUE
   > identical(isS4(new("A")), TRUE)
[1] FALSE

   > sessionInfo()
R Under development (unstable) (2012-10-04 r60876)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
[1] LC_CTYPE=en_US.UTF-8   LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base



I can confirm this in R-devel, but it doesn't happen in R-patched.  In R-devel,
isS4 is returning a vector marked as LGLSXP, but containing the value 16, not
the usual 1 for TRUE.

  > .Internal(inspect(isS4(new("A"
@4bfbed8 10 LGLSXP g0c1 [] (len=1, tl=0) 16

I'm not going to have time to track this down and fix it.


The bit is set at src/include/Rinternals.h:278

   39103jmc #define IS_S4_OBJECT(x) ((x)->sxpinfo.gp & S4_OBJECT_MASK)

but it's identical that has changed; it looks as though there are other similar
operations in that header and probably elsewhere.


In R-patched, it returned the value 1 for TRUE, so I'm not sure identical() has
changed.


R version 2.15.1 Patched (2012-09-24 r60798) has

> isS4
function (object)
.Call("R_isS4Object", object, PACKAGE = "base")

with objects.c:1531

SEXP R_isS4Object(SEXP object)
{
/* wanted: return isS4(object) ? mkTrue() : mkFalse(); */
return IS_S4_OBJECT(object) ? mkTrue() : mkFalse(); ;
}

where R-devel has

> isS4
function (object)  .Primitive("isS4")

with coerce.c:1843

case 51:/* isS4 */
LOGICAL(ans)[0] = IS_S4_OBJECT(CAR(args));
break;

made in r60395

Martin



Duncan Murdoch



Martin



Duncan Murdoch








--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] How to build a list with missing values? What is missing, anyway?

2012-10-04 Thread Peter Meilstrup
On Thu, Oct 4, 2012 at 6:12 PM, Bert Gunter  wrote:
> The R Language definition manual explains all of this. Read it.

I always reread that before I post to this list.

The only relevant mention of "missing" in the R Language Definition
that I could find were in section 4.1.2 on page 25, and that section
did not answer anything about "missing" as a special symbol (e.g. that
my be put in a formal argument list.)

There is another mention of `...` in section 4.3.2 and it does not
explain some things, such as under what circumstances one would get a
"`...` used in incorrect context" error.

Peter

>>
>> Thanks.
>>
>> Just so I have my mental model correct, I'm gathering that missing/``
>> is a symbol that the interpreter has a special rule for -- evaluating
>> it raises an error, as opposed to objects that evaluate to themselves
>> or variable names that evaluate to objects.
>>
>> Does the same sort of thing explain the behavior of `...`? When the
>> interpreter comes across `...` in the arguments during evaluation of a
>> call, it trips a special argument-interpolating behavior?
>>
>> Peter
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] How to build a list with missing values? What is missing, anyway?

2012-10-04 Thread peter dalgaard

On Oct 5, 2012, at 03:31 , Peter Meilstrup wrote:

> On Thu, Oct 4, 2012 at 6:12 PM, Bert Gunter  wrote:
>> The R Language definition manual explains all of this. Read it.
> 
> I always reread that before I post to this list.
> 
> The only relevant mention of "missing" in the R Language Definition
> that I could find were in section 4.1.2 on page 25, and that section
> did not answer anything about "missing" as a special symbol (e.g. that
> my be put in a formal argument list.)

It's not defined as an R object. We did consider doing so briefly, but the 
semantics would be weird, as you have experienced: An R object that can't be 
evaluated without triggering an error. 

It is something that is used by the internals, so that missing values propagate 
with lazy evaluation semantics. It happens to be implemented as the equivalent 
of as.name(""), but that's not to be relied upon. It leaks out into user space 
because arguments can have missing defaults, and there needs to be a way to 
manipulate argument lists. It also slips out in substitute() with no argument, 
and I suppose that too is at least semi-intentional. It is not completely 
obvious that the two are necessarily identical.

It's one of those things where you shouldn't rely on behavior outside of 
well-establihed idioms. If you do, and it breaks, you get to keep both pieces...

I suppose we could try documenting it better, but it would be at the risk of 
authorizing semantics that we might want to change (e.g., it is far from clear 
that we should be allowing direct assignment of missings to simple symbols).

("..." and friends is a somewhat different kettle of fish which I don't want to 
go into just now.)

-pd

> There is another mention of `...` in section 4.3.2 and it does not
> explain some things, such as under what circumstances one would get a
> "`...` used in incorrect context" error.
> 
> Peter
> 
>>> 
>>> Thanks.
>>> 
>>> Just so I have my mental model correct, I'm gathering that missing/``
>>> is a symbol that the interpreter has a special rule for -- evaluating
>>> it raises an error, as opposed to objects that evaluate to themselves
>>> or variable names that evaluate to objects.
>>> 
>>> Does the same sort of thing explain the behavior of `...`? When the
>>> interpreter comes across `...` in the arguments during evaluation of a
>>> call, it trips a special argument-interpolating behavior?
>>> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel