[Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

2017-06-02 Thread Hervé Pagès

Hi,

I have a long numeric vector 'xx' and I want to use sum() to count
the number of elements that satisfy some criteria like non-zero
values or values lower than a certain threshold etc...

The problem is: sum() returns an NA (with a warning) if the count
is greater than 2^31 - 1 (.Machine$integer.max). For example:

  > xx <- runif(3e9)
  > sum(xx < 0.9)
  [1] NA
  Warning message:
  In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))

This already takes a long time and doing sum(as.numeric(.)) would
take even longer and require allocation of 24 GB of memory just to
store an intermediate numeric vector made of 0s and 1s. Plus, having
to do sum(as.numeric(.)) every time I need to count things is not
convenient and is easy to forget.
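The overflow can be reproduced without a 3e9-element vector. The lines below are a minimal sketch, not taken from the original report: the variable names are illustrative, and the warning is suppressed only to keep the output quiet.

```r
# sum() on integer or logical input returns an integer, so the result is
# limited to .Machine$integer.max (2^31 - 1).
int_max <- .Machine$integer.max

# Logical input is summed as integer on short vectors.
small_type <- typeof(sum(c(TRUE, FALSE, TRUE)))

# Same overflow as in the report, triggered with two integers instead of
# billions of logicals.
overflowed <- suppressWarnings(sum(c(int_max, 1L)))

# Converting to double first avoids the overflow, at the cost of a copy.
as_double <- sum(as.numeric(c(int_max, 1L)))

# Size of the intermediate double vector for 3e9 elements, in GB (10^9 bytes).
intermediate_gb <- 3e9 * 8 / 1e9

print(small_type)
print(overflowed)
print(as_double)
print(intermediate_gb)
```

The last value is where the 24 GB figure comes from: 8 bytes per double times 3e9 elements.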

It seems that sum() on a logical vector could be modified to return
the count as a double when it cannot be represented as an integer.
Note that length() already does this for long vectors, so that wouldn't
create a precedent. Also, FWIW, prod() avoids the problem by always
returning a double, whatever the type of the input (except on a complex
vector).
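To make the comparison concrete, here is a rough sketch of the result types involved, followed by one way to get a double count at user level without a full-size intermediate vector. The function count_true_chunked() and its chunk_size argument are hypothetical names introduced for illustration; this is not the proposed patch to sum(), just an approximation of the behavior it would give.

```r
# Result types on short inputs: sum() keeps integer, prod() promotes to double.
sum_type    <- typeof(sum(c(TRUE, TRUE)))
prod_type   <- typeof(prod(1:3))
length_type <- typeof(length(1:3))
print(c(sum = sum_type, prod = prod_type, length = length_type))

# Hypothetical helper: count TRUE values chunk by chunk, accumulating in a
# double so the total is not limited by .Machine$integer.max.
# Assumes chunk_size is well below .Machine$integer.max so that each
# per-chunk integer sum cannot overflow.
count_true_chunked <- function(x, chunk_size = 1e6, na.rm = FALSE) {
  n <- length(x)
  total <- 0   # double accumulator
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Only one chunk of x is materialized at a time
    total <- total + sum(x[start:end], na.rm = na.rm)
    start <- end + 1
  }
  total
}

print(count_true_chunked(c(TRUE, FALSE, TRUE, TRUE, FALSE), chunk_size = 2))
```

The trade-off is an R-level loop and one temporary chunk per iteration, which is why doing this inside sum() itself would be preferable.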

I can provide a patch if this change sounds reasonable.

Cheers,
H.

--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

2017-06-02 Thread Henrik Bengtsson

I second this feature request (it's understandable that this and
possibly other parts of the code were left behind / forgotten after the
introduction of long vectors).

I think mean() avoids full copies, so in the meantime, you can work
around this limitation using:

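countTRUE <- function(x, na.rm = FALSE) {
  nx <- length(x)
  # Short vector: the count always fits in an integer, so sum() is safe
  if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
  # Long vector: rescale the proportion of TRUE values by the length.
  # Note: with na.rm = TRUE, mean() divides by the non-NA count, so this
  # is the exact count only when x contains no NA values.
  nx * mean(x, na.rm = na.rm)
}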

(not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)
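On the rounding question, a small sketch may help. Rescaling a proportion back to a count is not guaranteed to be exact in double arithmetic, and 49 * (1/49) is a commonly cited case. The helper count_true_rounded() below is a hypothetical variant that rounds the rescaled value; it assumes x has no NA values and that the true count is far below 2^53, so the error stays well under 0.5.

```r
# A proportion rescaled by n may miss the integer by a tiny amount.
n <- 49
k <- 1
rescaled <- n * (k / n)
print(rescaled == k)          # can be FALSE with IEEE 754 doubles
print(round(rescaled) == k)   # rounding recovers the integer count

# Hypothetical variant of the workaround that rounds to the nearest whole
# number. Assumes x contains no NA values.
count_true_rounded <- function(x) {
  round(length(x) * mean(x))
}

print(count_true_rounded(c(TRUE, FALSE, TRUE, TRUE)))
```

Whether the unrounded product is ever off for a real long vector depends on how mean() accumulates internally, so the round() call is a cheap safeguard rather than a demonstrated necessity.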

x <- rep(TRUE, times = .Machine$integer.max+1)
object.size(x)
## 8589934632 bytes

p <- profmem::profmem( n <- countTRUE(x) )
str(n)
## num 2.15e+09
print(n == .Machine$integer.max + 1)
## [1] TRUE

print(p)
## Rprofmem memory profiling of:
## n <- countTRUE(x)
##
## Memory allocations:
##  bytes calls
## total 0


FYI / related: I've just updated matrixStats::sum2() to support
logicals (develop branch) and I'll also try to update
matrixStats::count() to count beyond .Machine$integer.max.

/Henrik
