Just following up on this old thread since matrixStats 0.53.0 is now out, which supports this use case:

> x <- rep(TRUE, times = 2^31)
> y <- sum(x)
> y
[1] NA
Warning message:
In sum(x) : integer overflow - use sum(as.numeric(.))

> y <- matrixStats::sum2(x, mode = "double")
> y
[1] 2147483648
> str(y)
 num 2.15e+09

No coercion is taking place, so the memory overhead is zero:

> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
Rprofmem memory profiling of:
y <- matrixStats::sum2(x, mode = "double")

Memory allocations:
      bytes calls
total     0

/Henrik

On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson <henrik.bengts...@gmail.com> wrote:
> I second this feature request (it's understandable that this and
> possibly other parts of the code were left behind / forgotten after the
> introduction of long vectors).
>
> I think mean() avoids full copies, so in the meanwhile, you can work
> around this limitation using:
>
> countTRUE <- function(x, na.rm = FALSE) {
>   nx <- length(x)
>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
>   nx * mean(x, na.rm = na.rm)
> }
>
> (not sure if one needs to worry about rounding errors, i.e. where n %% 1 != 0)
>
> x <- rep(TRUE, times = .Machine$integer.max+1)
> object.size(x)
> ## 8589934632 bytes
>
> p <- profmem::profmem( n <- countTRUE(x) )
> str(n)
> ## num 2.15e+09
> print(n == .Machine$integer.max + 1)
> ## [1] TRUE
>
> print(p)
> ## Rprofmem memory profiling of:
> ## n <- countTRUE(x)
> ##
> ## Memory allocations:
> ##       bytes calls
> ## total     0
>
>
> FYI / related: I've just updated matrixStats::sum2() to support
> logicals (develop branch) and I'll also try to update
> matrixStats::count() to count beyond .Machine$integer.max.
>
> /Henrik
>
> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>> Hi,
>>
>> I have a long numeric vector 'xx' and I want to use sum() to count
>> the number of elements that satisfy some criteria like non-zero
>> values or values lower than a certain threshold etc...
>>
>> The problem is: sum() returns an NA (with a warning) if the count
>> is greater than 2^31 - 1.
>> For example:
>>
>>   > xx <- runif(3e9)
>>   > sum(xx < 0.9)
>>   [1] NA
>>   Warning message:
>>   In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>>
>> This already takes a long time and doing sum(as.numeric(.)) would
>> take even longer and require allocation of 24 GB of memory just to
>> store an intermediate numeric vector made of 0s and 1s. Plus, having
>> to do sum(as.numeric(.)) every time I need to count things is not
>> convenient and is easy to forget.
>>
>> It seems that sum() on a logical vector could be modified to return
>> the count as a double when it cannot be represented as an integer.
>> Note that length() already does this so that wouldn't create a
>> precedent. Also and FWIW prod() avoids the problem by always returning
>> a double, whatever the type of the input is (except on a complex
>> vector).
>>
>> I can provide a patch if this change sounds reasonable.
>>
>> Cheers,
>> H.
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpa...@fredhutch.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
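
For completeness, the count can also be accumulated in chunks when neither of the approaches in this thread is available. The following is only a minimal sketch: count_true_chunked() and its chunk_size default are hypothetical and not part of base R or matrixStats. Each chunk is summed as an integer, which cannot overflow as long as chunk_size stays below .Machine$integer.max, and the running total is kept as a double. The price is a temporary copy of one chunk at a time instead of zero allocations.

```r
# Hypothetical helper (not from base R or matrixStats): count TRUE values
# chunk by chunk, keeping the running total in a double.
count_true_chunked <- function(x, chunk_size = 1e6, na.rm = FALSE) {
  n <- length(x)
  total <- 0          # double accumulator, so the total cannot overflow
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # Per-chunk sum is an integer no larger than chunk_size
    total <- total + sum(x[start:end], na.rm = na.rm)
    start <- end + 1
  }
  total
}

# Small usage example (tiny chunk size only to exercise the loop)
x <- c(TRUE, FALSE, TRUE, NA, TRUE)
n <- count_true_chunked(x, chunk_size = 2, na.rm = TRUE)
print(n)
## [1] 3
```

Unlike the nx * mean(x) workaround, this adds up exact integer chunk counts, so the result stays exact for any count below 2^53.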