>>>>> Hervé Pagès <hpa...@fredhutch.org>
>>>>>     on Tue, 30 Jan 2018 13:30:18 -0800 writes:

> Hi Martin, Henrik,
> Thanks for the follow up.

> @Martin: I vote for 2) without *any* hesitation :-)

> (and uniformity could be restored at some point in the
> future by having prod(), rowSums(), colSums(), and others
> align with the behavior of length() and sum())

As a matter of fact, I had procrastinated and worked at
implementing '2)' already a bit on the weekend and made it work
- more or less.  It needs a bit more work, and I had also been
considering replacing the numbers in the current overflow check

    if (ii++ > 1000) {                                                  \
        ii = 0;                                                         \
        if (s > 9000000000000000L || s < -9000000000000000L) {          \
            if(!updated) updated = TRUE;                                \
            *value = NA_INTEGER;                                        \
            warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
            return updated;                                             \
        }                                                               \
    }                                                                   \

i.e. think of tweaking the '1000' and '9000000000000000L',
but decided to leave these and add comments there about why.  For
the moment.
They may look arbitrary, but are not at all: If you multiply
them (which looks correct, if we check the sum 's' only every 1000-th
time ...((still not sure they *are* correct))) you get 9*10^18
which is only slightly smaller than 2^63 - 1 which may be the
maximal "LONG_INT" integer we have.

So, in the end, at least for now, we do not quite go all the way
but overflow a bit earlier,... but do potentially gain a bit of
speed, notably with the ITERATE_BY_REGION(..) macros
(which I did not show above).

Will hopefully become available in R-devel real soon now.

Martin

> Cheers,
> H.
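
To make the arithmetic behind the two constants concrete, here is a minimal C sketch of a 64-bit accumulation that is only checked periodically. It is an illustration under stated assumptions, not the code in R's sources: the names checked_isum, CHECK_EVERY and SUM_LIMIT are hypothetical, int64_t stands in for R's LONG_INT, and NA handling and the warning are left out. The compile-time assertion records the one relation stated in the message above, 1000 * 9*10^15 = 9*10^18 < 2^63 - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the values are the two constants quoted in the thread. */
#define CHECK_EVERY 1000
#define SUM_LIMIT   9000000000000000LL   /* 9e15 */

/* The product of the two constants stays below the largest int64_t value. */
_Static_assert((long long)CHECK_EVERY * SUM_LIMIT < INT64_MAX,
               "1000 * 9e15 must stay below 2^63 - 1");

/* Sum n ints into a 64-bit accumulator, testing the running sum against
 * SUM_LIMIT only about once per CHECK_EVERY elements.
 * Returns 1 and stores the sum in *out when the limit was never exceeded
 * at a check, otherwise returns 0 and leaves *out untouched. */
int checked_isum(const int *x, size_t n, int64_t *out)
{
    assert(out != NULL);
    int64_t s = 0;
    int ii = 0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
        if (ii++ > CHECK_EVERY) {        /* same counter pattern as the macro */
            ii = 0;
            if (s > SUM_LIMIT || s < -SUM_LIMIT)
                return 0;                /* caller decides how to report this */
        }
    }
    *out = s;
    return 1;
}
```

Between two checks the running sum can move by at most roughly 1000 * 2^31, about 2*10^12, so a limit of 9*10^15 leaves a wide margin below 2^63 - 1 while keeping the comparison out of the inner loop most of the time.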

> On 01/27/2018 03:06 AM, Martin Maechler wrote:
>>>>>>> Henrik Bengtsson <henrik.bengts...@gmail.com>
>>>>>>>     on Thu, 25 Jan 2018 09:30:42 -0800 writes:
>>
>> > Just following up on this old thread since matrixStats 0.53.0 is now
>> > out, which supports this use case:
>>
>> >> x <- rep(TRUE, times = 2^31)
>>
>> >> y <- sum(x)
>> >> y
>> > [1] NA
>> > Warning message:
>> > In sum(x) : integer overflow - use sum(as.numeric(.))
>>
>> >> y <- matrixStats::sum2(x, mode = "double")
>> >> y
>> > [1] 2147483648
>> >> str(y)
>> > num 2.15e+09
>>
>> > No coercion is taking place, so the memory overhead is zero:
>>
>> >> profmem::profmem(y <- matrixStats::sum2(x, mode = "double"))
>> > Rprofmem memory profiling of:
>> > y <- matrixStats::sum2(x, mode = "double")
>>
>> > Memory allocations:
>> > bytes calls
>> > total 0
>>
>> > /Henrik
>>
>> Thank you, Henrik, for the reminder.
>>
>> Back in June, I had mentioned to Hervé and R-devel that
>> 'logical' should continue to be treated as 'integer' as in all
>> arithmetic in (S and) R.  Hervé did mention the isum()
>> function in the C code which is relevant here .. which does have
>> a LONG INT counter already -- *but* if we consider that sum()
>> has '...' i.e. a conceptually arbitrary number of long vector
>> integer arguments that counter won't suffice even there.
>>
>> Before talking about implementation / patch, I think we should
>> consider 2 possible goals of a change --- I agree the status quo
>> is not a real option
>>
>> 1) sum(x) for logical and integer x would return a double
>>    in any case and overflow should not happen (unless for
>>    the case where the result would be larger than
>>    .Machine$double.xmax which I think will not be possible
>>    even with "arbitrary" nargs() of sum).
>>
>> 2) sum(x) for logical and integer x should return an integer in
>>    all cases where there is no overflow, including returning
>>    NA_integer_ in case of NAs.
>>    If there would be an overflow it must be detected "in time"
>>    and the result should be double.
>>
>> The big advantage of 2) is that it is back compatible in 99.x %
>> of use cases, and another advantage that it may be a very small
>> bit more efficient.  Also, in the case of "counting" (logical),
>> it is nice to get an integer instead of double when we can --
>> entirely analogously to the behavior of length() which returns
>> integer whenever possible.
>>
>> The advantage of 1) is uniformity.
>>
>> We should (at least provisionally) decide between 1) and 2) and
>> then go for that.  It could be that going for 1) may have bad
>> compatibility-consequences in package space, because indeed we
>> had documented sum() would be integer for logical and integer
>> arguments.
>>
>> I currently don't really have time to
>> {work on implementing + dealing with the consequences}
>> for either ..
>>
>> Martin
>>
>> > On Fri, Jun 2, 2017 at 1:58 PM, Henrik Bengtsson
>> > <henrik.bengts...@gmail.com> wrote:
>> >> I second this feature request (it's understandable that this and
>> >> possibly other parts of the code were left behind / forgotten after
>> >> the introduction of long vectors).
>> >>
>> >> I think mean() avoids full copies, so in the meanwhile, you can work
>> >> around this limitation using:
>> >>
>> >> countTRUE <- function(x, na.rm = FALSE) {
>> >>   nx <- length(x)
>> >>   if (nx < .Machine$integer.max) return(sum(x, na.rm = na.rm))
>> >>   nx * mean(x, na.rm = na.rm)
>> >> }
>> >>
>> >> (not sure if one needs to worry about rounding errors, i.e.
>> >> where n %% 1 != 0)
>> >>
>> >> x <- rep(TRUE, times = .Machine$integer.max+1)
>> >> object.size(x)
>> >> ## 8589934632 bytes
>> >>
>> >> p <- profmem::profmem( n <- countTRUE(x) )
>> >> str(n)
>> >> ## num 2.15e+09
>> >> print(n == .Machine$integer.max + 1)
>> >> ## [1] TRUE
>> >>
>> >> print(p)
>> >> ## Rprofmem memory profiling of:
>> >> ## n <- countTRUE(x)
>> >> ##
>> >> ## Memory allocations:
>> >> ## bytes calls
>> >> ## total 0
>> >>
>> >>
>> >> FYI / related: I've just updated matrixStats::sum2() to support
>> >> logicals (develop branch) and I'll also try to update
>> >> matrixStats::count() to count beyond .Machine$integer.max.
>> >>
>> >> /Henrik
>> >>
>> >> On Fri, Jun 2, 2017 at 4:05 AM, Hervé Pagès <hpa...@fredhutch.org> wrote:
>> >>> Hi,
>> >>>
>> >>> I have a long numeric vector 'xx' and I want to use sum() to count
>> >>> the number of elements that satisfy some criteria like non-zero
>> >>> values or values lower than a certain threshold etc...
>> >>>
>> >>> The problem is: sum() returns an NA (with a warning) if the count
>> >>> is greater than 2^31. For example:
>> >>>
>> >>> > xx <- runif(3e9)
>> >>> > sum(xx < 0.9)
>> >>> [1] NA
>> >>> Warning message:
>> >>> In sum(xx < 0.9) : integer overflow - use sum(as.numeric(.))
>> >>>
>> >>> This already takes a long time and doing sum(as.numeric(.)) would
>> >>> take even longer and require allocation of 24 GB of memory just to
>> >>> store an intermediate numeric vector made of 0s and 1s. Plus, having
>> >>> to do sum(as.numeric(.)) every time I need to count things is not
>> >>> convenient and is easy to forget.
>> >>>
>> >>> It seems that sum() on a logical vector could be modified to return
>> >>> the count as a double when it cannot be represented as an integer.
>> >>> Note that length() already does this so that wouldn't create a
>> >>> precedent. Also and FWIW prod() avoids the problem by always returning
>> >>> a double, whatever the type of the input is (except on a complex
>> >>> vector).
>> >>>
>> >>> I can provide a patch if this change sounds reasonable.
>> >>>
>> >>> Cheers,
>> >>> H.
>> >>>
>> >>> --
>> >>> Hervé Pagès
>>
>>
> --
> Hervé Pagès
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
> E-mail: hpa...@fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
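
Goal 2) from the discussion above (return an integer whenever the result fits, otherwise a double) can be sketched in C as follows. This is a hypothetical illustration, not Martin's implementation: sum_result and sum_int_or_double are invented names, NA handling is omitted, and the 64-bit accumulator is assumed not to overflow for the input sizes used. The lower bound is -INT_MAX rather than INT_MIN because R reserves INT_MIN to represent NA_integer_.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tagged result: either an int or a double is meaningful. */
typedef struct {
    int    is_double;  /* 0: result is in .i, 1: result is in .d */
    int    i;
    double d;
} sum_result;

/* Accumulate in 64 bits, then pick the narrowest representation:
 * an int when the total lies in [-INT_MAX, INT_MAX], a double otherwise.
 * Assumption: no NA values and no overflow of the 64-bit accumulator. */
sum_result sum_int_or_double(const int *x, size_t n)
{
    assert(x != NULL || n == 0);
    int64_t s = 0;
    for (size_t k = 0; k < n; k++)
        s += x[k];

    sum_result r = {0, 0, 0.0};
    if (s > INT_MAX || s < -INT_MAX) {
        r.is_double = 1;
        r.d = (double) s;   /* exact for |s| < 2^53 */
    } else {
        r.i = (int) s;
    }
    return r;
}
```

With this shape, counting TRUE values in a vector of fewer than 2^31 elements still yields an integer, which mirrors the behavior of length() that the thread cites, while larger counts switch to double instead of producing NA.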