Wow, thanks for this extensive reply, Andre. As Duncan mentioned, it's probably not worth the bother for such a rare case. And my curiosity as to why the timing difference was happening has been more than satisfied.
Best, Wolfgang >-----Original Message----- >From: GILLIBERT, Andre [mailto:andre.gillib...@chu-rouen.fr] >Sent: Thursday, 02 September, 2021 21:54 >To: Viechtbauer, Wolfgang (SP); r-devel@r-project.org >Subject: RE: sum() and mean() for (ALTREP) integer sequences > >Hello, > >Please, find a long response below. > >== difference between mean(x) and sum(x)/length(x) == > >At the core, mean(x) and sum(x)/length(x) works very differently for real >numbers. >Mean is more accurate when applied to a vector with a small variance but a very >high mean, especially on platforms without extended precision numbers (i.e. >long >double is 64 bits rather than 80 bits). > >On a Windows 10 computer (with 80 bits long double): >k=1e6 >base=2^51 >realmean = mean(1:k) >sqa=(base+1):(base+k) # with ALTREP >sq=(base+1):(base+k)+0.0 # without ALTREP >mean(sq) - base - realmean # correctly returns zero >sum(sq)/k - base - realmean # incorrectly returns -0.5 >sum(sqa)/k - base - realmean # correctly returns zero > >On a GNU/Linux x86_64 computer with extended precision floating point disabled, >the difference is worse: >mean(sq) - base - realmean # correctly returns zero >sum(sq)/k - base - realmean # incorrectly returns -1180 >sum(sqa)/k - base - realmean # correctly returns zero > >Therefore (without ALTREP) sum can be inaccurate, due to floating point >rounding >errors. > >The good accuracy of mean() is due to a two-passes algorithm of the real_mean() >function in src/main/summary.c. > >The algorithm can be summarized by the following equivalent R code: >badmean=function(v) {sum(v)/length(v)} >goodmean=function(v) {center = badmean(v); center + badmean(v - center)} >goodmean(sq) - base - realmean # correctly returns zero > >== should mean() call ALTINTEGER_SUM ? == >As you noticed in the examples above, sum() is much more accurate with ALTREPs >than with the naive algorithm, because there are no cumulative rounding errors. > >Moreover, if we focus on INTSXP, the maximum possible value is lower : INT_MAX >= >2^31-1 >The sum of a large sequence (e.g. 1:(2^31-1)) can still overflow the exact >integer >range (0 to 2^53) of an FP64, and the division does not always round back to >the >correct value. > >bad = 1:189812531L >mean(bad) - sum(bad)/length(bad) # returns -1.5e-08 on a platform with FP80 >mean(bad) == 94906266 # correct value (the actual result is an integer) > >The implementation of mean() on INTSXP do not use the two-passes trick, but >relies >on a LDOUBLE (typically FP80 on the x86 platform) that is large enough to avoid >rounding bugs even with huge integer sequences. > >Unfortunately the ALTINTEGER_SUM interface returns at best a FP64, and so, >would >not return the FP64 closest to the actual mean for some sequences. > >Adding a MEAN function to the ALTREP interface could solve the problem. > >== can mean performance be improved easily ? == > >The mean() implementation for integers, supports ALTREPs with poor iteration >performances, using the slow INTEGER_ELT() macro. >Moreover, it converts each integer to LDOUBLE, which is slow. > >It can be improved using ITERATE_BY_REGION0 and using a 64 bits integer (if >available) that cannot overflow on small batches (size = 512). > ># before patching (on Ubuntu x86_64 Silvermont Celeron J1900) >x = 1:1e8 >y = 1:1e8+0L >system.time(mean(x)) # user 1.33 second >system.time(mean(y)) # user 0.32 second > ># after patching (on Ubuntu x86_64 Silvermont Celeron J1900) ># after patching (on Ubuntu x86_64 Silvermont Celeron J1900) >x = 1:1e8 >y = 1:1e8+0L >system.time(mean(x)) # user 0.29 second # more than x4 faster >system.time(mean(y)) # user 0.18 second # x 1.7 faster ! > >(patch attached to this message) > >The patch is not optimal. It should ideally use isum(), and risum() but these >functions are a mess, needing refactoring. For instance, they take a 'call' >argument and may display a warning message related to the sum() function. > >-- >Sincerely >Andre GILLIBERT >________________________________________ >De : R-devel <r-devel-boun...@r-project.org> de la part de Viechtbauer, >Wolfgang >(SP) <wolfgang.viechtba...@maastrichtuniversity.nl> >Envoyé : jeudi 2 septembre 2021 12:55:03 >À : r-devel@r-project.org >Objet : [Rd] sum() and mean() for (ALTREP) integer sequences > >Hi all, > >I am trying to understand the performance of functions applied to integer >sequences. Consider the following: > >### begin example ### > >library(lobstr) >library(microbenchmark) > >x <- sample(1e6) >obj_size(x) ># 4,000,048 B > >y <- 1:1e6 >obj_size(y) ># 680 B > ># So we can see that 'y' uses ALTREP. These are, as expected, the same: > >sum(x) ># [1] 500000500000 >sum(y) ># [1] 500000500000 > ># For 'x', we have to go through the trouble of actually summing up 1e6 >integers. ># For 'y', knowing its form, we really just need to do: > >1e6*(1e6+1)/2 ># [1] 500000500000 > ># which should be a whole lot faster. And indeed, it is: > >microbenchmark(sum(x),sum(y)) > ># Unit: nanoseconds ># expr min lq mean median uq max neval cld ># sum(x) 533452 595204.5 634266.90 613102.5 638271.5 978519 100 b ># sum(y) 183 245.5 446.09 338.5 447.0 3233 100 a > ># Now what about mean()? > >mean(x) ># [1] 500000.5 >mean(y) ># [1] 500000.5 > ># which is the same as > >(1e6+1)/2 ># [1] 500000.5 > ># But this surprised me: > >microbenchmark(mean(x),mean(y)) > ># Unit: microseconds ># expr min lq mean median uq max neval cld ># mean(x) 935.389 943.4795 1021.423 954.689 985.122 2065.974 100 a ># mean(y) 3500.262 3581.9530 3814.664 3637.984 3734.598 5866.768 100 b > >### end example ### > >So why is mean() on an ALTREP sequence slower when sum() is faster? > >And more generally, when using sum() on an ALTREP integer sequence, does R >actually use something like n*(n+1)/2 (or generalized to sequences a:b -- >(a+b)*(b-a+1)/2) for computing the sum? If so, why not (it seems) for mean()? > >Best, >Wolfgang ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel