[Rd] What algorithm is R using to calculate mean?

2013-07-26 Thread Zach Harrington
I am curious to know what algorithm R's mean function uses. Is there 
some reference to the numerical properties of this algorithm?


I found the following C code in summary.c:do_summary():
case REALSXP:
PROTECT(ans = allocVector(REALSXP, 1));
for (i = 0; i < n; i++) s += REAL(x)[i];
s /= n;
if(R_FINITE((double)s)) {
for (i = 0; i < n; i++) t += (REAL(x)[i] - s);
s += t/n;
}
REAL(ans)[0] = s;
break;

It seems to do a straight up mean:
for (i = 0; i < n; i++) s += REAL(x)[i];
s /= n;

Then it adds what i assume is a numerical correction which seems to be 
the mean difference from the mean of the data:

for (i = 0; i < n; i++) t += (REAL(x)[i] - s);
s += t/n;

I haven't been able to track this algorithm down anywhere (mean is not a 
great search term).


Any help would be much appreciated,

Zach Harrington

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] What algorithm is R using to calculate mean?

2013-07-26 Thread Zach Harrington
I don't believe this is Kahan summation (which I looked at while trying 
to track this down). Kahan summation seems to work by attempting to 
capture the error in each accumulation to the sum, then adding it back 
to the next element before it is accumulated to the sum.

I believe the R algorithm works as follows.

The first standard calculation of the mean is effectively an estimate of 
the algebraic mean, due to floating point errors (which gets worse the 
further the sum gets away from the elements being accumulated).

The second pass sums the differences of the elements from the estimated 
mean. There should be no net difference as the values on either side of 
the mean should balance out, but we have floating point error. The 
differences from the mean still have the potential for error, but these 
should be smaller than the worst potential difference between a element 
and the accumulating sum (at least the estimated mean lives somewhere 
within the range of values, while the summation may escape it).  
Dividing by N gives you the average difference from the mean, which you 
then use to nudge your initial estimate closer to the true mean. You 
could repeat this to get closer and closer, but at some point the 
floating point error in calculating the average difference from the mean 
will defeat you. I guess one pass is close enough.

All this was explained to me by my wife. Why didn't I think to ask her 
sooner? Choose wisely.





On 2013-07-26 8:58 AM, Ravi Varadhan wrote:
> This uses the idea of Kahan's summation, if I am not mistaken.
>
> http://en.wikipedia.org/wiki/Kahan_summation_algorithm
>
> Ravi
> 
> From: r-devel-boun...@r-project.org [r-devel-boun...@r-project.org] on behalf 
> of Joshua Ulrich [josh.m.ulr...@gmail.com]
> Sent: Friday, July 26, 2013 7:12 AM
> To: Zach Harrington
> Cc: r-devel@r-project.org List
> Subject: Re: [Rd] What algorithm is R using to calculate mean?
>
> This was also asked on StackOverflow:
> http://stackoverflow.com/q/17866149/271616.  Here is the answer I
> posted:
>
> This appears to be the updating method of West, 1979 [1] and it was
> implemented in R-2.3.0 in response to PR#1228 [2].
>
> I'm not positive this is the correct algorithm, since it was suggested
> by Martin Maechler but implemented by Brian Ripley. I couldn't find a
> reference in the source code or version control logs that listed the
> actual algorithm used. It was implemented in cov.c in revision 37389
> and in summary.c in revision 37393.
>
> [1] http://dl.acm.org/citation.cfm?doid=359146.359153
> [2] https://bugs.r-project.org/bugzilla3/show_bug.cgi?id=1228
>
> Best,
> --
> Joshua Ulrich  |  about.me/joshuaulrich
> FOSS Trading  |  www.fosstrading.com
>
>
> On Thu, Jul 25, 2013 at 2:44 PM, Zach Harrington
>  wrote:
>> I am curious to know what algorithm R's mean function uses. Is there some
>> reference to the numerical properties of this algorithm?
>>
>> I found the following C code in summary.c:do_summary():
>> case REALSXP:
>>  PROTECT(ans = allocVector(REALSXP, 1));
>>  for (i = 0; i < n; i++) s += REAL(x)[i];
>>  s /= n;
>>  if(R_FINITE((double)s)) {
>>  for (i = 0; i < n; i++) t += (REAL(x)[i] - s);
>>  s += t/n;
>>  }
>>  REAL(ans)[0] = s;
>>  break;
>>
>> It seems to do a straight up mean:
>> for (i = 0; i < n; i++) s += REAL(x)[i];
>> s /= n;
>>
>> Then it adds what i assume is a numerical correction which seems to be the
>> mean difference from the mean of the data:
>> for (i = 0; i < n; i++) t += (REAL(x)[i] - s);
>> s += t/n;
>>
>> I haven't been able to track this algorithm down anywhere (mean is not a
>> great search term).
>>
>> Any help would be much appreciated,
>>
>> Zach Harrington
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel