I don't agree. The problem is that I expect `mean` to do something reasonable. The documentation mentions that the results can be "inaccurate", which is a huge understatement: the results can be utterly wrong. That is not reasonable. At the very least, a warning should be issued in cases where the dtype might not be appropriate.
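For concreteness, here is a minimal reproduction of the kind of failure I mean (the array is my own illustration, not the exact data from earlier in this thread):

```python
import numpy as np

# A float32 accumulator cannot grow past 2**24 by repeatedly adding 1.0,
# because 16777216.0 + 1.0 == 16777216.0 in float32: a naive running sum stalls.
X = np.ones(10**8, dtype=np.float32)

print(X.mean())                   # can come out near 0.17 instead of 1.0;
                                  # the exact value depends on the NumPy
                                  # version and its summation order
print(X.mean(dtype=np.float64))   # 1.0 -- accumulate in double precision
```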

One cannot predict what input sizes a program will be run with once it's in use (especially if it stays in use for several years). I'd argue this is true for pretty much all code except quick one-off scripts. Thus one would have to use `dtype=np.float64` everywhere, at which point it seems obvious that it should have been the default in the first place. The other alternative would be to extend np.mean with some logic that internally figures out the right thing to do (which I don't think is too hard, since the size of the reduction is known when mean is called).

Your example with the short axis is something that can be checked for. I agree that the logic could become a bit hairy, but not too much: if we are going to sum more than N values (where N could be determined at compile time, or simply be some constant), we upcast unless the user explicitly specified a dtype. Of course, this would incur an increase in memory. However, I'd argue that it's not even a large increase: if you can fit the matrix in memory, then allocating a row/column of float64 instead of float32 should be doable as well. And I'd much rather get an OutOfMemory exception than silently continue my calculations with useless/wrong results.
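A minimal sketch of that heuristic, written as a Python-level wrapper (the name, the threshold value, and the int-only axis handling are my own simplifications; a real implementation would live inside the reduction machinery):

```python
import numpy as np

UPCAST_THRESHOLD = 2**20  # the constant N; the exact value is up for debate

def mean_with_upcast(a, axis=None, dtype=None, **kwargs):
    """Hypothetical np.mean wrapper: upcast long float32 reductions."""
    a = np.asanyarray(a)
    if dtype is None and a.dtype == np.float32:
        # Number of input values folded into each output element
        # (this sketch only handles axis=None or a single int axis).
        n = a.size if axis is None else a.shape[axis]
        if n > UPCAST_THRESHOLD:
            dtype = np.float64  # accumulate (and return) in double precision
    return a.mean(axis=axis, dtype=dtype, **kwargs)
```

With this rule, a short-axis reduction over a huge array keeps its float32 accumulator, while a full reduction over millions of elements gets upcast.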

Cheers

Thomas



On 2014-07-24 11:59, Eelco Hoogendoorn wrote:
Arguably, this isn't a problem with numpy, but with programmers being trained to think of floating-point numbers as 'real' numbers, rather than as a finite number of states with a funny distribution over the number line. np.mean isn't broken; your understanding of floating-point numbers is.
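To make that 'funny distribution' concrete (a small illustration): above 2**24, consecutive float32 values are two apart, so adding one is a no-op.

```python
import numpy as np

# Above 2**24, the gap between adjacent float32 values is 2.0,
# so adding 1.0 rounds straight back to the same number.
print(np.spacing(np.float32(2.0**24)))                               # 2.0
print(np.float32(2.0**24) + np.float32(1.0) == np.float32(2.0**24))  # True
```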

What you appear to wish for is a silent upcasting of the accumulated result. This is often performed in reducing operations, but I can imagine it runs into trouble for nd-arrays. After all, if I have a huge array that I want to reduce over a very short axis, upcasting might be very undesirable; it wouldn't buy me any extra precision, but it would increase memory use from 'huge' to 'even more huge'.
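To put rough numbers on that (the shape here is illustrative):

```python
import numpy as np

# Reducing a (10**7, 3) array over axis=1 folds only 3 values per output
# element, so a float64 accumulator buys little precision -- but the
# 10**7-element result doubles in size (~40 MB float32 vs ~80 MB float64).
X = np.ones((10**7, 3), dtype=np.float32)
m32 = X.mean(axis=1)                    # float32 result, ~40 MB
m64 = X.mean(axis=1, dtype=np.float64)  # float64 result, ~80 MB
print(m32.nbytes, m64.nbytes)
```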

np.mean has a kwarg that allows you to explicitly choose the dtype of the accumulator: `X.mean(dtype=np.float64) == 1.0`. Personally, I have a distaste for implicit behavior, unless the rule is simple and there really can be no downsides; which, I would argue, doesn't apply here. Perhaps when reducing an array completely to a single value there is no harm in upcasting to the maximum machine precision; but that becomes a rather complex rule which would work out differently on different machines. It's better to be confronted with the limitations of floating-point numbers earlier, rather than later, when you want to distribute your work and run into subtle bugs on other people's computers.
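That machine dependence is easy to see with np.longdouble, which is 80-bit extended precision on typical x86 Linux builds but just an alias for float64 on Windows:

```python
import numpy as np

# "Maximum machine precision" is not one rule: np.longdouble differs by
# platform and build, so upcasting to it gives platform-dependent results.
print(np.finfo(np.longdouble))

x = np.ones(10, dtype=np.float32)
print(x.mean(dtype=np.longdouble))
```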

