On Mon, Jul 15, 2013 at 3:57 PM, <[email protected]> wrote:

> On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris
> <[email protected]> wrote:
> >
> >
> > On Mon, Jul 15, 2013 at 2:44 PM, <[email protected]> wrote:
> >>
> >> On Mon, Jul 15, 2013 at 4:24 PM, <[email protected]> wrote:
> >> > On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <[email protected]> wrote:
> >> >> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
> >> >> <[email protected]> wrote:
> >> >>> Let me try to summarize. To begin with, the environment of the nan
> >> >>> functions is rather special.
> >> >>>
> >> >>> 1) if the array is not of inexact type, they punt to the non-nan
> >> >>> versions.
> >> >>> 2) if the array is of inexact type, then out and dtype must be
> >> >>> inexact if specified
> >> >>>
> >> >>> The second assumption guarantees that NaN can be used in the return
> >> >>> values.
> >> >>
> >> >> The requirement on the 'out' dtype only exists because currently the
> >> >> nan functions like to return nan for things like empty arrays, right?
> >> >> If not for that, it could be relaxed? (it's a rather weird
> >> >> requirement, since the whole point of these functions is that they
> >> >> ignore nans, yet they don't always...)
> >> >>
> >> >>> sum and nansum
> >> >>>
> >> >>> These should be consistent so that empty sums are 0. This should
> >> >>> cover the empty array case, but will change the behaviour of nansum
> >> >>> which currently returns NaN if the array isn't empty but the slice
> >> >>> is after NaN removal.
> >> >>
> >> >> I agree that returning 0 is the right behaviour, but we might need a
> >> >> FutureWarning period.
> >> >>
> >> >>> mean and nanmean
> >> >>>
> >> >>> In the case of empty arrays, an empty slice, this leads to 0/0. For
> >> >>> Python this is always a zero division error, for Numpy this raises a
> >> >>> warning and returns NaN for floats, 0 for integers.
> >> >>>
> >> >>> Currently mean returns NaN and raises a RuntimeWarning when 0/0
> >> >>> occurs. In the special case where dtype=int, the NaN is cast to
> >> >>> integer.
> >> >>>
> >> >>> Option1
> >> >>> 1) mean raise error on 0/0
> >> >>> 2) nanmean no warning, return NaN
> >> >>>
> >> >>> Option2
> >> >>> 1) mean raise warning, return NaN (current behavior)
> >> >>> 2) nanmean no warning, return NaN
> >> >>>
> >> >>> Option3
> >> >>> 1) mean raise warning, return NaN (current behavior)
> >> >>> 2) nanmean raise warning, return NaN
> >> >>
> >> >> I have mixed feelings about the whole np.seterr apparatus, but since
> >> >> it exists, shouldn't we use it for consistency? I.e., just do whatever
> >> >> numpy is set up to do with 0/0? (Which I think means, warn and return
> >> >> NaN by default, but this can be changed.)
> >> >>
> >> >>> var, std, nanvar, nanstd
> >> >>>
> >> >>> 1) if ddof > axis(axes) size, raise error, probably a program bug.
> >> >>> 2) If ddof=0, then whatever is the case for mean, nanmean
> >> >>>
> >> >>> For nanvar, nanstd it is possible that some slices are good, some
> >> >>> bad, so
> >> >>>
> >> >>> option1
> >> >>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
> >> >>>
> >> >>> option2
> >> >>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
> >> >>
> >> >> I don't really have any intuition for these ddof cases. Just raising
> >> >> an error on negative effective dof is pretty defensible and might be
> >> >> the safest -- it's easy to turn an error into something sensible
> >> >> later if people come up with use cases...
> >> >
> >> > related: why does reduceat not have empty slices?
> >> >
> >> > >>> np.add.reduceat(np.arange(8),[0,4, 5, 7,7])
> >> > array([ 6, 4, 11, 7, 7])
> >> >
> >> >
> >> > I'm in favor of returning nans instead of raising exceptions, except
> >> > if the return type is int and we cannot cast nan to int.
> >> >
> >> > If we get functions into numpy that know how to handle nans, then it
> >> > would be useful to get the nans, so we can work with them
> >> >
> >> > Some cases where this might come in handy are when we iterate over
> >> > slices of an array that define groups or category levels with possible
> >> > empty groups *)
> >> >
> >> > >>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
> >> > >>> x = np.arange(9)
> >> > >>> [x[idx==ii].mean() for ii in range(4)]
> >> > [1.5, 5.0, nan, 7.5]
> >> >
> >> > instead of
> >> > >>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
> >> > [1.5, 5.0, 7.5]
> >> >
> >> > same for var, I wouldn't have to check that the size is larger than
> >> > the ddof (whatever that is in the specific case)
> >> >
> >> > *) groups could be empty because they were defined for a larger
> >> > dataset or as a union of different datasets
> >>
> >> background:
> >>
> >> I wrote several robust anova versions a few weeks ago, that were
> >> essentially list comprehensions as above. However, I didn't allow nans
> >> and didn't check for minimum size.
> >> Allowing for empty groups to return nan would mainly be a convenience,
> >> since I need to check the group size only once.
> >>
> >> ddof: tests for proportions have ddof=0, for regular t-test ddof=1,
> >> for tests of correlation ddof=2 IIRC
> >> so we would need to check for the corresponding minimum size that
> >> n-ddof>0
> >>
> >> "negative effective dof" doesn't exist, that's np.maximum(n - ddof, 0)
> >> which is always non-negative but might result in a zero-division
> >> error. :)
> >>
> >> I don't think making anything conditional on ddof>0 is useful.
> >>
> >
> > So how would you want it?
> >
> > To summarize the problem areas:
> >
> > 1) What is the sum of an empty slice? NaN or 0?
>
> 0 as it is now for sum, (including 0 for nansum with no valid entries).
>
> > 2) What is mean of empty slice? NaN, NaN and warn, or error?
> > 3) What if n - ddof < 0 for slice? NaN, NaN and warn, or error?
> > 4) What if n - ddof = 0 for slice? NaN, NaN and warn, or error?
> >
> > I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes, the
> > warning can be turned into an error by the user. The errstate context
> > manager would be good for that.
>
> Yes, that's what I would prefer also, NaN and ZeroDivisionError, for
> 2-4, including mean, var and std, for both nan and non-nan functions.
>
> with the extra argument that 3) and 4) are the same case (except in
> polyfit :)
>
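For concreteness, here is a rough sketch of what "NaN and warn, with errstate to turn it into an error" could look like from the user's side. It uses a plain 0/0 array division rather than mean or var themselves, since how those functions warn is exactly what is under discussion; the variable names are made up for the example.

```python
import numpy as np

zeros = np.array([0.0])

# 1) The sum of an empty slice is 0.
empty_sum = np.sum(np.array([]))

# 2)-4) 0/0 on floats is an "invalid" floating-point operation.
# With the condition silenced, the result is simply NaN.
with np.errstate(invalid="ignore"):
    quiet = zeros / zeros

# The user can promote the same condition to an exception instead.
try:
    with np.errstate(invalid="raise"):
        zeros / zeros
    raised = False
except FloatingPointError:
    raised = True
```

Under this approach the library only has to pick a default; callers who consider an empty slice a bug can opt into the exception locally.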
One extra possibility with the nan functions could be a new keyword, error,
which would turn warnings into errors. But that might be a bit much.

Chuck
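To make the keyword idea concrete, a minimal pure-Python sketch follows. Nothing here is an existing NumPy API: the function name nanmean_sketch, the error keyword, and the choice of ValueError as the exception type are all assumptions for illustration only.

```python
import math
import warnings


def nanmean_sketch(values, error=False):
    """Mean of the non-NaN entries; `error` is a hypothetical keyword."""
    kept = [v for v in values if not math.isnan(v)]
    if not kept:
        message = "Mean of empty slice"
        if error:
            # error=True: raise instead of warning (exception type assumed).
            raise ValueError(message)
        # Default: warn and return NaN.
        warnings.warn(message, RuntimeWarning, stacklevel=2)
        return math.nan
    return sum(kept) / len(kept)


# Default behaviour: a RuntimeWarning is issued and NaN comes back.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = nanmean_sketch([math.nan, math.nan])
```

Calling nanmean_sketch([math.nan], error=True) raises instead of warning, which is the behaviour the keyword would add on top of what the warnings filter and errstate already allow.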
_______________________________________________
NumPy-Discussion mailing list
[email protected]
http://mail.scipy.org/mailman/listinfo/numpy-discussion
