On Mon, Jul 15, 2013 at 4:24 PM,  <[email protected]> wrote:
> On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <[email protected]> wrote:
>> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
>> <[email protected]> wrote:
>>> Let me try to summarize. To begin with, the environment of the nan
>>> functions is rather special.
>>>
>>> 1) if the array is not of inexact type, they punt to the non-nan
>>> versions.
>>> 2) if the array is of inexact type, then out and dtype must be
>>> inexact if specified.
>>>
>>> The second assumption guarantees that NaN can be used in the return
>>> values.
>>
>> The requirement on the 'out' dtype only exists because currently the
>> nan functions like to return nan for things like empty arrays, right?
>> If not for that, it could be relaxed? (It's a rather weird
>> requirement, since the whole point of these functions is that they
>> ignore nans, yet they don't always...)
>>
>>> sum and nansum
>>>
>>> These should be consistent so that empty sums are 0. This should
>>> cover the empty array case, but will change the behaviour of nansum,
>>> which currently returns NaN if the array isn't empty but the slice
>>> is empty after NaN removal.
>>
>> I agree that returning 0 is the right behaviour, but we might need a
>> FutureWarning period.
>>
>>> mean and nanmean
>>>
>>> In the case of empty arrays, or an empty slice, this leads to 0/0.
>>> For Python this is always a zero division error; for Numpy it raises
>>> a warning and returns NaN for floats, 0 for integers.
>>>
>>> Currently mean returns NaN and raises a RuntimeWarning when 0/0
>>> occurs. In the special case where dtype=int, the NaN is cast to
>>> integer.
>>>
>>> Option 1
>>> 1) mean: raise error on 0/0
>>> 2) nanmean: no warning, return NaN
>>>
>>> Option 2
>>> 1) mean: raise warning, return NaN (current behavior)
>>> 2) nanmean: no warning, return NaN
>>>
>>> Option 3
>>> 1) mean: raise warning, return NaN (current behavior)
>>> 2) nanmean: raise warning, return NaN
>>
>> I have mixed feelings about the whole np.seterr apparatus, but since
>> it exists, shouldn't we use it for consistency? I.e., just do
>> whatever numpy is set up to do with 0/0? (Which I think means warn
>> and return NaN by default, but this can be changed.)
>>
>>> var, std, nanvar, nanstd
>>>
>>> 1) if ddof > axis (axes) size, raise an error; it is probably a
>>> program bug.
>>> 2) if ddof = 0, then whatever is the case for mean, nanmean.
>>>
>>> For nanvar, nanstd it is possible that some slices are good and
>>> some bad, so:
>>>
>>> Option 1
>>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
>>>
>>> Option 2
>>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
>>
>> I don't really have any intuition for these ddof cases. Just raising
>> an error on negative effective dof is pretty defensible and might be
>> the safest -- it's easy to turn an error into something sensible
>> later if people come up with use cases...
>
> Related: why does reduceat not have empty slices?
>
>>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
> array([ 6, 4, 11, 7, 7])
>
> I'm in favor of returning nans instead of raising exceptions, except
> if the return type is int and we cannot cast nan to int.
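For reference, a minimal sketch of what "do whatever numpy is set up to
do" means for a float 0/0 -- this only demonstrates the
np.seterr/np.errstate machinery, not numpy's actual mean()
implementation:

    import numpy as np

    empty = np.array([], dtype=float)

    # Default error state: 0/0 emits a RuntimeWarning and yields nan.
    with np.errstate(invalid='warn'):
        print(empty.sum() / empty.size)      # nan

    # Callers who want a hard failure can opt in themselves:
    with np.errstate(invalid='raise'):
        try:
            empty.sum() / empty.size
        except FloatingPointError as err:
            print('0/0 raised:', err)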
>
> If we get functions into numpy that know how to handle nans, then it
> would be useful to get the nans, so we can work with them.
>
> Some cases where this might come in handy are when we iterate over
> slices of an array that define groups or category levels, with
> possibly empty groups *):
>
>>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>>>> x = np.arange(9)
>>>> [x[idx==ii].mean() for ii in range(4)]
> [1.5, 5.0, nan, 7.5]
>
> instead of
>
>>>> [x[idx==ii].mean() for ii in range(4) if (idx==ii).sum()>0]
> [1.5, 5.0, 7.5]
>
> The same goes for var: I wouldn't have to check that the size is
> larger than the ddof (whatever that is in the specific case).
>
> *) groups could be empty because they were defined for a larger
> dataset or as a union of different datasets
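As a sketch of the convenience this would buy (nothing here beyond
np.bincount is an existing numpy API), the same group means can be
computed without any explicit size check by letting 0/0 produce the
NaN for the empty group:

    import numpy as np

    idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
    x = np.arange(9, dtype=float)

    counts = np.bincount(idx, minlength=4)            # [4, 3, 0, 2]
    sums = np.bincount(idx, weights=x, minlength=4)   # per-group sums
    with np.errstate(invalid='ignore'):
        means = sums / counts                         # group 2: 0/0 -> nan
    print(means)                                      # [ 1.5  5.  nan  7.5]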
background: I wrote several robust anova versions a few weeks ago that
were essentially list comprehensions as above. However, I didn't allow
nans and didn't check for minimum size.

Allowing empty groups to return nan would mainly be a convenience,
since I would need to check the group size only once.

ddof: tests for proportions have ddof=0, the regular t-test has ddof=1,
and tests of correlation have ddof=2 IIRC, so we would need to check
for the corresponding minimum size, i.e. that n - ddof > 0.

"negative effective dof" doesn't exist; that's np.maximum(n - ddof, 0),
which is always non-negative but might result in a zero-division
error. :)

I don't think making anything conditional on ddof > 0 is useful.

Josef

>
> PS: I used mean() above and not var() because
>
>>>> np.__version__
> '1.5.1'
>>>> np.mean([])
> nan
>>>> np.var([])
> 0.0
>
> Josef
>
>>
>> -n
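To make the n - ddof minimum-size check above concrete, here is an
illustrative per-slice helper (nanvar_slice is a made-up name, not a
numpy function) implementing Option 2 from the summary: ignore NaNs,
and return NaN without a warning when n - ddof <= 0:

    import numpy as np

    def nanvar_slice(a, ddof=0):
        # Variance of one slice, ignoring NaNs; NaN (no warning)
        # when the count of good values minus ddof is not positive.
        a = np.asarray(a, dtype=float)
        good = a[~np.isnan(a)]
        if good.size - ddof <= 0:
            return np.nan
        return good.var(ddof=ddof)

    print(nanvar_slice([1.0, 2.0, np.nan], ddof=1))   # 0.5
    print(nanvar_slice([1.0, np.nan], ddof=1))        # nan (n - ddof == 0)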
