On Mon, Jul 15, 2013 at 5:34 PM, Charles R Harris <[email protected]> wrote:
>
> On Mon, Jul 15, 2013 at 2:44 PM, <[email protected]> wrote:
>>
>> On Mon, Jul 15, 2013 at 4:24 PM, <[email protected]> wrote:
>> > On Mon, Jul 15, 2013 at 2:55 PM, Nathaniel Smith <[email protected]> wrote:
>> >> On Mon, Jul 15, 2013 at 6:29 PM, Charles R Harris
>> >> <[email protected]> wrote:
>> >>> Let me try to summarize. To begin with, the environment of the nan
>> >>> functions is rather special.
>> >>>
>> >>> 1) If the array is not of inexact type, they punt to the non-nan
>> >>> versions.
>> >>> 2) If the array is of inexact type, then out and dtype must be
>> >>> inexact if specified.
>> >>>
>> >>> The second assumption guarantees that NaN can be used in the return
>> >>> values.
>> >>
>> >> The requirement on the 'out' dtype only exists because currently the
>> >> nan functions like to return NaN for things like empty arrays, right?
>> >> If not for that, it could be relaxed? (It's a rather weird
>> >> requirement, since the whole point of these functions is that they
>> >> ignore NaNs, yet they don't always...)
>> >>
>> >>> sum and nansum
>> >>>
>> >>> These should be consistent so that empty sums are 0. This should
>> >>> cover the empty array case, but will change the behaviour of nansum,
>> >>> which currently returns NaN if the array isn't empty but the slice
>> >>> is after NaN removal.
>> >>
>> >> I agree that returning 0 is the right behaviour, but we might need a
>> >> FutureWarning period.
>> >>
>> >>> mean and nanmean
>> >>>
>> >>> In the case of empty arrays, i.e. an empty slice, this leads to 0/0.
>> >>> For Python this is always a zero division error; for NumPy it raises
>> >>> a warning and returns NaN for floats, 0 for integers.
>> >>>
>> >>> Currently mean returns NaN and raises a RuntimeWarning when 0/0
>> >>> occurs. In the special case where dtype=int, the NaN is cast to
>> >>> integer.
>> >>>
>> >>> Option 1
>> >>> 1) mean: raise error on 0/0
>> >>> 2) nanmean: no warning, return NaN
>> >>>
>> >>> Option 2
>> >>> 1) mean: raise warning, return NaN (current behavior)
>> >>> 2) nanmean: no warning, return NaN
>> >>>
>> >>> Option 3
>> >>> 1) mean: raise warning, return NaN (current behavior)
>> >>> 2) nanmean: raise warning, return NaN
>> >>
>> >> I have mixed feelings about the whole np.seterr apparatus, but since
>> >> it exists, shouldn't we use it for consistency? I.e., just do
>> >> whatever numpy is set up to do with 0/0? (Which I think means warn
>> >> and return NaN by default, but this can be changed.)
>> >>
>> >>> var, std, nanvar, nanstd
>> >>>
>> >>> 1) If ddof > axis (or axes) size, raise error; probably a program
>> >>> bug.
>> >>> 2) If ddof = 0, then whatever is the case for mean, nanmean.
>> >>>
>> >>> For nanvar, nanstd it is possible that some slices are good, some
>> >>> bad, so
>> >>>
>> >>> Option 1
>> >>> 1) if n - ddof <= 0 for a slice, raise warning, return NaN for slice
>> >>>
>> >>> Option 2
>> >>> 1) if n - ddof <= 0 for a slice, don't warn, return NaN for slice
>> >>
>> >> I don't really have any intuition for these ddof cases. Just raising
>> >> an error on negative effective dof is pretty defensible and might be
>> >> the safest -- it's easy to turn an error into something sensible
>> >> later if people come up with use cases...
>> >
>> > Related: why does reduceat not have empty slices?
>> >
>> >>>> np.add.reduceat(np.arange(8), [0, 4, 5, 7, 7])
>> > array([ 6,  4, 11,  7,  7])
>> >
>> > I'm in favor of returning NaNs instead of raising exceptions, except
>> > if the return type is int and we cannot cast NaN to int.
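To make the np.seterr suggestion above concrete, here is a short sketch (assuming modern NumPy semantics; np.errstate is the context-manager form of seterr). Under the default error state, the 0/0 inside an empty mean emits a RuntimeWarning and yields NaN; with errstate(invalid='raise') the same operation becomes a FloatingPointError, so a user can pick Option 1 behavior without a new API:

```python
import warnings
import numpy as np

arr = np.array([])  # empty slice: mean internally computes 0.0 / 0

# Default error state: 0/0 warns (RuntimeWarning) and returns NaN.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    result = np.mean(arr)
print(np.isnan(result))  # True

# Opting in to an error instead, via the errstate context manager.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    try:
        with np.errstate(invalid='raise'):
            np.mean(arr)
    except FloatingPointError as exc:
        print("raised FloatingPointError:", exc)
```

The same errstate switch would apply uniformly to mean, var and std, which is the consistency argument being made.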
>> >
>> > If we get functions into numpy that know how to handle NaNs, then it
>> > would be useful to get the NaNs, so we can work with them.
>> >
>> > Some cases where this might come in handy are when we iterate over
>> > slices of an array that define groups or category levels with
>> > possibly empty groups *)
>> >
>> >>>> idx = np.repeat(np.array([0, 1, 2, 3]), [4, 3, 0, 2])
>> >>>> x = np.arange(9)
>> >>>> [x[idx == ii].mean() for ii in range(4)]
>> > [1.5, 5.0, nan, 7.5]
>> >
>> > instead of
>> >>>> [x[idx == ii].mean() for ii in range(4) if (idx == ii).sum() > 0]
>> > [1.5, 5.0, 7.5]
>> >
>> > Same for var: I wouldn't have to check that the size is larger than
>> > the ddof (whatever that is in the specific case).
>> >
>> > *) Groups could be empty because they were defined for a larger
>> > dataset or as a union of different datasets.
>>
>> Background:
>>
>> I wrote several robust ANOVA versions a few weeks ago that were
>> essentially list comprehensions as above. However, I didn't allow NaNs
>> and didn't check for minimum size.
>> Allowing empty groups to return NaN would mainly be a convenience,
>> since I would need to check the group size only once.
>>
>> ddof: tests for proportions have ddof=0, the regular t-test ddof=1,
>> and tests of correlation ddof=2, IIRC. So we would need to check for
>> the corresponding minimum size such that n - ddof > 0.
>>
>> "Negative effective dof" doesn't exist; that's np.maximum(n - ddof, 0),
>> which is always non-negative but might result in a zero-division
>> error. :)
>>
>> I don't think making anything conditional on ddof > 0 is useful.
>
> So how would you want it?
>
> To summarize the problem areas:
>
> 1) What is the sum of an empty slice? NaN or 0?

0, as it is now for sum (including 0 for nansum with no valid entries).
> 2) What is the mean of an empty slice? NaN, NaN and warn, or error?
> 3) What if n - ddof < 0 for a slice? NaN, NaN and warn, or error?
> 4) What if n - ddof = 0 for a slice? NaN, NaN and warn, or error?
>
> I'm tending to NaN and warn for 2 -- 3, because, as Nathaniel notes,
> the warning can be turned into an error by the user. The errstate
> context manager would be good for that.

Yes, that's what I would prefer also: NaN and zero-division error for
2-4, including mean, var and std, for both nan and non-nan functions --
with the extra argument that 3) and 4) are the same case (except in
polyfit :)

Josef

>
> Chuck
>
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
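For reference, a sketch of how Chuck's four summary cases behave in NumPy releases after this discussion, which largely settled on 0 for empty sums and NaN-plus-RuntimeWarning for cases 2-4 (note that older releases returned NaN from nansum when no valid entries remained, so the nansum line below is version-dependent):

```python
import warnings
import numpy as np

# 1) Empty sums are 0 for sum; later releases also return 0 for
#    nansum when every entry is NaN (older releases returned NaN).
empty_sum = np.sum(np.array([]))             # 0.0
all_nan_sum = np.nansum(np.array([np.nan]))

# 2) Mean of an empty slice, and 3)/4) n - ddof <= 0 for a slice:
#    NaN plus a RuntimeWarning under the default error state.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    m = np.mean(np.array([]))                # 0/0 -> NaN, warns
    v = np.var(np.array([5.0]), ddof=1)      # n - ddof = 0 -> NaN, warns

print(empty_sum, all_nan_sum)
print(np.isnan(m), np.isnan(v), len(caught) >= 2)
```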
