On Sat, Jun 25, 2011 at 11:26 AM, Matthew Brett <matthew.br...@gmail.com> wrote:

> Hi,
>
> On Sat, Jun 25, 2011 at 5:05 PM, Nathaniel Smith <n...@pobox.com> wrote:
> > So obviously there's a lot of interest in this question, but I'm
> > losing track of all the different issues that've been raised in the
> > 150-post thread of doom. I think I'll find this easier if we start by
> > putting aside the questions about implementation and such and focus
> > for now on the *conceptual model* that we want. Maybe I'm not the only
> > one?
> >
> > So as far as I can tell, there are three different ways of thinking
> > about masked/missing data that people have been using in the other
> > thread:
> >
> > 1) Missingness is part of the data. Some data is missing, some isn't,
> > this might change through computation on the data (just like some data
> > might change from a 3 to a 6 when we apply some transformation, NA |
> > True could be True, instead of NA), but we can't just "decide" that
> > some data is no longer missing. It makes no sense to ask what value is
> > "really" there underneath the missingness. And it's critical that we
> > keep track of this through all operations, because otherwise we may
> > silently give incorrect answers -- exactly like it's critical that we
> > keep track of the difference between 3 and 6.
>
> So far I see the difference between 1) and 2) being that you cannot
> unmask. So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?
>

Yes, bingo, you hit it right on the nose. Essentially, 1) could be
considered the "hard mask", while 2) would be the "soft mask". Everything
else is implementation details.

> > 2) All the data exists, at least in some sense, but we don't always
> > want to look at all of it. We lay a mask over our data to view and
> > manipulate only parts of it at a time. We might want to use different
> > masks at different times, mutate the mask as we go, etc. The most
> > important thing is to provide convenient ways to do complex
> > manipulations -- preserve masks through indexing operations, overlay
> > the mask from one array on top of another array, etc. When it comes to
> > other sorts of operations then we'd rather just silently skip the
> > masked values -- we know there are values that are masked, that's the
> > whole point, to work with the unmasked subset of the data, so if sum
> > returned NA then that would just be a stupid hassle.
>
> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA]))
>
> 1) -> np.NA
> 2) -> 0.0
>
> ?
>

Actually, I have always considered this to be a bug. Note that
"np.sum([])" also returns 0.0. I think the reason why it has been
returning zero instead of NaN was because there wasn't a NaN-equivalent
for integers. This is where I think a np.NA could best serve NumPy by
providing a dtype-agnostic way to represent missing or invalid data.

Ben Root
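
For reference, the two outcomes discussed above can be roughly reproduced
with what NumPy already provides. This is only a sketch under stated
assumptions: np.NA is a proposal in this thread, so NaN and numpy.ma stand
in for it here, the variable names are purely illustrative, and the nansum
result assumes a NumPy version in which an all-NaN sum gives 0.0.

```python
import numpy as np

# NaN is used as a stand-in for the proposed np.NA (an assumption for
# illustration; NaN only exists for floating-point dtypes).
data = np.array([np.nan, np.nan])

# Model 1: missingness propagates, so the sum is itself "missing".
propagated = np.sum(data)

# Model 2: missing values are silently skipped, leaving an empty sum.
skipped = np.nansum(data)

# The empty-sum case mentioned above.
empty = np.sum([])

# Model 2 with an explicit mask: only the unmasked values are summed.
masked = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
visible_sum = masked.sum()

# Integer arrays have no NaN-equivalent: storing NaN raises ValueError.
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    int_accepts_nan = True
except ValueError:
    int_accepts_nan = False

print(propagated, skipped, empty, visible_sum, int_accepts_nan)
```

The last few lines are the integer gap described above: without a
dtype-agnostic missing value, an integer array cannot mark an element as
missing at all, short of carrying a separate mask.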
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion