So obviously there's a lot of interest in this question, but I'm losing track of all the different issues that've been raised in the 150-post thread of doom. I think I'll find this easier if we start by putting aside the questions about implementation and such, and focus for now on the *conceptual model* that we want. Maybe I'm not the only one?
As far as I can tell, there are three different ways of thinking about masked/missing data that people have been using in the other thread:

1) Missingness is part of the data. Some data is missing, some isn't; this might change through computation on the data (just as some data might change from a 3 to a 6 when we apply a transformation, NA | True could become True instead of NA), but we can't just "decide" that some data is no longer missing. It makes no sense to ask what value is "really" there underneath the missingness. And it's critical that we keep track of this through all operations, because otherwise we may silently give incorrect answers -- exactly as it's critical that we keep track of the difference between 3 and 6.

2) All the data exists, at least in some sense, but we don't always want to look at all of it. We lay a mask over our data to view and manipulate only parts of it at a time. We might want to use different masks at different times, mutate the mask as we go, etc. The most important thing is to provide convenient ways to do complex manipulations: preserve masks through indexing operations, overlay the mask from one array on top of another array, and so on. For other sorts of operations we'd rather just silently skip the masked values -- we know there are values that are masked; the whole point is to work with the unmasked subset of the data, so if sum returned NA, that would just be a pointless hassle.

3) The "all things to all people" approach: implement every feature implied by either (1) or (2), and switch back and forth between these conceptual frameworks whenever necessary to make sense of the resulting code.

The advantage of deciding on our model up front is that it makes a lot of other questions easier. E.g., someone asked in the other thread whether, after setting an array element to NA, it would be possible to get back the original value. If we follow (1), the answer is obviously "no"; if we follow (2), the answer is obviously "yes"; and if we follow (3), the answer is obviously "yes... probably... well, maybe you'd better check the docs?". (There's a small code sketch of this contrast at the end of this message.)

My personal opinions on these are:

(1): This is a real problem I face, and there isn't any good solution now. Support for this in numpy would be awesome.

(2): This feels more like a convenience feature to me; we already have lots of ways to work with subsets of data. I probably wouldn't bother using it, but that's fine -- I don't use np.matrix either, and some people like it.

(3): Well, it's a bit of a mess, but I guess it might be better than nothing?

But that's just my opinion. I'm wondering if we can get any consensus on which of these we actually *want* (or maybe we want some fourth option!), and *then* we can try to figure out the best way to get there. Pretty much any implementation strategy we've talked about could work for any of these, but it's hard to decide between them if we don't even know what we're trying to do...

-- Nathaniel
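P.S. To make the (1)-vs-(2) contrast concrete, here's a toy sketch. The _NAType class is pure illustration -- made-up code, not any proposed numpy API -- while the numpy.ma calls at the bottom are the existing masked-array interface, which already behaves roughly like model (2):

import numpy as np

# Model (1), sketched with a toy scalar sentinel (hypothetical, for
# illustration only): the NA *is* the data, so it propagates through
# operations and there is nothing "underneath" it to recover.
class _NAType:
    """Toy R-style NA: propagates through arithmetic, but NA | True is
    True, because the result is True no matter what the missing value
    would have been."""
    def __add__(self, other):
        return self
    __radd__ = __add__
    def __or__(self, other):
        return True if other is True else self
    __ror__ = __or__
    def __repr__(self):
        return "NA"

NA = _NAType()

print(NA + 3)     # NA   -- missingness survives the transformation
print(NA | True)  # True -- the answer is known even though one input isn't

# Model (2), using the existing numpy.ma API: the mask is a removable
# overlay, and reductions silently skip the hidden values.
a = np.ma.masked_array([1.0, 2.0, 3.0], mask=[False, True, False])
print(a.sum())     # 4.0 -- sum quietly ignores the masked 2.0
a.mask[1] = False  # peel the mask back...
print(a[1])        # 2.0 -- ...and the "original value" is still there

Note how under the toy (1)-semantics there's simply no "underneath" to get back, while under (2) unmasking trivially recovers the old value -- which is exactly why the question from the other thread has a different obvious answer in each model.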