On Sat, Jun 25, 2011 at 10:05 AM, Nathaniel Smith <n...@pobox.com> wrote:
> So obviously there's a lot of interest in this question, but I'm
> losing track of all the different issues that've been raised in the
> 150-post thread of doom. I think I'll find this easier if we start by
> putting aside the questions about implementation and such and focus
> for now on the *conceptual model* that we want. Maybe I'm not the only
> one?
>
> So as far as I can tell, there are three different ways of thinking
> about masked/missing data that people have been using in the other
> thread:
>
> 1) Missingness is part of the data. Some data is missing, some isn't,
> this might change through computation on the data (just like some data
> might change from a 3 to a 6 when we apply some transformation -- NA |
> True could be True, instead of NA), but we can't just "decide" that
> some data is no longer missing. It makes no sense to ask what value is
> "really" there underneath the missingness. And it's critical that we
> keep track of this through all operations, because otherwise we may
> silently give incorrect answers -- exactly like it's critical that we
> keep track of the difference between 3 and 6.
>
> 2) All the data exists, at least in some sense, but we don't always
> want to look at all of it. We lay a mask over our data to view and
> manipulate only parts of it at a time. We might want to use different
> masks at different times, mutate the mask as we go, etc. The most
> important thing is to provide convenient ways to do complex
> manipulations -- preserve masks through indexing operations, overlay
> the mask from one array on top of another array, etc. When it comes to
> other sorts of operations then we'd rather just silently skip the
> masked values -- we know there are values that are masked, that's the
> whole point, to work with the unmasked subset of the data, so if sum
> returned NA then that would just be a stupid hassle.
>
> 3) The "all things to all people" approach: implement every feature
> implied by either (1) or (2), and switch back and forth between these
> conceptual frameworks whenever necessary to make sense of the
> resulting code.
>
> The advantage of deciding up front what our model is is that it makes
> a lot of other questions easier. E.g., someone asked in the other
> thread whether, after setting an array element to NA, it would be
> possible to get back the original value. If we follow (1), the answer
> is obviously "no", if we follow (2), the answer is obviously "yes",
> and if we follow (3), the answer is obviously "yes, probably, well,
> maybe you better check the docs?".
>
> My personal opinions on these are:
> (1): This is a real problem I face, and there isn't any good solution
> now. Support for this in numpy would be awesome.
> (2): This feels more like a convenience feature to me; we already have
> lots of ways to work with subsets of data. I probably wouldn't bother
> using it, but that's fine -- I don't use np.matrix either, but some
> people like it.
> (3): Well, it's a bit of a mess, but I guess it might be better than
> nothing?
>
> But that's just my opinion. I'm wondering if we can get any consensus
> on which of these we actually *want* (or maybe we want some fourth
> option!), and *then* we can try to figure out the best way to get
> there? Pretty much any implementation strategy we've talked about
> could work for any of these, but it's hard to decide between them if we
> don't even know what we're trying to do...

I go for 3 ;) And I think that is where we are heading.
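To make the difference concrete, today's numpy.ma already behaves roughly like 2): reductions skip the masked values, the mask survives indexing, and the original value is still sitting underneath the mask, so you can get it back by unmasking. (numpy.ma here is only a rough stand-in for illustration, not the implementation being proposed.) Something like:

    import numpy as np

    a = np.ma.array([1.0, 2.0, 3.0], mask=[False, True, False])

    a.sum()               # 4.0 -- the masked element is silently skipped;
                          # under 1) you would want NA back instead
    a[1:]                 # [--, 3.0] -- the mask is preserved through indexing
    a[0] = np.ma.masked   # "set an element to NA"
    a.mask[:] = False     # ...then unmask everything again
    a[0], a[1]            # (1.0, 2.0) -- the original values were still there

That last step is exactly the "can I get the original value back?" question: numpy.ma answers "yes" (model 2), while under model 1 the only honest answer is "no".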
By default, the proposed masked array operations look like 1), but by taking views one can get 2). I think the crucial aspect here is the use of views, which both saves on storage and fits with the current numpy concept of views.
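For what it's worth, the storage point can also be seen with numpy.ma (again, only as a stand-in): the mask is laid over a shared data buffer, so several masks over the same array cost nothing extra in data storage, and writes behave like ordinary numpy views. A rough sketch:

    import numpy as np

    data = np.arange(6, dtype=float)               # one block of storage

    # Two different masks over the *same* buffer -- copy=False is the
    # default for np.ma.array, so the data is not duplicated.
    everything = np.ma.array(data, mask=[0, 0, 0, 0, 0, 0])
    evens_only = np.ma.array(data, mask=[0, 1, 0, 1, 0, 1])

    evens_only.sum()       # 6.0  -- 0 + 2 + 4, the masked entries are skipped
    everything.sum()       # 15.0 -- same storage, different mask over it

    everything[1] = 100.0  # writes go straight through to the shared buffer
    data[1]                # 100.0

Whatever the new implementation ends up looking like, keeping that "mask over a view" picture is what would let 1) and 2) coexist without copying data around.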
Chuck