On Tue, Jun 28, 2011 at 9:06 AM, Nathaniel Smith <n...@pobox.com> wrote:
> On Mon, Jun 27, 2011 at 2:03 PM, Mark Wiebe <mwwi...@gmail.com> wrote:
> > On Mon, Jun 27, 2011 at 12:18 PM, Matthew Brett <matthew.br...@gmail.com>
> > wrote:
> >> You won't get complaints, you'll just lose a group of users, who will,
> >> I suspect, stick to NaNs, unsatisfactory as they are.
> >
> > This blade cuts both ways, we'd lose a group of users if we don't support
> > masking semantics, too.
>
> The problem is, that's inevitable. One might think that trying to find
> a compromise solution that picks a few key aspects of each approach
> would be a good way to make everyone happy, but in my experience, it
> mostly leads to systems that are a muddled mess and that make everyone
> unhappy. You're much better off saying screw it, these goals are in
> scope and those ones aren't, and we're going to build something
> consistent and powerful instead of focusing on how long the feature
> list is. That's also the problem with focusing too much on a list of
> use cases: you might capture everything on any single list, but there
> are actually an infinite variety of use cases that will arise in the
> future. If you can generalize beyond the use cases to find some simple
> and consistent mental model, and implement that, then that'll work for
> all those future use cases too. But sometimes that requires deciding
> what *not* to implement.
>
> Just my opinion, but it's fairly hard won.
>
> Anyway, it's pretty clear that in this particular case, there are two
> distinct features that different people want: the missing data
> feature, and the masked array feature. The more I think about it, the
> less I see how they can be combined into one dessert topping + floor
> wax solution. Here are three particular points where they seem to
> contradict each other:
>
> Missing data: We think memory usage is critical. The ideal solution
> has zero overhead. If we can't get that, then at the very least we
> want the overhead to be 1 bit/item instead of 1 byte/item.
> Masked arrays: We say, it's critical to have good ways to manipulate
> the masking array, share it between multiple arrays, and so forth. And
> numpy already has great support for all those things! So obviously the
> masking array should be exposed as a standard ndarray.
>
> Missing data: Once you've assigned NA to a value, you should *not* be
> able to get at what was stored there before.
> Masked arrays: You must be able to unmask a value and recover what was
> stored there before.
>
> (You might think, what difference does it make if you *can* unmask an
> item? Us missing data folks could just ignore this feature. But:
> whatever we end up implementing is something that I will have to
> explain over and over to different people, most of them not
> particularly sophisticated programmers. And there's just no sensible
> way to explain this idea that if you store some particular value, then
> it replaces the old value, but if you store NA, then the old value is
> still there. They will get confused, and then store it away as another
> example of how computers are arbitrary and confusing and they're just
> too dumb to understand them, and I *hate* doing that to people. Plus
> the more that happens, the more they end up digging themselves into
> some hole by trying things at random, and then I have to dig them out
> again. So the point is, we can go either way, but in both ways there
> *is* a cost, and we have to decide.)
>
> Missing data: It's critical that NAs propagate through reduction
> operations by default, though there should also be some way to turn
> this off.
> Masked arrays: Masked values should be silently ignored by reduction
> operations, and having to remember to pass a special flag to turn on
> this behavior on every single ufunc call would be a huge pain.
>
> (Masked array advocates: please correct me if I'm misrepresenting you
> anywhere above!)
>
> > That said, Travis favors doing both, so there's a good chance there will be
> > time for it.
>
> One issue with the current draft is that I don't see any addressing of
> how masking-missing and bit-pattern-missing interact:
>   a = np.zeros(10, dtype="NA[f8]")
>   a.flags.hasmask = True
>   a[5] = np.NA # Now what?
>
> If you're going to implement both things anyway, and you need to
> figure out how they interact anyway, then why not split them up into
> two totally separate features?
>
> Here's my proposal:
> 1) Add purely dtype-based support for missing data:
> 1.A) Add some flags/metadata to the dtype structure to let it describe
> what a missing value looks like for an element of its type. Something
> like, an example NA value plus a function that can be called to
> identify NAs when they occur in arrays. (Notice that this interface is
> general enough to handle both the bit-stealing approach and the
> maybe() approach.)
> 1.B) Add an np.NA object, and teach the various coercion loops to use
> the above fields in the dtype structure to handle it.
> 1.C) Teach the various reduction loops that if a particular flag is
> set in the dtype, then they also should check for NAs and handle them
> appropriately. (If this flag is not set, then it means that this
> dtype's ufunc loops are already NA aware and the generic machinery is
> not needed unless skipmissing=True is given. This is useful for
> user-defined dtypes, and probably also a nice optimization for floats
> using NaN.)
> 1.D) Finally, as a convenience, add some standard NA-aware dtypes.
> Personally, I wouldn't bother with the complicated string-based
> mini-language described in the current NEP; just define some standard
> NA-enabled dtype objects in the numpy namespace or provide a function
> that takes a dtype + a NA bit-pattern and spits out an NA-enabled
> dtype or whatever.
>
> 2) Add better masked array support.
> 2.A) Masked arrays are simply arrays with an extra attribute
> '.visible', which is an arbitrary numpy array that is broadcastable to
> the same shape as the masked array. There's no magic here -- if you
> say a.visible = b.visible, then they now share a visibility array,
> according to the ordinary rules of Python assignment. (Well, there
> needs to be some check for shape compatibility, but that's not much
> magic.)
> 2.B) To minimize confusion with the missing value support, the way you
> mask/unmask items is through expressions like 'a.visible[10] = False';
> there is no magic np.masked object. (There are a few options for what
> happens when you try to use scalar indexing explicitly to extract an
> invisible value -- you could return the actual value from behind the
> mask, or throw an error, or return a scalar masked array whose
> .visible attribute was a scalar array containing False. I don't know
> what the people who actually use this stuff would prefer :-).)
> 2.C) Indexing and shape-changing operations on the masked array are
> automatically applied to the .visible array as well. (Attempting to
> call .resize() on an array which is being used as the .visible
> attribute of some other array is an error.)
> 2.D) Ufuncs on masked arrays always ignore invisible items. We can
> probably share some code here between the handling of skipmissing=True
> for NA-enabled dtypes and invisible items in masked arrays, but that's
> purely an implementation detail.
>
> This approach to masked arrays requires that the ufunc machinery have
> some special knowledge of what a masked array is, so masked arrays
> would have to become part of the core. I'm not sure whether or not
> they should be part of the np.ndarray base class or remain as a
> subclass, though. There's an argument that they're more of a
> convenience feature like np.matrix, and code which interfaces between
> ndarrays and C becomes more complicated if it has to be prepared to
> handle visibility. (Note that in contrast, ndarrays can already
> contain arbitrary user-defined dtypes, so the missing value support
> proposed here doesn't add any new issues to C interfacing.) So maybe
> it'd be better to leave it as a core supported subclass? Could go
> either way.
>

Nathaniel, an implementation using masks will look *exactly* like an
implementation using na-dtypes from the user's point of view, except
that taking a masked view of an unmasked array allows ignoring values
without destroying or copying the original data. The only downside I
can see to an implementation using masks is memory and disk storage,
and perhaps memory mapped arrays. And I rather expect the former to
solve itself in a few years; eight gigs is becoming a baseline for
workstations, and in a couple of years I expect that to be up around
16-32, and a few years after that.... In any case we are talking
12% - 25% overhead, and in practice I expect it won't be quite as big a
problem as folks project.

Chuck
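
For readers following the thread, the two default behaviours being contrasted can be approximated with tools that already exist in NumPy. This is only a rough sketch under stated assumptions: NaN stands in for a bit-pattern NA, and numpy.ma stands in for masking. It does not use np.NA, NA-enabled dtypes, or the '.visible' attribute, which are proposals in this thread rather than existing interfaces. The overhead arithmetic at the end assumes the 12% - 25% figure refers to a one-byte mask per 8-byte or 4-byte element, which the message does not state explicitly.

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0])

# "Missing data" flavour, approximated with NaN: the stored value is
# overwritten, and the missing value propagates through reductions
# unless skipping is requested explicitly.
with_nan = data.copy()
with_nan[2] = np.nan
nan_sum = with_nan.sum()           # propagates: result is NaN
nan_skipped = np.nansum(with_nan)  # explicit opt-in to skipping

# "Masked array" flavour, using numpy.ma: reductions ignore masked
# items by default, and the value behind the mask is preserved.
masked = np.ma.masked_array(data.copy(), mask=[False, False, True, False])
masked_sum = masked.sum()          # masked item is silently ignored
masked.mask = False                # unmask everything
recovered = masked[2]              # the original value is still there

# Storage overhead of a one-byte-per-item mask relative to the data.
mask_bytes = np.dtype(bool).itemsize
overhead_f8 = mask_bytes / np.dtype(np.float64).itemsize
overhead_f4 = mask_bytes / np.dtype(np.float32).itemsize

print(nan_sum, nan_skipped, masked_sum, recovered, overhead_f8, overhead_f4)
```

The same array therefore gives a NaN sum in the first style and a finite sum in the second, which is the disagreement over defaults described in the quoted message.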