On Sat, Jun 25, 2011 at 9:26 AM, Matthew Brett <matthew.br...@gmail.com> wrote:
> So far I see the difference between 1) and 2) being that you cannot
> unmask.  So, if you didn't even know you could unmask data, then it
> would not matter that 1) was being implemented by masks?

I guess that is a difference, but I'm trying to get at something more fundamental -- not just what operations are allowed, but what operations people *expect* to be allowed. It seems like some of us have been talking past each other a lot, where someone says "but changing masks is the single most important feature!" and then someone else says "what are you talking about, that doesn't even make sense".

> To clarify, you're proposing for:
>
> a = np.sum(np.array([np.NA, np.NA]))
>
> 1) -> np.NA
> 2) -> 0.0

Yes -- and in R you actually do get NA, while in numpy.ma you actually do get 0. I don't think this is a coincidence; I think it's because they're designed as coherent systems that are trying to solve different problems. (Well, numpy.ma's "hardmask" idea seems inspired by the missing-data concept rather than the temporary-mask concept, but aside from that it seems pretty consistent in implementing option 2.)

Here's another possible difference -- in (1), intuitively, missingness is a property of the data, so the logical place to put information about whether you can expect missing values is in the dtype, and to enable missing values you need to make a new array with a new dtype. (If we use a mask-based implementation, then np.asarray(nomissing_array, dtype=yesmissing_type) would still be able to skip making a copy of the data -- I'm talking ONLY about the interface here, not whether missing data has a different storage format from non-missing data.)

In (2), the whole point is to use different masks with the same data, so I'd argue masking should be a property of the array object rather than the dtype, and the interface should logically allow masks to be created, modified, and destroyed in place.

They're both internally consistent, but I think we might have to make a decision and stick to it.

> I agree it's good to separate the API from the implementation.  I
> think the implementation is also important because I care about memory
> and possibly speed.
> But, that is a separate problem from the API...

Yes, absolutely memory and speed are important. But a really fast solution to the wrong problem isn't so useful either :-).

-- Nathaniel
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
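
For readers who want to poke at the two interfaces discussed in this message, here is a minimal sketch that uses only features available in released NumPy. It rests on assumptions: np.NA is a proposed name and not an existing one, so NaN stands in for a missing value in option 1 (NaN is only an approximation of the proposed semantics), and numpy.ma stands in for option 2. The variable names are made up for the illustration.

```python
import numpy as np

# Option 1 (missing data), approximated with NaN as a stand-in for NA.
# Assumption for illustration only: NaN is not the proposed np.NA.
missing = np.array([np.nan, np.nan])
strict_total = np.sum(missing)        # the unknown value propagates
skipping_total = np.nansum(missing)   # missing entries are skipped

# Option 2 (temporary masks): one data buffer, several masks over time.
data = np.array([1.0, 2.0, 3.0])
view = np.ma.array(data, mask=[False, True, False])
first_total = view.sum()              # the masked 2.0 is ignored

view.mask = [True, False, False]      # change the mask in place
second_total = view.sum()             # now the masked 1.0 is ignored

view.mask = np.ma.nomask              # drop the mask entirely
unmasked_total = view.sum()

print(strict_total, skipping_total)
print(first_total, second_total, unmasked_total)
```

The point of the second half is that the values in data never change; only the mask does, which is the "created, modified, and destroyed in place" behaviour argued for under option 2. In the first half, "missing" is part of the values themselves, so there is nothing to unmask.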