On Fri, Jun 24, 2011 at 6:59 AM, Matthew Brett <[email protected]> wrote:
> Hi,
>
> On Fri, Jun 24, 2011 at 2:32 AM, Nathaniel Smith <[email protected]> wrote:
> ...
> > If we think that the memory overhead for floating point types is too
> > high, it would be easy to add a special case where maybe(float) used a
> > distinguished NaN instead of a separate boolean. The extra complexity
> > would be isolated to the 'maybe' dtype's inner loop functions, and
> > transparent to the Python level. (Implementing a similar optimization
> > for the masking approach would be really nasty.) This would change the
> > overhead comparison to 0% versus 12.5% in favor of the dtype approach.
>
> Can I take this chance to ask Mark a bit more about the problems he
> sees for the dtypes with missing values? That is, have
>
> np.float64_with_missing
> np.int32_with_missing
>
> type dtypes. I see in your NEP you say 'The trouble with this
> approach is that it requires a large amount of special case code in
> each data type, and writing a new data type supporting missing data
> requires defining a mechanism for a special signal value which may not
> be possible in general.'
>
> Just to be clear, you are saying that, for each dtype, there
> needs to be some code doing:
>
> missing_value = dtype.missing_value
>
> then, in loops:
>
> if val[here] == missing_value:
>     do_something()
>
> and the fact that 'missing_value' could be any type would make the
> code more complicated than the current case where the mask is always
> bools or something?
>

I'm referring to the underlying C implementations of the dtypes and any
additional custom dtypes that people create. With the masked approach, you
implement a new custom data type in C, and it automatically works with
missing data. With the custom dtype approach, you have to do a lot more
error-prone work to handle the special values in all the ufuncs.

>
> Nathaniel's point about reduction in storage needed for the mask to 0
> is surely significant if we want numpy to be the best choice for big
> data.
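As a rough illustration of the storage figures quoted above (0% versus 12.5%), here is a minimal Python sketch that uses only the standard library rather than NumPy itself. It assumes the mask costs one byte per element next to an 8-byte float64, and it simplifies the "distinguished NaN" idea by treating any NaN as missing. The function names are hypothetical and are not part of any NumPy API.

```python
import math
from array import array


def sum_with_sentinel(values):
    """Sum the values, treating NaN as the missing-value marker (assumption)."""
    return sum(v for v in values if not math.isnan(v))


def sum_with_mask(values, missing):
    """Sum the values whose entry in the parallel mask is false."""
    return sum(v for v, m in zip(values, missing) if not m)


# Masked layout: the data plus a separate one-byte flag per element.
data = array("d", [1.0, 2.0, 0.0, 4.0])
mask = array("b", [0, 0, 1, 0])

# Sentinel layout: the missing element is stored as NaN, with no extra array.
sentinel_data = array("d", [1.0, 2.0, math.nan, 4.0])

# Extra storage per element, relative to the data itself.
mask_overhead = mask.itemsize / data.itemsize
sentinel_overhead = 0 / data.itemsize
```

Both layouts give the same sum here. The difference is that the sentinel layout cannot tell a genuinely computed NaN from a missing value unless a specific NaN bit pattern is reserved for that purpose, which this sketch does not attempt.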
The mask will only be there if it's explicitly requested, so it's not taking
away from NumPy in any way. If someone is dealing with data that large, it
likely wouldn't always match the particular NA conventions NumPy chooses
for the various primitive data types, so that approach isn't a clear win
either.

> You mention that it would be good to allow masking for any new dtype -
> is that a practical problem? I mean, how many people will in fact
> have the combination of a) need of masking b) need of custom dtype,
> and c) lack of time or expertise to implement masking for that type?
>

Well, the people who need that right now will probably look at the NumPy C
source code and give up immediately. I'd rather push the system in a
direction of it being easier for those people than harder.

It should be possible to define a C++ data type class with overloaded
operators, then say NPY_EXPOSE_DTYPE(MyCustomClass), which would wrap those
overloaded operators with NumPy conventions. If this were done, I suspect
many people would create custom data types.

-Mark

>
> Thanks a lot for the proposal and the discussion,
>
> Matthew
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
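The argument that a mask "automatically works" with a new data type can be sketched in plain Python. This is only an illustration under stated assumptions: `masked_apply` is a hypothetical helper, not a NumPy function, and `fractions.Fraction` stands in for a user-defined type with overloaded operators. The missing-value logic is written once and never looks inside the element type.

```python
import operator
from fractions import Fraction


def masked_apply(op, a, a_missing, b, b_missing):
    """Apply op elementwise; an output is missing if either input is missing.

    The element type is never inspected, so any type that supports op works.
    None is used here only as a placeholder for a masked-out result.
    """
    out, out_missing = [], []
    for x, xm, y, ym in zip(a, a_missing, b, b_missing):
        missing = bool(xm or ym)
        out_missing.append(missing)
        out.append(None if missing else op(x, y))
    return out, out_missing


# Built-in floats.
float_out, float_mask = masked_apply(
    operator.add,
    [1.5, 2.0, 3.0], [False, True, False],
    [0.5, 1.0, 1.0], [False, False, False],
)

# A "custom" type with overloaded operators; no type-specific code is needed.
frac_out, frac_mask = masked_apply(
    operator.add,
    [Fraction(1, 2), Fraction(1, 3)], [False, False],
    [Fraction(1, 2), Fraction(1, 6)], [False, True],
)
```

With a sentinel-based design, each element type would instead need its own reserved value and its own checks in every operation, which is the extra per-dtype work described in the reply above.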
