On Sat, Jun 25, 2011 at 6:17 AM, Matthew Brett <[email protected]> wrote:
> Hi,
>
> On Sat, Jun 25, 2011 at 2:10 AM, Mark Wiebe <[email protected]> wrote:
> > On Fri, Jun 24, 2011 at 7:02 PM, Matthew Brett <[email protected]>
> > wrote:
> >>
> >> Hi,
> >>
> >> On Sat, Jun 25, 2011 at 12:22 AM, Wes McKinney <[email protected]>
> >> wrote:
> >> ...
> >> > Perhaps we should make a wiki page someplace summarizing pros and cons
> >> > of the various implementation approaches?
> >>
> >> But - we should do this if it really is an open question which one we
> >> go for. If not then, we're just slowing Mark down in getting to the
> >> implementation.
> >>
> >> Assuming the question is still open, here's a starter for the pros and
> >> cons:
> >>
> >> array.mask
> >> 1) It's easier / neater to implement
> >
> > Yes
> >
> >>
> >> 2) It can generalize across dtypes
> >
> > Yes
> >
> >>
> >> 3) You can still get the masked data underneath the mask (allowing you
> >> to unmask etc)
> >
> > By setting up views appropriately, yes. If you don't have another view to
> > the underlying data, you can't get at it.
> >>
> >> nafloat64:
> >> 1) No memory overhead
> >
> > Yes
> >
> >>
> >> 2) Battle-tested implementation already done in R
> >
> > We can't really use that though, R is GPL and NumPy is BSD. The low-level
> > implementation details are likely different enough that a
> > re-implementation would be needed anyway.
>
> Right - I wasn't suggesting using the code, only that the idea can be
> made to work coherently with an API that seems to have won friends
> over time.
>

OK, so I think you mean a battle-tested implementation of the interface R
exposes. That interface can be implemented with either masks or NA bit
patterns, I don't believe it has anything specific to bit patterns inherent
in it.

>
> >> I guess we'd have to test directly whether the non-continuous memory
> >> of the mask and data would cause enough cache-miss problems to
> >> outweigh the potential cycle-savings from single byte comparisons in
> >> array.mask.
> >
> > The different memory buffers are each contiguous, so the access patterns
> > still have a lot of coherency. I intend to give the mask memory layouts
> > matching those of the arrays.
> >>
> >> I guess that one and only one of these will get written. I guess that
> >> one of these choices may be a lot more satisfying to the current and
> >> future masked array itch than the other.
> >
> > I'm only going to implement one solution, yes.
> >>
> >> I'm personally worried that the memory overhead of array.masks will
> >> make many of us tend to avoid them. I work with images that can
> >> easily get large enough that I would not want an array-items size byte
> >> array added to my storage.
> >
> > May I ask what kind of dtypes and sizes you're working with?
>
> dtypes for images usually end up as floats - float32 or float64. On
> disk, and when memory mapped, they are often int16 or uint16. Sizes
> vary from fairly small 3D images of say 64 x 64 x 32 (1M in float64)
> to rather large 4D images - say 256 x 256 x 50 x 500 at the very high
> end (12.5G in float64).
>

OK, so the mask would be an extra 128KB or 1.6G, respectively.

> >> The reason I'm asking for more details about the implementation is
> >> because that is most of the argument for array.mask at the moment (1
> >> and 2 above).
> >
> > I'm first trying to nail down more of the higher level requirements
> > before digging really deep into the implementation details. They greatly
> > affect how those details have to turn out.
>
> Once you've started with the array.mask framework, you've committed
> yourself to the memory hit, and you may lose potential users who often
> hit memory limits. My guess is that no-one currently using np.ma is
> in that category, because it also uses a separate mask array, as I
> understand it.
>

In the same way, if I start with the NA bit pattern framework, I've committed
to throwing away the underlying values, and I will lose potential users who
want to keep them.
This tradeoff goes both ways, it looks like nobody would be completely
satisfied with only one of the two approaches.

-Mark

>
> See you,
>
> Matthew
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>
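As a rough check of the sizes quoted in this thread, the following Python
sketch works out the data and mask footprints for the two image shapes
mentioned. It assumes the mask costs one byte per array element, as the
"array-items size byte array" wording suggests. The names mask_overhead,
ITEMSIZE and MASK_BYTES_PER_ELEMENT are illustrative only and are not part
of NumPy or of any proposed implementation.

```python
from math import prod

# Assumption: the mask stores one byte per array element, matching the
# "array-items size byte array" described in the thread.
MASK_BYTES_PER_ELEMENT = 1

# Item sizes in bytes for the dtypes mentioned in the thread.
ITEMSIZE = {"float64": 8, "float32": 4, "int16": 2, "uint16": 2}


def mask_overhead(shape, dtype="float64"):
    """Return (data_bytes, mask_bytes) for an array of the given shape."""
    n_items = prod(shape)
    return n_items * ITEMSIZE[dtype], n_items * MASK_BYTES_PER_ELEMENT


# The two image shapes quoted in the thread, stored as float64.
small = mask_overhead((64, 64, 32))
large = mask_overhead((256, 256, 50, 500))

for name, (data_bytes, mask_bytes) in (("small", small), ("large", large)):
    print(f"{name}: data={data_bytes} bytes, mask={mask_bytes} bytes, "
          f"overhead={mask_bytes / data_bytes:.1%}")
```

Under that assumption the small image needs 131072 mask bytes (128 KiB) and
the large one 1638400000 mask bytes (about 1.6 GB), which is a 12.5% overhead
on float64 data. The same one-byte mask would be a 50% overhead on int16
data, since each element is then only two bytes.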
