On Sun, Jun 17, 2012 at 6:10 AM, Nathaniel Smith <[email protected]> wrote:
> On Wed, Jun 13, 2012 at 7:54 PM, Wes McKinney <[email protected]> wrote:
>> It looks like the levels can only be strings. This is too limited for
>> my needs. Why not support all possible NumPy dtypes? In pandas world,
>> the levels can be any unique Index object
>
> It seems like there are three obvious options, from most to least general:
>
> 1) Allow levels to be an arbitrary collection of hashable Python objects
> 2) Allow levels to be a homogeneous collection of objects of any
>    arbitrary numpy dtype
> 3) Allow levels to be chosen from a few fixed types (strings and ints,
>    I guess)
>
> I agree that (3) is a bit limiting. (1) is probably easier to
> implement than (2). (2) is the most general, since of course
> "arbitrary Python object" is a dtype. Is it useful to be able to
> restrict levels to be of homogeneous type? The main difference between
> dtypes and python types is that (most) dtype scalars can be unboxed --
> is that substantively useful for levels?
>
>> What is the story for NA values (NaL?) in a factor array? I code them
>> as -1 in the labels, though you could use INT32_MAX or something. This
>> is very important in the context of groupby operations.
>
> If we have a type restriction on levels (options (2) or (3) above),
> then how to handle out-of-bounds values is quite a problem, yeah. Once
> we have NA dtypes then I suppose we could use those, but we don't yet.
> It's tempting to just error out of any operation that encounters such
> values.
>
>> Nathaniel: my experience (see blog posting above for a bit more) is
>> that khash really crushes PyDict for two reasons: you can use it with
>> primitive types and avoid boxing, and secondly you can preallocate.
>> Its memory footprint with large hashtables is also a fraction of
>> PyDict. The Python memory allocator is not problematic-- if you create
>> millions of Python objects expect the RAM usage of the Python process
>> to balloon absurdly.
>
> Right, I saw that posting -- it's clear that khash has a lot of
> advantages as internal temporary storage for a specific operation like
> groupby on unboxed types. But I can't tell whether those arguments
> still apply now that we're talking about a long-term storage
> representation for data that has to support a variety of operations
> (many of which would require boxing/unboxing, since the API is in
> Python), might or might not use boxed types, etc. Obviously this also
> depends on which of the three options above we go with -- unboxing
> doesn't even make sense for option (1).
>
> -n
> _______________________________________________
> NumPy-Discussion mailing list
> [email protected]
> http://mail.scipy.org/mailman/listinfo/numpy-discussion

I'm in favor of option #2 (a lite version of what I'm doing currently --
I handle a few dtypes: PyObject, int64, datetime64, float64), though
you'd have to go the code-generation route for all the dtypes to keep
yourself sane if you do that.

- Wes
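
For readers following the thread, here is a rough sketch of what option (2) combined with the "-1 in the labels" convention for NA might look like. It is not the pandas or NumPy implementation: `factorize` here is a hypothetical helper name, the rule for what counts as missing (NaN for floats, None for object arrays) is an assumption, and it uses `np.unique` (sort-based) rather than a khash-style hash table.

```python
import numpy as np

def factorize(values, na_sentinel=-1):
    """Hypothetical helper: encode values as integer labels plus a levels array.

    Assumption for this sketch: NaN (float arrays) and None (object arrays)
    count as missing and are coded as na_sentinel in the labels.
    """
    values = np.asarray(values)
    if values.dtype.kind == "f":
        missing = np.isnan(values)
    elif values.dtype.kind == "O":
        missing = np.array([v is None for v in values], dtype=bool)
    else:
        missing = np.zeros(len(values), dtype=bool)

    # np.unique returns the sorted unique values (the levels, which keep the
    # input dtype) and, for each input value, its position in those levels.
    levels, codes = np.unique(values[~missing], return_inverse=True)

    # Start with every label set to the sentinel, then fill in the real codes.
    labels = np.full(len(values), na_sentinel, dtype=np.int64)
    labels[~missing] = codes.reshape(-1)
    return labels, levels

labels, levels = factorize(np.array([3.0, np.nan, 1.0, 3.0]))
print(labels)  # [ 1 -1  0  1]
print(levels)  # [1. 3.]
```

Because the levels are an ordinary homogeneous array, `levels[labels[labels >= 0]]` recovers the non-missing values without boxing, while the sentinel keeps missing entries out of the levels entirely.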
