In 0.15.0, pandas will have full-fledged support for categoricals, which in effect let you map a small set of distinct strings to integers.
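
As a rough illustration of that mapping, here is a minimal sketch using `pd.Categorical` with its `codes` and `categories` attributes as found in released pandas. The color names are made-up sample data, and the details of the API in the development branch may differ.

```python
import pandas as pd

# Hypothetical sample data: many records, few distinct string values.
names = ["red", "green", "red", "blue", "green", "red"]

# Build a categorical: each distinct string is stored once in `categories`,
# and every element is represented by a small integer code.
cat = pd.Categorical(names)

categories = list(cat.categories)  # distinct values, sorted lexically by default
codes = list(cat.codes)            # one integer per element, indexing into categories

# The original strings can be recovered by indexing categories with the codes.
roundtrip = list(cat.categories[cat.codes])

print(categories)
print(codes)
print(roundtrip == names)
```

Comparisons and group-by style operations can then work on the integer codes rather than on the strings themselves.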

This is now in pandas master:
http://pandas-docs.github.io/pandas-docs-travis/categorical.html

Feedback welcome!

> On Jul 14, 2014, at 1:00 PM, Olivier Grisel <olivier.gri...@ensta.org> wrote:
>
> 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndar...@mac.com>:
>>
>>> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>>
>>> I feel like for most purposes, what we *really* want is a variable length
>>> string dtype (i.e., where each element can be a different length).
>>
>> I've been toying with the idea of creating an array type for interned
>> strings. In many applications dealing with large arrays of variable size
>> strings, the strings come from a relatively short set of names. Arrays of
>> interned strings can be manipulated very efficiently because in many respects
>> they are just like arrays of integers.
>
> +1 I think this is why pandas is using dtype=object to load string
> data: in many cases short string values are used to represent
> categorical variables with a comparatively small cardinality of
> possible values for a dataset with comparatively numerous records.
>
> In that case the dtype=object is not that bad as it just stores
> pointers to string objects managed by Python. It's possible to intern
> the strings manually at load time (I don't know if pandas or python
> already do it automatically in that case). The integer semantics is
> good for that case. Having an explicit dtype might be even better.
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion