On Mon, Jul 14, 2014 at 10:00 AM, Olivier Grisel <olivier.gri...@ensta.org> wrote:
> 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndar...@mac.com>:
> > I've been toying with the idea of creating an array type for interned
> > strings. In many applications dealing with large arrays of variable size
> > strings, the strings come from a relatively short set of names. Arrays of
> > interned strings can be manipulated very efficiently because in many
> > respects they are just like arrays of integers.
>
> +1 I think this is why pandas is using dtype=object to load string
> data: in many cases short string values are used to represent
> categorical variables with a comparatively small cardinality of
> possible values for a dataset with comparatively numerous records.

Pandas has a new "categorical" type (just merged into master) which is
pretty similar to interned strings:

https://github.com/pydata/pandas/pull/7217
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

Of course, it would be ideal for numpy itself to natively support
categoricals and variable-length strings.

Best,
Stephan
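
P.S. To make the idea concrete, here is a minimal pure-Python sketch of
what "interned" or categorical strings amount to: each distinct string is
stored once, and the records hold small integer codes. This is only an
illustration under my own assumptions, not how pandas or numpy implement
it; `encode_categorical` is a hypothetical helper, not an existing API.

```python
from array import array


def encode_categorical(values):
    """Map each distinct string to a small integer code (hypothetical helper)."""
    categories = []     # distinct strings, in order of first appearance
    index = {}          # string -> integer code
    codes = array("l")  # one machine integer per record
    for value in values:
        code = index.get(value)
        if code is None:
            # First time we see this string: assign it the next free code.
            code = len(categories)
            index[value] = code
            categories.append(value)
        codes.append(code)
    return categories, codes


names = ["red", "green", "red", "blue", "green", "red"]
categories, codes = encode_categorical(names)

# An equality test becomes an integer comparison against a single code.
red = categories.index("red")
is_red = [c == red for c in codes]

# Decoding recovers the original strings.
decoded = [categories[c] for c in codes]
```

With few distinct values and many records, the per-record cost is one
integer rather than one string object, which is why such arrays behave
much like arrays of integers.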