On Mon, Jul 14, 2014 at 10:00 AM, Olivier Grisel <olivier.gri...@ensta.org>
wrote:

> 2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndar...@mac.com>:
> > I've been toying with the idea of creating an array type for interned
> > strings.  In many applications dealing with large arrays of variable size
> > strings, the strings come from a relatively short set of names.  Arrays of
> > interned strings can be manipulated very efficiently because in many
> > respects they are just like arrays of integers.
>
> +1 I think this is why pandas is using dtype=object to load string
> data: in many cases short string values are used to represent
> categorical variables with a comparatively small cardinality of
> possible values for a dataset with comparatively numerous records.
>

Pandas has a new "categorical" type (just merged into master) which is
pretty similar to interned strings:
https://github.com/pydata/pandas/pull/7217
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html

Of course, it would be ideal for numpy itself to natively support
categoricals and variable-length strings.
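
To make the "just like arrays of integers" point concrete, here is a rough
sketch of interning by hand with plain numpy. It is only an illustration, not
the proposed array type and not how pandas implements its categorical; the
sample data and the variable names (names, categories, codes) are made up for
the example.

```python
import numpy as np

# Hypothetical sample data: many records drawn from a few distinct names.
names = np.array(["spam", "eggs", "spam", "ham", "eggs", "spam"])

# "Intern" the strings: `categories` stores each distinct string once
# (in sorted order) and `codes` stores one integer per record that
# indexes into `categories`.
categories, codes = np.unique(names, return_inverse=True)

# With few categories a narrow integer type is enough (assumes < 128 here).
codes = codes.astype(np.int8)

# Comparisons and counting now operate on integers, not on strings.
spam_code = np.searchsorted(categories, "spam")
is_spam = codes == spam_code
counts = np.bincount(codes, minlength=len(categories))

# The original strings can be recovered by fancy indexing.
restored = categories[codes]
```

The cost is that the mapping has to be carried around and kept in sync by
hand, which is the bookkeeping a native categorical or interned-string dtype
would take over.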

Best,
Stephan
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion