2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndar...@mac.com>:
>
> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>
>> I feel like for most purposes, what we *really* want is a variable length
>> string dtype (i.e., where each element can be a different length).
>
> I've been toying with the idea of creating an array type for interned
> strings. In many applications dealing with large arrays of variable size
> strings, the strings come from a relatively short set of names. Arrays of
> interned strings can be manipulated very efficiently because in many respects
> they are just like arrays of integers.
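
To make the quoted idea concrete, here is a minimal sketch of how interned strings and integer codes can be emulated with what NumPy and Python already provide. It is not the proposed array type: the sample data and all variable names are made up for illustration, and np.unique is only a stand-in for whatever a dedicated dtype would do internally.

```python
import sys
import numpy as np

# Hypothetical input: many records drawn from a small set of names.
# split() creates a separate str object for every field, as a parser would.
names = "red,green,red,blue,green,red".split(",")

# Manual interning: equal strings collapse to a single shared object, so a
# dtype=object array stores repeated pointers to the same few strings.
interned = np.array([sys.intern(s) for s in names], dtype=object)

# Integer-code view: `categories` holds the sorted unique strings and
# `codes` holds, for each record, the index of its string in `categories`.
categories, codes = np.unique(names, return_inverse=True)

# Comparisons and counts now operate on a plain integer array.
red_code = int(np.flatnonzero(categories == "red")[0])
is_red = codes == red_code
counts = np.bincount(codes, minlength=len(categories))

# The original strings are recovered by indexing with the codes.
decoded = categories[codes]
```

The trade-off is that the codes are only meaningful together with their `categories` array, which is the bookkeeping an explicit dtype would presumably take over.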

+1

I think this is why pandas uses dtype=object to load string data: in many
cases short string values represent categorical variables whose set of
possible values is small compared with the number of records in the
dataset. In that case dtype=object is not that bad, as the array just stores
pointers to string objects managed by Python. It is possible to intern the
strings manually at load time (I don't know whether pandas or Python already
does this automatically in that case). The integer semantics are a good fit
for that case. Having an explicit dtype might be even better.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion