2014-07-13 19:05 GMT+02:00 Alexander Belopolsky <ndar...@mac.com>:
>
> On Sat, Jul 12, 2014 at 8:02 PM, Nathaniel Smith <n...@pobox.com> wrote:
>>
>> I feel like for most purposes, what we *really* want is a variable length
>> string dtype (i.e., where each element can be a different length).
>
>
>
> I've been toying with the idea of creating an array type for interned
> strings.  In many applications dealing with large arrays of variable size
> strings, the strings come from a relatively short set of names.  Arrays of
> interned strings can be manipulated very efficiently because in many respects
> they are just like arrays of integers.

+1. I think this is why pandas uses dtype=object to load string
data: in many cases short string values represent categorical
variables whose cardinality is small compared with the number of
records in the dataset.

In that case dtype=object is not that bad, as it just stores
pointers to string objects managed by Python. It's possible to intern
the strings manually at load time (I don't know if pandas or Python
already does it automatically in that case). Integer semantics are a
good fit for that case. Having an explicit dtype might be even better.
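
To make the manual interning idea concrete, here is a rough sketch (not
what pandas actually does, as far as I know). The sample data and the
helper name intern_object_array are made up for illustration; only
sys.intern and np.unique are real calls. It interns the strings while
filling a dtype=object array, then derives integer codes with
np.unique to show the "just like arrays of integers" view:

```python
import sys
import numpy as np

# Hypothetical sample data: many records, few distinct labels.
raw = ["red", "green", "blue", "green", "red", "red"]

# Rebuild each string at runtime, as a file loader would, so that equal
# values are not guaranteed to be the same Python object up front.
values = ["".join(list(s)) for s in raw]


def intern_object_array(strings):
    """Return a dtype=object array whose equal strings share one object.

    Hypothetical helper for illustration only.
    """
    out = np.empty(len(strings), dtype=object)
    for i, s in enumerate(strings):
        # sys.intern returns the canonical object for this string value.
        out[i] = sys.intern(s)
    return out


arr = intern_object_array(values)

# After interning, equal values are the very same object, so the array
# holds repeated pointers to a handful of strings.
same_object = arr[0] is arr[4]

# Integer view: sorted unique labels plus one integer code per record.
labels, codes = np.unique(arr, return_inverse=True)

# The original values can be recovered by indexing labels with the codes.
roundtrip = labels[codes]
```

With the codes in hand, comparisons and group-by style operations can
run on a plain integer array, and the labels are only looked up when
the strings themselves are needed.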

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
