Re: [Numpy-discussion] String type again.

Nathaniel Smith Fri, 18 Jul 2014 03:33:30 -0700

On Thu, Jul 17, 2014 at 10:05 PM, Chris Barker <chris.bar...@noaa.gov> wrote:
> A bit of a higher-level view of the issues at hand.
>
> Python has three relevant data types:
>
> A unicode type (unicode in py2, str in py3)
> A one-byte-per-char stringtype (py2 string)
> A bytes type
>
> The big problem is that py2 only has the unicode and py2string types, and
> py3 only has the unicode and bytes types.
>
> numpy has 'S' and 'U' types: which map naturally to the py2string and
> unicode types.
>
> but since py3 has no py2string type, we have a problem.
>
> If numpy were to embrace the py3 model, then 'S' should have mapped to py3's
> string, aka unicode.
>
> But:
>
> 1) then there would be no bytes type, which is a problem, as people do need
> to pass collections of bytes around. I've always figured numpy's uint8
> should suffice for that, but "strings of bytes" are useful, and it seems to
> be awkward, or maybe impossible, to construct such a beast with the usual
> dtype machinery.
>
> 2) there is a need (or at least a desire), to have a compact,
> one-byte-per-character text type in numpy.
>
> Thinking of it in this framework leads me to the conclusion that numpy
> should have three types:


This sounds pretty reasonable to me.

> 1) A unicode type --no change here
>
> 2) A bytes type -- almost the current 'S' type
>     - A bytes type would map to/from py3 bytes objects (and py2 bytes
> objects, which are the same as py2strings)
>     - one way it would differ from a py2str is that there would be no
> assumption of null-termination (not sure where that is now)

AFAICT this is *exactly* the same as the current 'S' type. What
differences do you see?

> 3) A one-byte-per-char text type -- more or less Chuck's current proposal.
>    - it would map to/from the py3 string -- it is text after all
>    - it would be null-terminated

Numpy string types are never null-terminated ATM. They're
null-padded, which is slightly different. When storing data in an S5,
for instance, strings of length 5 have no nulls appended, strings of
length 4 have 1 null appended, strings of length 3 have 2 nulls
appended, etc. When reading data out of an S5, all trailing nulls
are stripped.

So, they may not be null terminated (if the length of the string
exactly matches the length of the dtype), and the strings being stored
can contain internal nulls ("foo\x00bar" is fine), but they cannot
contain trailing nulls ("foo\x00" will come back as just "foo").
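
A minimal sketch of the padding behaviour described above, assuming
Python 3 and a reasonably recent numpy (the array contents and variable
names are purely illustrative):

```python
import numpy as np

# Byte strings of length 5, 4 and 3 stored in a fixed 5-byte field.
a = np.array([b"hello", b"abcd", b"abc"], dtype="S5")

# The raw buffer shows the padding: 0, 1 and 2 NUL bytes respectively.
raw = a.tobytes()

# Reading an item back strips every trailing NUL.
item4 = a[1]
item3 = a[2]

# Internal NULs survive a round trip; trailing NULs are lost.
b = np.array([b"foo\x00bar", b"foo\x00"], dtype="S7")
internal = b[0]
trailing = b[1]

print(raw)
print(item4, item3)
print(internal, trailing)
```

Note that the length-5 item fills the field completely, so nothing
terminates it in memory.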

Do you actually care about null-termination specifically? Or did you
just mean "it should work like the other ones, which I vaguely
remember involves nulls"? ;-)

>    - it would have a one-byte-per-char encoding: ascii, latin-1 or settable
> (TBA)

Settable is technically very difficult until we redo the dtype
machinery to allow parametrized types.
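
For comparison, a fixed one-byte-per-char encoding can be approximated
today with the existing 'S' type plus explicit encode/decode steps. A
rough sketch, assuming latin-1 as the encoding and using the existing
np.char.encode / np.char.decode helpers (this is an illustration of the
storage trade-off, not the proposed dtype):

```python
import numpy as np

# Unicode array: 'U5' stores 4 bytes per character.
text = np.array(["caf\u00e9", "na\u00efve"])

# Encode to latin-1: the result is an 'S5' array, 1 byte per character.
packed = np.char.encode(text, "latin-1")

# Decode on the way out to get text back.
restored = np.char.decode(packed, "latin-1")

print(text.dtype.itemsize, packed.dtype.itemsize)
print(restored)
```

Here the encoding lives in user code rather than in the dtype, which is
exactly what a parametrized type would avoid.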

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
