On Thu, Jul 17, 2014 at 10:05 PM, Chris Barker <chris.bar...@noaa.gov> wrote:
> A bit of a higher-level view of the issues at hand.
>
> Python has three relevant data types:
>
> A unicode type (unicode in py2, str in py3)
> A one-byte-per-char string type (py2 string)
> A bytes type
>
> The big problem is that py2 only has the unicode and py2string types, and
> py3 only has the unicode and bytes types.
>
> numpy has 'S' and 'U' types, which map naturally to the py2string and
> unicode types.
>
> But since py3 has no py2string type, we have a problem.
>
> If numpy were to embrace the py3 model, then 'S' should have mapped to py3's
> string, aka unicode.
>
> But:
>
> 1) then there would be no bytes type, which is a problem, as people do need
> to pass collections of bytes around. I've always figured numpy's uint8
> should suffice for that, but "strings of bytes" are useful, and it seems to
> be awkward, or maybe impossible, to construct such a beast with the usual
> dtype machinery
>
> 2) there is a need (or at least a desire) to have a compact,
> one-byte-per-character text type in numpy.
>
> Thinking of it in this framework leads me to the conclusion that numpy
> should have three types:

This sounds pretty reasonable to me.

> 1) A unicode type -- no change here
>
> 2) A bytes type -- almost the current 'S' type
> - A bytes type would map to/from py3 bytes objects (and py2 bytes
> objects, which are the same as py2strings)
> - one way it would differ from a py2str is that there would be no
> assumption of null-termination (not sure where that is now)

AFAICT this is *exactly* the same as the current 'S' type. What
differences do you see?

> 3) A one-byte-per-char text type -- more or less Chuck's current proposal.
> - it would map to/from the py3 string -- it is text after all
> - it would be null-terminated

Numpy string types are never null-terminated ATM. They're null-padded,
which is slightly different. When storing data in an S5, for instance,
strings of length 5 have no nulls appended, strings of length 4 have 1
null appended, strings of length 3 have 2 nulls appended, etc. When
reading data out of an S5, all trailing nulls are stripped. So, they
may not be null-terminated (if the length of the string exactly matches
the length of the dtype), and the strings being stored can contain
internal nulls ("foo\x00bar" is fine), but they cannot contain trailing
nulls ("foo\x00" will come back as just "foo").

Do you actually care about null-termination specifically? Or did you
just mean "it should work like the other ones, which I vaguely
remember involves nulls"? ;-)

> - it would have a one-byte-per-char encoding: ascii, latin-1 or settable
> (TBA)

Settable is technically very difficult until we redo the dtype
machinery to allow parametrized types.

-n

-- 
Nathaniel J. Smith
Postdoctoral researcher - Informatics - University of Edinburgh
http://vorpus.org
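
P.S. For concreteness, the null-padding rules described above can be modelled in a few lines of plain Python. This is only a sketch of the described behaviour, not NumPy's implementation: `store` and `load` are hypothetical helper names, and rejecting over-long values is an assumption made to keep the model small.

```python
def store(value: bytes, width: int) -> bytes:
    """Model writing `value` into a fixed-width, null-padded field.

    Hypothetical helper: values longer than the field are rejected here
    purely to keep the model simple.
    """
    if len(value) > width:
        raise ValueError("value longer than field width")
    # Pad on the right with nulls up to the field width; a value that
    # already fills the field gets no null at all.
    return value.ljust(width, b"\x00")


def load(field: bytes) -> bytes:
    """Model reading a field back: every trailing null is stripped."""
    return field.rstrip(b"\x00")


if __name__ == "__main__":
    for text in (b"abcde", b"abcd", b"abc"):
        raw = store(text, 5)
        print(text, "->", raw, "->", load(raw))

    # Internal nulls survive the round trip ...
    print(load(store(b"foo\x00bar", 7)))
    # ... but a trailing null is indistinguishable from padding.
    print(load(store(b"foo\x00", 5)))
```

The point of the model is that a full-width value carries no terminator, so code that scans for a null would run off the end of the field, while a value that really ends in a null cannot be told apart from a shorter padded one.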