We had a big long discussion about this on this list a while back (maybe 2 yrs ago?); please search the archives to find it, though I'm pretty sure that we never did come to a conclusion. I think it started with wanting better support for unicode in loadtxt and the like, and ended up delving into other encodings for the 'U' dtype, and maybe a single-byte string dtype (latin-1), or maybe a variable-size unicode object like Py3's, or...
However, it is absolutely a non-starter to change the binary representation of the 'S' type in any version of numpy. Due to the legacy of py2 (and, indeed, most computing environments), 'S' is a single-byte string representation, and the binary representation is often really key to numpy use. Period, end of story.

That representation maps to a py2 string and a py3 bytes object.

py2 does, of course, have a Unicode object as well. If you want your code (and doctests, and ...) to be compatible, then you should probably go to Unicode strings everywhere. py3 now supports the u'string' no-op literal to make this easier (though I guess the __repr__ won't tack on that 'u', which is going to be a problem for docstrings).

Note also that py3 has added more and more "string-like" support to the bytes object, so it's not too bad to go bytes-only.

-CHB

On Tue, Sep 13, 2016 at 7:21 AM, Lluís Vilanova <vilan...@ac.upc.edu> wrote:

> Sebastian Berg writes:
>
> > On Tue, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
> >> Hi! I'm giving a shot to issue #3184 [1], based on the observation
> >> that the string dtype ('S') under python 3 uses byte arrays instead
> >> of unicode (the only readable string type in python 3).
> >>
> >> This brings two major problems:
> >>
> >> * numpy code has to go through loops to open and read files as binary
> >>   data to load text into a bytes array, and does not play well with
> >>   users providing string (unicode) arguments
> >>
> >> * the repr of these arrays shows strings as b'text' instead of 'text',
> >>   which breaks doctests of software built on numpy
> >>
> >> What I'm trying to do is make dtypes 'S' and 'U' equivalent
> >> (NPY_STRING and NPY_UNICODE).
> >>
> >> Now the question. Keeping 'S' and 'U' as separate dtypes (but same
> >> internal implementation) will provide the best backwards
> >> compatibility, but is more cumbersome to implement.
>
> > I am not sure how that can be possible. Those types are fundamentally
> > different in how they store their data. String types use one byte per
> > character, unicode types will use 4 bytes per character. You can maybe
> > default to unicode in more cases in python 3, but you cannot make them
> > identical internally.
>
> BTW, by identical I mean having two externally visible types, but a common
> implementation in python 3 (that of NPY_UNICODE).
>
> The as-sane but not backwards-compatible option (I'm asking if this is
> acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
> implementation, and making 'U' (and np.unicode_) an alias for 'S' (and
> np.string_).
>
>
> Cheers,
>   Lluis
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion@scipy.org
> https://mail.scipy.org/mailman/listinfo/numpy-discussion
>

--

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

chris.bar...@noaa.gov
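
For illustration, the storage difference described in the quoted exchange (one byte per character for 'S', four bytes per character for 'U') can be sketched with the standard library alone. This is only a rough model of the two fixed-width layouts, not numpy's implementation: the helper names pack_s and pack_u are hypothetical, latin-1 is assumed as the single-byte encoding, and little-endian UTF-32 is assumed for the four-byte case.

```python
# Hypothetical helpers that mimic the two fixed-width layouts.
# They illustrate the byte counts only; numpy itself is not used here.

def pack_s(text, width):
    """Mimic an 'S<width>' element: one byte per character, NUL-padded."""
    raw = text.encode("latin-1")  # assumed single-byte encoding
    if len(raw) > width:
        raise ValueError("text does not fit in the field")
    return raw.ljust(width, b"\x00")


def pack_u(text, width):
    """Mimic a 'U<width>' element: four bytes per code point, NUL-padded."""
    raw = text.encode("utf-32-le")  # assumed little-endian, no BOM
    if len(raw) > 4 * width:
        raise ValueError("text does not fit in the field")
    return raw.ljust(4 * width, b"\x00")


s_field = pack_s("abc", 5)
u_field = pack_u("abc", 5)

# The same three characters occupy very different amounts of memory,
# which is why the two layouts cannot share one binary representation.
print(len(s_field), len(u_field))
```

Any code that reads the raw buffer (memory-mapped files, binary I/O, C extensions) depends on these sizes, so swapping one layout for the other would change what that code sees.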