Sebastian Berg writes:

> On Tue, 2016-09-13 at 15:02 +0200, Lluís Vilanova wrote:
>> Hi! I'm giving a shot to issue #3184 [1], based on the observation that the
>> string dtype ('S') under python 3 uses byte arrays instead of unicode (the
>> only readable string type in python 3).
>>
>> This brings two major problems:
>>
>> * numpy code has to go through hoops to open and read files as binary data
>>   to load text into a bytes array, and does not play well with users
>>   providing string (unicode) arguments
>>
>> * the repr of these arrays shows strings as b'text' instead of 'text', which
>>   breaks doctests of software built on numpy
>>
>> What I'm trying to do is make dtypes 'S' and 'U' equivalent (NPY_STRING and
>> NPY_UNICODE).
>>
>> Now the question. Keeping 'S' and 'U' as separate dtypes (but same internal
>> implementation) will provide the best backwards compatibility, but is more
>> cumbersome to implement.
> I am not sure how that can be possible. Those types are fundamentally
> different in how they store their data. String types use one byte per
> character, unicode types will use 4 bytes per character. You can maybe
> default to unicode in more cases in python 3, but you cannot make them
> identical internally.

BTW, by identical I mean having two externally visible types, but a common
implementation in python 3 (that of NPY_UNICODE).

The equally sane but not backwards-compatible option (I'm asking if this is
acceptable) is to only retain 'S' (NPY_STRING), but with the NPY_UNICODE
implementation, and to make 'U' (and np.unicode_) an alias for 'S' (and
np.string_).

Cheers,
  Lluis
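
For reference, a minimal sketch of the storage difference Sebastian describes
and of the bytes-vs-str behaviour from the original message. It assumes numpy
under python 3; the variable names are illustrative only and nothing here is
part of the proposed change.

```python
import numpy as np

# 'S' (NPY_STRING) stores raw bytes: one byte per character.
s_arr = np.array(["text"], dtype="S4")

# 'U' (NPY_UNICODE) stores UCS-4 code points: four bytes per character.
u_arr = np.array(["text"], dtype="U4")

print(s_arr.dtype.itemsize)  # 4  (4 characters * 1 byte)
print(u_arr.dtype.itemsize)  # 16 (4 characters * 4 bytes)

# Under python 3 the elements of an 'S' array are bytes objects,
# which is why their repr carries the b'...' prefix.
print(isinstance(s_arr[0], bytes))  # True
print(isinstance(u_arr[0], str))    # True

# Converting between the two today is an explicit cast
# (ASCII-only data assumed here).
as_unicode = s_arr.astype("U4")
print(as_unicode[0] == "text")  # True
```

The fourfold difference in itemsize is the reason a shared implementation
would have to pick one storage layout (that of NPY_UNICODE in the proposal).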