On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <sebast...@sipsolutions.net> wrote:
> On Sa, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
> > As previous posts have pointed out, Numpy's `S` type is currently
> > treated as a byte string, which leads to more complicated code in
> > python3. OTOH, the unicode type is stored as UCS4, which consumes a
> > lot of space, especially for ascii strings. This note proposes to
> > adapt the currently existing 'a' type letter, currently aliased to
> > 'S', as a new fixed encoding dtype. Python 3.3 introduced two one byte
> > internal representations for unicode strings, ascii and latin1. Ascii
> > has the advantage that it is a subset of UTF-8, whereas latin1 has a
> > few more symbols. Another possibility is to just make it an UTF-8
> > encoding, but I think this would involve more overhead as Python would
> > need to determine the maximum character size. These are just
> > preliminary thoughts, comments are welcome.
> >
> >
>
> Just wondering, couldn't we have a type which actually has an
> (arbitrary, python supported) encoding (and "bytes" might even just be a
> special case of no encoding)? Basically storing bytes and on access do
> element[i].decode(specified_encoding) and on storing element[i] =
> value.encode(specified_encoding).
>
> There is always the never ending small issue of trailing null bytes. If
> we want to be fully compatible, such a type would have to store the
> string length explicitly to support trailing null bytes.
>

UTF-8 encoding works with null bytes. That is one of the reasons it is so
popular.

Chuck
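
To make the quoted idea concrete, here is a rough sketch in plain Python of an "encode on store, decode on access" container with fixed-width, null-padded items. It is only an illustration under stated assumptions: `FixedWidthEncodedArray` and its `itemsize` parameter are hypothetical names, not a NumPy API, and stripping the padding on read is assumed here to mimic how the existing `S` type behaves.

```python
class FixedWidthEncodedArray:
    """Hypothetical fixed-width string container (not a NumPy API)."""

    def __init__(self, length, itemsize, encoding="utf-8"):
        self.itemsize = itemsize          # bytes reserved per element
        self.encoding = encoding          # any codec name Python supports
        self._length = length
        self._buf = bytearray(length * itemsize)  # zero-filled storage

    def _offset(self, i):
        if not 0 <= i < self._length:
            raise IndexError("index out of range")
        return i * self.itemsize

    def __setitem__(self, i, value):
        start = self._offset(i)
        raw = value.encode(self.encoding)
        if len(raw) > self.itemsize:
            raise ValueError("encoded value does not fit in itemsize")
        # Pad with null bytes up to the fixed item size.
        self._buf[start:start + self.itemsize] = raw.ljust(self.itemsize, b"\x00")

    def __getitem__(self, i):
        start = self._offset(i)
        raw = bytes(self._buf[start:start + self.itemsize])
        # No length is stored, so padding nulls and nulls that were part
        # of the value cannot be told apart; all trailing nulls are dropped.
        return raw.rstrip(b"\x00").decode(self.encoding)


arr = FixedWidthEncodedArray(3, itemsize=4)
arr[0] = "ab"
arr[1] = "\u00e9"      # e-acute: one character, two bytes in UTF-8
arr[2] = "ab\x00"      # value that really ends in a NUL character

first, second, third = arr[0], arr[1], arr[2]
```

In this sketch `third` comes back as `"ab"`: the trailing NUL is lost, which is the compatibility issue raised above and the reason an explicit stored length would be needed to round-trip such values. The padding itself is unambiguous for UTF-8, since a zero byte appears in UTF-8 only as the encoding of U+0000; the same stripping would corrupt encodings such as UTF-16 or UTF-32, where zero bytes occur inside ordinary characters.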
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion