Re: [Numpy-discussion] String type again.

Charles R Harris Thu, 17 Jul 2014 04:01:33 -0700

On Tue, Jul 15, 2014 at 5:26 AM, Sebastian Berg <sebast...@sipsolutions.net>
wrote:


> On Sat, 2014-07-12 at 12:17 -0500, Charles R Harris wrote:
> > As previous posts have pointed out, NumPy's `S` type is currently
> > treated as a byte string, which leads to more complicated code in
> > Python 3. OTOH, the unicode type is stored as UCS4, which consumes a
> > lot of space, especially for ASCII strings. This note proposes to
> > adapt the existing 'a' type letter, currently aliased to
> > 'S', as a new fixed-encoding dtype. Python 3.3 introduced two one-byte
> > internal representations for unicode strings, ASCII and Latin-1. ASCII
> > has the advantage that it is a subset of UTF-8, whereas Latin-1 has a
> > few more symbols. Another possibility is to just make it a UTF-8
> > encoding, but I think this would involve more overhead as Python would
> > need to determine the maximum character size. These are just
> > preliminary thoughts; comments are welcome.
> >
>
> Just wondering, couldn't we have a type which actually has an
> (arbitrary, Python-supported) encoding (and "bytes" might even just be a
> special case of no encoding)? Basically, store bytes, and on access do
> element[i].decode(specified_encoding) and on storing do element[i] =
> value.encode(specified_encoding).
>
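
A minimal sketch of that idea, assuming a thin Python-level wrapper around
an ordinary `S` array rather than a real dtype. The class name
EncodedStringArray and its parameters are hypothetical and exist only for
illustration; nothing like it is in NumPy.

```python
import numpy as np


class EncodedStringArray:
    """Hypothetical wrapper: fixed-width byte storage plus a chosen encoding."""

    def __init__(self, size, itemsize, encoding="utf-8"):
        self.encoding = encoding
        # Raw storage is a plain fixed-width bytes ('S') array.
        self._data = np.zeros(size, dtype="S%d" % itemsize)

    def __getitem__(self, i):
        # Decode on access, as in element[i].decode(specified_encoding).
        return self._data[i].decode(self.encoding)

    def __setitem__(self, i, value):
        # Encode on store, as in element[i] = value.encode(specified_encoding).
        raw = value.encode(self.encoding)
        if len(raw) > self._data.dtype.itemsize:
            # An 'S' array would otherwise truncate silently.
            raise ValueError("encoded value does not fit in the item size")
        self._data[i] = raw


arr = EncodedStringArray(2, itemsize=8, encoding="utf-8")
arr[0] = "naïve"   # 6 bytes in UTF-8, 5 characters
arr[1] = "abc"
first = arr[0]
```

Note that the item size counts encoded bytes, not characters, so with a
variable-width encoding the number of characters that fit depends on the
text.
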
> There is always the never-ending small issue of trailing null bytes. If
> we want to be fully compatible, such a type would have to store the
> string length explicitly to support trailing null bytes.
>

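UTF-8 encoding works with null bytes: a zero byte occurs in UTF-8 only as
the encoding of U+0000, never as part of a multi-byte character. That is
one of the reasons it is so popular.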
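
A small check of this, as a sketch only. The sample strings and the `S8`
item size are arbitrary choices for illustration. It also shows the one
case that null padding cannot preserve: a string that itself ends in
U+0000.

```python
import numpy as np

text = "aé€"

# UTF-8 never emits a zero byte for characters other than U+0000.
utf8 = text.encode("utf-8")
has_zero_utf8 = 0 in utf8

# A fixed-width encoding such as UTF-32 is full of zero bytes.
utf32 = text.encode("utf-32-le")
has_zero_utf32 = 0 in utf32

# Null-padded storage in an 'S' array round-trips UTF-8 text.
stored = np.array([utf8], dtype="S8")
roundtrip = stored[0].decode("utf-8")

# A trailing U+0000 is indistinguishable from padding and is dropped.
with_nul = "ab\x00"
lost = np.array([with_nul.encode("utf-8")], dtype="S8")[0].decode("utf-8")
```

Here `roundtrip` equals `text`, while `lost` comes back as "ab" without its
trailing null character, which is the case an explicit length would fix.
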

Chuck

_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
