Re: [Numpy-discussion] Unicode revisited

Charles R Harris Fri, 03 Aug 2012 19:05:25 -0700

On Fri, Aug 3, 2012 at 7:03 PM, Travis Oliphant <tra...@continuum.io> wrote:


> Hey all,
>
> Ondrej has been working hard with feedback from many others on improving
> Unicode support in NumPy (especially for Python 3.3).   Looking at what
> Python has done in Python 3.3 (PEP 393) and chatting on the Python issue
> tracker with the author of that PEP has made me wonder if we aren't "doing
> the wrong thing" in NumPy quite often.
>
> Basically, NumPy only supports UTF-32 in its Unicode representation.
> All bytes in NumPy arrays should be either UTF-32LE or UTF-32BE. This is
> all pretty easy to understand as long as you stick with NumPy arrays only.
>
> The difficulty starts when you start to interact with the unicode array
> scalar (which is the same data-structure exactly as a Python unicode object
> with a different type-name --- numpy.unicode_).    However, I overlooked
> the "encoding" argument to the standard "unicode" constructor which might
> have simplified what we are doing.    If I understand things correctly,
> now, all we need to do is to "decode" the UTF-32LE or UTF-32BE raw bytes in
> the array (depending on the dtype) into a unicode object.
>
> This is easily accomplished with numpy.unicode_(<bytes object>,
> 'utf_32_be' or 'utf_32_le'). There is also an "encoding" equivalent to
> go from the Python unicode object to the bytes representation in the NumPy
> array. I think this is what we should be doing in most of the places and
> it should considerably simplify the Unicode code in NumPy --- possibly
> eliminating the ucsnarrow.c file.
>
> Am I missing something?
>
>
I can't comment on the rest, but I'd be happy to see the end of the
ucsnarrow.c file. It needs more work to be properly generalized and if
there is a way to avoid that, so much the better.
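
For reference, the decode/encode round trip described in the quoted message
can be sketched in plain Python 3 roughly as follows. This is only an
illustration under some assumptions: it uses the built-in str and bytes types
rather than numpy.unicode_, the helper names decode_element and encode_element
are made up for the example, and it assumes each array element is a
fixed-width buffer of 4 bytes per character, padded with trailing nulls.

```python
def _codec(byteorder):
    # '<' selects little-endian, anything else big-endian (hypothetical convention).
    return "utf_32_le" if byteorder == "<" else "utf_32_be"


def decode_element(raw, byteorder):
    """Decode one fixed-width UTF-32 element into a Python str."""
    # Trailing null code points are treated as padding and stripped.
    return raw.decode(_codec(byteorder)).rstrip("\x00")


def encode_element(text, byteorder, nchars):
    """Encode a str into a null-padded UTF-32 buffer of nchars code points."""
    # The explicit-endian codecs do not emit a byte order mark.
    raw = text.encode(_codec(byteorder))
    if len(raw) > 4 * nchars:
        raise ValueError("text does not fit in the element")
    return raw.ljust(4 * nchars, b"\x00")


# Round trip, including a code point outside the Basic Multilingual Plane.
buf = encode_element("h\u00e9\U0001F600", "<", 5)
text = decode_element(buf, "<")
```

Because every code point takes exactly four bytes in UTF-32, nothing in this
sketch depends on whether the interpreter was a narrow or wide build.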

Chuck
