On Fri, Aug 3, 2012 at 7:03 PM, Travis Oliphant <tra...@continuum.io> wrote:
> Hey all,
>
> Ondrej has been working hard with feedback from many others on improving
> Unicode support in NumPy (especially for Python 3.3). Looking at what
> Python has done in Python 3.3 (PEP 393) and chatting on the Python issue
> tracker with the author of that PEP has made me wonder if we aren't "doing
> the wrong thing" in NumPy quite often.
>
> Basically, NumPy only supports UTF-32 in its Unicode representation.
> All bytes in NumPy arrays should be either UTF-32LE or UTF-32BE. This is
> all pretty easy to understand as long as you stick with NumPy arrays only.
>
> The difficulty starts when you start to interact with the unicode array
> scalar (which is the same data-structure exactly as a Python unicode object
> with a different type-name --- numpy.unicode_). However, I overlooked
> the "encoding" argument to the standard "unicode" constructor which might
> have simplified what we are doing. If I understand things correctly,
> now, all we need to do is to "decode" the UTF-32LE or UTF-32BE raw bytes in
> the array (depending on the dtype) into a unicode object.
>
> This is easily accomplished with numpy.unicode_(<bytes object>,
> 'utf_32_be' or 'utf_32_le'). There is also an "encoding" equivalent to
> go from the Python unicode object to the bytes representation in the NumPy
> array. I think this is what we should be doing in most of the places and
> it should considerably simplify the Unicode code in NumPy --- eliminating
> possibly the ucsnarrow.c file.
>
> Am I missing something?
>

I can't comment on the rest, but I'd be happy to see the end of the
ucsnarrow.c file. It needs more work to be properly generalized and if
there is a way to avoid that, so much the better.

Chuck
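
For anyone following along, the decode/encode round trip described in the quoted message can be sketched roughly as below. This is only an illustration under a few assumptions, not what NumPy does internally: it assumes Python 3, uses plain bytes.decode / str.encode rather than the numpy.unicode_ constructor, and the helper names (utf32_codec, decode_item, encode_item) are made up for this example.

```python
import sys
import numpy as np


def utf32_codec(dtype):
    """Pick the Python codec matching a NumPy unicode dtype's byte order.

    Hypothetical helper for illustration only.
    """
    order = dtype.byteorder
    if order == "=":  # native order: resolve against the host machine
        order = "<" if sys.byteorder == "little" else ">"
    return "utf_32_le" if order == "<" else "utf_32_be"


def decode_item(raw, dtype):
    """Decode one array element's raw bytes into a Python str."""
    # NumPy pads short strings with NUL code points; drop the padding.
    return raw.decode(utf32_codec(dtype)).rstrip("\x00")


def encode_item(text, dtype):
    """Encode a Python str into the fixed-width bytes of one element."""
    raw = text.encode(utf32_codec(dtype))
    if len(raw) > dtype.itemsize:
        raise ValueError("string does not fit in dtype %s" % dtype.str)
    return raw.ljust(dtype.itemsize, b"\x00")


# Same strings stored in both byte orders, including a non-BMP code point.
arr_le = np.array(["héllo", "\U0001F600"], dtype="<U5")
arr_be = arr_le.astype(">U5")

for arr in (arr_le, arr_be):
    size = arr.dtype.itemsize
    buf = arr.tobytes()
    items = [decode_item(buf[i:i + size], arr.dtype)
             for i in range(0, len(buf), size)]
    print(arr.dtype.str, items)
```

Since every element is a fixed 4 bytes per code point, the non-BMP character needs no surrogate handling on this path, which seems to be the sort of work ucsnarrow.c was doing for narrow builds.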