Nicholas Bastin wrote:
>
> On May 6, 2005, at 5:21 PM, Shane Hathaway wrote:
>> Wait... are you saying a Py_UNICODE array contains either UTF-16 or
>> UTF-32 characters, but never UCS-2? That's a big surprise to me. I may
>> need to change my PyXPCOM patch to fit this new understanding. I tried
>> hard to not care how Python encodes unicode characters, but details like
>> this are important when combining two frameworks with different unicode
>> APIs.
>
>
> Yes. Well, in as much as a large part of UTF-16 directly overlaps
> UCS-2, then sometimes unicode strings contain UCS-2 characters.
> However, characters which would not be legal in UCS-2 are still encoded
> properly in python, in UTF-16.
>
> And yes, I feel your pain, that's how I *got* into this position.
> Mapping from external unicode types is an important aspect of writing
> extension modules, and the documentation does not help people trying to
> do this. The fact that python's internal encoding is variable is a huge
> problem in and of itself, even if that was documented properly. This is
> why tools like Xerces and ICU will be happy to give you whatever form of
> unicode strings you want, but internally they always use UTF-16 - to
> avoid having to write two internal implementations of the same
> functionality. If you look up and down Objects/unicodeobject.c you'll
> see a fair amount of code written a couple of different ways (using
> #ifdef's) because of the variability in the internal representation.
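
For anyone following along, the overlap between UCS-2 and UTF-16 described above can be illustrated with a small sketch. It assumes a modern Python 3 interpreter and goes through the "utf-16-le" codec rather than inspecting a Py_UNICODE array, so it shows the encoding rule only, not what any particular build stores internally. The helper names utf16_code_units and is_valid_ucs2 are made up for this example.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 uses to encode text."""
    raw = text.encode("utf-16-le")
    # Every UTF-16 code unit is two bytes, so unpack them as unsigned shorts.
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_valid_ucs2(text):
    """True if every character fits in one 16-bit unit (BMP only)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp = "\u20ac"         # EURO SIGN, inside the Basic Multilingual Plane
astral = "\U0001d11e"  # MUSICAL SYMBOL G CLEF, outside the BMP

# One code unit, identical to what UCS-2 would store.
print([hex(u) for u in utf16_code_units(bmp)])     # ['0x20ac']
# Two code units (a surrogate pair); not representable in UCS-2.
print([hex(u) for u in utf16_code_units(astral)])  # ['0xd834', '0xdd1e']
print(is_valid_ucs2(bmp), is_valid_ucs2(astral))   # True False
```

Code that bridges to another framework's 16-bit string type would need to count code units, as utf16_code_units does, rather than code points whenever characters outside the BMP can appear.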
Ok. Thanks for helping me understand where Python is WRT unicode. I can
work around the issues (or maybe try to help solve them) now that I know
the current state of affairs. If Python correctly handled UTF-16 strings
internally, we wouldn't need the UCS-4 configuration switch, would we?

Shane