On May 6, 2005, at 3:17 AM, M.-A. Lemburg wrote:

> You've got that wrong: Python lets you choose UCS-4 -
> UCS-2 is the default.
>
> Note that Python's Unicode codecs UTF-8 and UTF-16
> are surrogate aware and thus support non-BMP code points
> regardless of the build type: A UCS2-build of Python will
> store a non-BMP code point as UTF-16 surrogate pair in the
> Py_UNICODE buffer while a UCS4 build will store it as a
> single value. Decoding is surrogate aware too, so a UTF-16
> surrogate pair in a UCS2 build will get treated as single
> Unicode code point.
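For concreteness, here is a minimal sketch of the surrogate-pair arithmetic that the quoted paragraph relies on. It assumes a modern Python 3 interpreter, and the helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical, not part of Python's API. It only illustrates how UTF-16 represents a non-BMP code point; it does not show how any particular build lays out its Py_UNICODE buffer.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a non-BMP code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP.
cp = 0x1D11E
high, low = to_surrogate_pair(cp)           # (0xD834, 0xDD1E)

# The UTF-16 codec produces the same two 16-bit code units.
encoded = chr(cp).encode("utf-16-be")
units = struct.unpack(">2H", encoded)

# Decoding is surrogate aware: the pair comes back as one code point.
decoded = encoded.decode("utf-16-be")
```

The point of the sketch is that a pair of 16-bit units carries the full code point, so nothing is lost as long as both units are stored; strict UCS-2, by contrast, has no such pairing.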
If this is the case, then we're clearly misleading users. If the configure script says UCS-2, then as a user I would assume that surrogate pairs would *not* be encoded, because I chose UCS-2, and UCS-2 doesn't support them. I would assume that any UTF-16 string I read would be transcoded into the internal type (UCS-2), and that information would be lost. If this is not the case, then what does the configure option mean?

--
Nick
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev