On May 6, 2005, at 7:43 PM, Martin v. Löwis wrote:

> Nicholas Bastin wrote:
>> If this is the case, then we're clearly misleading users. If the
>> configure script says UCS-2, then as a user I would assume that
>> surrogate pairs would *not* be encoded, because I chose UCS-2, and it
>> doesn't support that.
>
> What do you mean by that? That the interpreter crashes if you try
> to store a low surrogate into a Py_UNICODE?

What I mean is pretty clear. UCS-2 does *NOT* support surrogate pairs. If
it did, it would be called UTF-16. If Python really supported UCS-2, then
surrogate pairs from UTF-16 inputs would either get turned into two garbage
characters, or the "I couldn't transcode this" UCS-2 code point (I don't
remember which one that is off the top of my head).

>> I would assume that any UTF-16 string I would
>> read would be transcoded into the internal type (UCS-2), and information
>> would be lost. If this is not the case, then what does the configure
>> option mean?
>
> It tells you whether you have the two-octet form of the Universal
> Character Set, or the four-octet form.

It would, if that were the case, but it's not. Setting UCS-2 in the
configure script really means UTF-16, and as such, the documentation
should reflect that.

--
Nick

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
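
To make the UCS-2 versus UTF-16 distinction above concrete, here is a minimal sketch in present-day Python 3. It is only an illustration of the two encodings, not of how a 2005 narrow build behaved internally. The helper names `utf16_code_units` and `strict_ucs2_code_units` are hypothetical, and the choice of U+FFFD (REPLACEMENT CHARACTER) as the stand-in for "I couldn't transcode this" is an assumption, not something the thread confirms.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 uses to encode `text`."""
    raw = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))


def strict_ucs2_code_units(text, replacement=0xFFFD):
    """Hypothetical strict UCS-2 encoder: exactly one unit per character.

    Characters outside the Basic Multilingual Plane cannot be represented,
    so they are replaced (assumption: with U+FFFD) and information is lost.
    """
    return [ord(ch) if ord(ch) <= 0xFFFF else replacement for ch in text]


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above U+FFFF

utf16_units = utf16_code_units(clef)
ucs2_units = strict_ucs2_code_units(clef)

print([hex(u) for u in utf16_units])  # a high and a low surrogate
print([hex(u) for u in ucs2_units])   # a single replacement unit
```

Under these assumptions, UTF-16 keeps the character as a pair of surrogate code units, while a strict UCS-2 encoder has nowhere to put it and has to substitute something else. That is the difference the configure option name glosses over.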