[Duplicate message to honor the missing CC of bug-gnu-libic...@gnu.org]

On Jan 29 08:10, Eric Blake wrote:
> On 01/29/2011 05:30 AM, Corinna Vinschen wrote:
> >> But when characters outside the basic plane, such as
> >> U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
> >> values, values of type wchar_t don't correspond to ISO/IEC 10646
> >> characters.
> >> (Or maybe I'm underestimating what "coded representations" means...?)
> >
> > I don't read that from your above quote.  The core is that the *type*
> > wchar_t is a *coded* *representation* of the characters defined in
> > 10646.  At no point it says that a single wchar_t value must represent a
> > single character from 10646.  So I take it that UTF-16 is a valid, coded
> > representation of the characters from 10646.
>
> POSIX is clear that wchar_t must be wide enough so that 1 wchar_t is one
> character.  Which limits a 2-byte wchar_t to just the Unicode basic
> plane.  There's nothing cygwin can do about this other than break LOTS
> of ABI to support a 4-byte wchar_t to supply all of Unicode.
>
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_03
>
> "All wide-character codes in a given process consist of an equal number
> of bits. This is in contrast to characters, which can consist of a
> variable number of bytes. The byte or byte sequence that represents a
> character can also be represented as a wide-character code.
> Wide-character codes thus provide a uniform size for manipulating text
> data."
>
> So, using UTF-16 surrogate encodings for characters outside the basic
> plane violates POSIX, but it's the best we can do for those characters.

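To make the surrogate mechanism under discussion concrete, here is a minimal sketch in C of how a code point outside the basic plane ends up as two 16-bit units. It is not Cygwin's or newlib's actual code. The function names utf16_encode and utf16_count_chars are hypothetical, and uint16_t is used as a stand-in for a 2-byte wchar_t so the sketch behaves the same on any platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not a valid Unicode scalar value. uint16_t stands in for a 2-byte
 * wchar_t here; that substitution is an assumption of this sketch. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);

    /* Reject values beyond U+10FFFF and lone surrogate code points. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* Basic plane: one unit, so one unit equals one character. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into two halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low surrogate  */
    return 2;
}

/* Hypothetical helper: count characters (code points) in n UTF-16
 * units. A well-formed surrogate pair counts as one character, so the
 * result can be smaller than n. */
size_t utf16_count_chars(const uint16_t *s, size_t n)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        int is_high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        if (is_high && i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++; /* skip the low surrogate of the pair */
        count++;
    }
    return count;
}
```

For U+12345 the encoder yields the two units 0xD808 0xDF45, and utf16_count_chars reports one character for those two units. That mismatch between unit count and character count is exactly the point where a 2-byte wchar_t departs from the POSIX "uniform size" wording quoted above.
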
Right, and we discussed this already on this list.  Or the developer
list, I don't remember.  Maybe we should have stuck to the base plane
and only used UCS-2 to be more POSIX compatible.  I have to admit that I
was more interested in getting all (or as much as possible) of Unicode
working than in following POSIX to the last word in this regard.  And I
wanted to make sure that East Asian users would get all of the
characters they use, and there *are* CJK ideographs in the 0x2xxxx
plane.

However, the POSIX definition doesn't contradict what I said about the
definition of __STDC_ISO_10646__ as far as I'm concerned.

> Someday when gcc has better support for C+1x 16- and 32-bit characters
> (regardless of the sizing of wchar_t), then we can add all the new
> 32-bit character APIs that use Unicode unimpeded, without breaking
> existing ones that use wchar_t.

Yeah, that's what I'm waiting for as well.  But for the time being, I'm
confident that we have the best compromise possible at the time.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple