On 02/22/2012 07:43 PM, John Kearney wrote:
> ^ caveat: you can represent the full 0x10ffff in UTF-16, you just need 2
> UTF-16 characters. Check out the latest version of unicode.c for an
> example how.
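For anyone following along, here is a minimal sketch of the two-unit encoding being referred to. It is not taken from unicode.c; the helper names utf16_encode and utf16_decode_pair are made up for illustration, and the arithmetic is just the standard UTF-16 surrogate-pair mapping.

```c
#include <stdint.h>

/* Hypothetical helper (not from unicode.c): encode one Unicode code point
 * as UTF-16. Returns the number of 16-bit units written (1 or 2), or 0 if
 * cp is not encodable (above 0x10FFFF or itself a surrogate value). */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        /* Basic plane: a single unit holds the code point directly. */
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: split the 20-bit offset into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low (trailing) surrogate */
    return 2;
}

/* Hypothetical helper: recombine a valid high/low surrogate pair.
 * The caller must ensure hi is in 0xD800..0xDBFF and lo in 0xDC00..0xDFFF. */
static uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
}
```

With this mapping, 0x10ffff comes out as the pair 0xdbff 0xdfff, and anything below 0x10000 (other than the surrogate range itself) stays a single unit.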
Yes, and Cygwin actually does this. A strict reading of POSIX states that wchar_t must be wide enough for all supported characters, which technically limits a POSIX-compliant app with a 16-bit wchar_t to just the basic plane. But Cygwin has exploited a loophole in the POSIX wording: POSIX does not require that all bit patterns be valid characters.

So on paper, rather than representing all 65536 patterns as valid characters, the Cygwin implementation lists the values used in surrogate halves (0xd800 to 0xdfff) as non-characters, so that using them triggers undefined behavior per POSIX. In practice, it treats them as surrogate pairs. That gives you the full Unicode character set, but it reintroduces with wchar_t the headaches that multibyte characters had with 'char': you are back to dealing with variable-sized character elements.

Furthermore, the mess of 16-bit vs. 32-bit wchar_t is one of the reasons why C11 has introduced two new character types, char16_t and char32_t, designed to map to the full Unicode set regardless of what size wchar_t is. It will be interesting to see how the next version of POSIX takes the additions of C11 and retrofits the other wide-character functions that are in POSIX but not in C99 to handle the new character types.

--
Eric Blake ebl...@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org