On 02/22/2012 07:43 PM, John Kearney wrote:
> ^ caveat: you can represent the full range up to 0x10ffff in UTF-16,
> you just need 2 UTF-16 code units. Check out the latest version of
> unicode.c for an example of how.

Yes, and Cygwin actually does this.

A strict reading of POSIX says that wchar_t must be wide enough to hold
every supported character, which technically limits a POSIX-compliant
app with a 16-bit wchar_t to the Basic Multilingual Plane.  But Cygwin
exploits a loophole in the POSIX wording: POSIX does not require that
every bit pattern be a valid character.  So on paper, instead of
treating all 65536 patterns as valid characters, Cygwin lists the
surrogate halves (0xd800 to 0xdfff) as non-characters, which makes any
use of them undefined behavior per POSIX.  In practice, it treats them
as surrogate pairs.  That gives you the full Unicode character set, but
it reintroduces with wchar_t the same headaches that multibyte
characters caused with 'char': you are back to dealing with
variable-sized character elements.
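
To make the surrogate arithmetic concrete, here is a minimal sketch of
how a code point above 0xffff is split into two 16-bit units and put
back together.  The function names are made up for illustration; this
is not Cygwin's code or the unicode.c mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16.
 * Returns the number of 16-bit units written (1 or 2), or 0 if cp is
 * not encodable (out of range, or itself a surrogate value). */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;          /* basic plane: one unit */
        return 1;
    }
    cp -= 0x10000;                      /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}

/* Hypothetical helper: combine a high/low surrogate pair.
 * The caller must already have checked that hi is in 0xD800..0xDBFF
 * and lo is in 0xDC00..0xDFFF. */
uint32_t utf16_decode_pair(uint16_t hi, uint16_t lo)
{
    return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                      | (uint32_t)(lo - 0xDC00));
}
```

Any code that walks such a string one element at a time has to check
for the 0xd800 to 0xdfff range first, which is exactly the
variable-width headache described above.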

Furthermore, the mess of 16-bit vs. 32-bit wchar_t is one of the reasons
why C11 introduced two new character types, char16_t and char32_t,
designed to cover the full Unicode set regardless of what size wchar_t
is.  It will be interesting to see how the next version of POSIX adopts
the C11 additions and retrofits the wide-character functions that are
in POSIX but not in C99 to handle the new character types.
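
As a rough illustration of the C11 types, the sketch below stores the
same character (U+1F600, picked arbitrarily) in a char16_t string and
a char32_t string.  It assumes a C11 compiler that provides <uchar.h>
and defines __STDC_UTF_16__ and __STDC_UTF_32__, so that u"" and U""
literals really are UTF-16 and UTF-32; the variable and function names
are made up for the example.

```c
#include <stddef.h>
#include <uchar.h>

/* Assumes __STDC_UTF_16__ / __STDC_UTF_32__ are defined, i.e. the
 * literals below are encoded as UTF-16 and UTF-32 respectively. */
static const char16_t smiley16[] = u"\U0001F600";
static const char32_t smiley32[] = U"\U0001F600";

/* Number of elements in each string, not counting the terminator. */
size_t smiley16_units(void)
{
    return sizeof smiley16 / sizeof smiley16[0] - 1;
}

size_t smiley32_units(void)
{
    return sizeof smiley32 / sizeof smiley32[0] - 1;
}
```

Under those assumptions the char16_t string needs two elements (a
surrogate pair) and the char32_t string needs one, independent of how
wide wchar_t happens to be on the platform.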

-- 
Eric Blake   ebl...@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
