On 05/20/2011 02:30 PM, Linda Walsh wrote:
> i.e. it's showing me a 16-bit value: 0x203c, which I thought would be the
> wide-char value for the double-exclamation. Going from the wchar definition
> on NT, it is a 16-bit value. Perhaps it is different under POSIX? But
> 0x203c taken as 32 bits with 2 high bytes of zeros would seem to specify
> the same codepoint for the Dbl-Excl.
POSIX allows wchar_t to be either 2-byte or 4-byte, although only a 4-byte
wchar_t can properly represent all of Unicode (with a 2-byte wchar_t, as on
Windows or Cygwin, you are inherently restricted from using any Unicode
character larger than 0xffff if you want to maintain POSIX compliance).

>> Since there is no way to produce a word containing a NUL character it is
>> impossible to support %lc in any useful way.
> ----
> That's annoying. How can one print out unicode characters
> that are supposed to be 1 char long?

I think you are misunderstanding the difference between wide characters
(exactly one wchar_t per character) and multi-byte characters (1 or more
char [byte] per character). Unicode can be represented in two different
ways. One way is with wide characters (every character represents exactly
one Unicode codepoint, and codepoints < 0x100 have embedded NUL bytes if
you view the memory containing those wchar_t as an array of bytes). The
other way is with multi-byte encodings, such as UTF-8 (every character
occupies a variable number of bytes, and the only character that can
contain an embedded NUL byte is the NUL character at codepoint 0).

Bash _only_ uses multi-byte characters for input and output, while %lc
deals only in wchar_t. Since wchar_t output is not useful for a shell that
does not do input in wchar_t, that explains why bash's printf need not
support %lc. POSIX doesn't require it, at any rate, but it also doesn't
forbid it as an extension.

> This isn't just a bash problem given how well most of the unix "character"
> utils work with unicode -- that's something that really needs to be solved
> if those character utils are going to continue to be _as useful_ in the
> future.
> Sure they will have their current functionality which is of use in many
> ways, but for anyone not processing ASCII text it becomes a problem, but
> this isn't really a bash issue.
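The wide-character vs. multi-byte distinction above can be seen directly from the shell. A minimal sketch, assuming bash >= 4.2 (for the \uXXXX escape), a UTF-8 locale, and the od(1) utility:

```shell
# U+203C (double exclamation mark) as bash emits it: a three-byte
# UTF-8 sequence with no embedded NUL bytes.
printf '\u203c' | od -An -tx1
# in a UTF-8 locale: e2 80 bc

# The same three UTF-8 bytes spelled out explicitly:
printf '\xe2\x80\xbc' | od -An -tx1
# e2 80 bc

# By contrast, the same codepoint as a 4-byte wide character would be
# the bytes 00 00 20 3c (big-endian); those embedded NUL bytes are
# exactly why a wchar_t value cannot travel through a shell word.
```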
Most utilities that work with Unicode work with UTF-8 (that is, with
multi-byte characters occupying a variable number of bytes), and NOT with
wide characters (that is, with all characters occupying a fixed width).
But you can switch between encodings using the iconv(1) utility, so
converting from one encoding type to another shouldn't really be a problem
in practice.

> That said, it was my impression that a wchar was 16-bits (at least it
> is on MS. Is it different under POSIX?

POSIX allows a 16-bit wchar_t, but if you have a 16-bit wchar_t, you
cannot support all of Unicode.

-- 
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
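As a concrete illustration of the iconv(1) conversion mentioned above, here is a hedged sketch (assuming an iconv build with UTF-32BE support, plus od(1)) round-tripping U+203C between the two representation styles:

```shell
# UTF-8 (multi-byte, variable width) -> UTF-32BE (fixed width, one
# 4-byte unit per codepoint, analogous to a 4-byte wchar_t):
printf '\xe2\x80\xbc' | iconv -f UTF-8 -t UTF-32BE | od -An -tx1
# 00 00 20 3c

# And back: note the fixed-width form contains NUL bytes, so it can
# only be piped between processes, never stored in a shell word.
printf '\x00\x00\x20\x3c' | iconv -f UTF-32BE -t UTF-8 | od -An -tx1
# e2 80 bc
```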