Hermann Peifer <[EMAIL PROTECTED]> wrote:
> Jim wrote:
>> Hermann Peifer <[EMAIL PROTECTED]> wrote:
>>
>>> printf \uHHHH is expected to print Unicode chars. This work fine in
>>> most cases, but some legal code points are reported as errors: values
>>> in the ASCII range and C1 control chars, and values between
>>> U+D800..U+DFFF
>>>
>>> I would say that this behaviour is rather a bug than a feature.
>>>
>>
>> Thanks for the report, but this is not some arbitrary restriction,
>> but rather conformance to the standard (C99, ISO/IEC 10646) for
>> "universal character name" syntax:
>>
>>   http://www.open-std.org/jtc1/sc22/wg14/www/docs/n717.htm
>>
>> Here's part of printf.c, with a comment that probably came from
>> a version of N717:
>>
>>   /* A universal character name shall not specify a character short
>>      identifier in the range 00000000 through 00000020, 0000007F through
>>      0000009F, or 0000D800 through 0000DFFF inclusive. A universal
>>      character name shall not designate a character in the required
>>      character set. */
>>   if ((uni_value <= 0x9f
>>        && uni_value != 0x24 && uni_value != 0x40 && uni_value != 0x60)
>>       || (uni_value >= 0xd800 && uni_value <= 0xdfff))
>>     error (EXIT_FAILURE, 0, _("invalid universal character name \\%c%0*x"),
>>            esc_char, (esc_char == 'u' ? 4 : 8), uni_value);
>>
>>
>>> /usr/bin/printf: invalid universal character name \u0000
>>> /usr/bin/printf: invalid universal character name \u0001
>>>
>> ...
>>
>> I can understand that you'd find the restriction surprising,
>> but I wouldn't call it a bug.
>>
> Thanks for your swift reply. (BTW: are mails to [email protected]
> not copied to gnu.utils.bug?)

No. That's a separate list.

> I do acknowledge that C0 and C1 control chars are some sort of a
> border case. It is true that the Unicode standard does not assign
> *normative names* for them but rather adds the placeholder "<control>"
> as a dummy name (btw, this was different in earlier versions of
> Unicode). However, all C0 and C1 *code points* are at least included
> in:
>
> http://www.unicode.org/charts/PDF/U0000.pdf
> http://www.unicode.org/charts/PDF/U0080.pdf
> http://www.unicode.org/Public/5.1.0/ucd/UnicodeData.txt
>
> And I didn't expect /usr/bin/printf to worry about normative or
> non-normative names of Unicode chars, but rather print the chars
> themselves.
>
> If we let the control chars question aside, it is still hard to
> believe that it is not a bug that almost all ASCII chars 0020..007e
> lead to EXIT_FAILURE. This rule is more than peculiar, to say the
> least and it is also inconsistent with its own comment:
>
>   if ((uni_value <= 0x9f
>        && uni_value != 0x24 && uni_value != 0x40 && uni_value != 0x60)
>
>
> Only DOLLAR SIGN, COMMERCIAL AT and GRAVE ACCENT are legal in the
> range 0x00..0x9f ?
>
> I still think that these 92 cases are bugs, rather than anything else:
>
> /usr/bin/printf: invalid universal character name \u0020
> /usr/bin/printf: invalid universal character name \u0021
...

I don't know the motivation for those exceptions.
Paul Eggert added this feature 8 years ago, so things
may have changed.

FYI, there are plenty of odd-looking exceptions in this domain.
For a taste, see the function, ucn_valid_in_identifier, in
gcc's libcpp/charset.c

That code determines that this is valid C99 code
(with -fextended-identifiers):

    int ok\u09CB = 1;

but this is not:

    int not_ok\u09FF = 1;

_______________________________________________
Bug-coreutils mailing list
[email protected]
http://lists.gnu.org/mailman/listinfo/bug-coreutils
