Re: builtin printf behaves incorrectly with "c and 'c character-value arguments

Rich Felker Mon, 05 Nov 2007 11:01:23 -0800

On Mon, Nov 05, 2007 at 09:10:29AM -0500, Chet Ramey wrote:
> Rich Felker wrote:
> > $ printf %d\\n \'À
> > -61
> > (expected 192)
> > 
> > This should be 192 regardless of locale on any system where wchar_t
> > values are ISO-10646/Unicode. Bash is incorrectly reading the first
> > byte of the UTF-8 which happens to be -61 when interpreted as signed
> > char; on a Latin-1 based locale it will probably give -64 instead.
> > 
> > Both POSIX and common sense are clear that the numeric values
> > resulting from 'c should be the wchar_t value of c and not the value
> > of the first byte of the multibyte character; from the SUSv3 printf(1)
> > documentation:
> > 
> >      Note that in a locale with multi-byte characters, the value of a
> >      character is intended to be the value of the equivalent of the
> >      wchar_t representation of the character as described in the
> >      System Interfaces volume of IEEE Std 1003.1-2001.
> > 
> > Language lawyers could argue that on 'single-byte' locales perhaps the
> > byte value should be used; however, strictly speaking a single-byte
> > locale is simply a special case of a multi-byte one, and sanity should
> > win in any case.
> 
> You're correct that the bash printf should understand multibyte characters
> in a multibyte locale, but not that returning a multibyte character when
> a user hasn't asked for one by setting the locale is more "sane."


I'm not sure what you mean. For a Latin-1 locale the byte value
(taken as unsigned) and the wchar_t value are the same, but if the
locale is a different legacy locale they differ, and the wchar_t
value (the Unicode scalar value on systems with __STDC_ISO_10646__
defined) is what needs to be returned. If you're doubtful about the
intent of the standard, why not file a request for interpretation?
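
To make the distinction concrete, here is a rough Python sketch of the
two candidate values. It is only an illustration, not bash's code:
char_values is a made-up helper, Python's codecs stand in for the
locale's encoding, and ord() stands in for the wchar_t value on the
assumption that wchar_t holds ISO-10646 code points.

```python
def char_values(raw: bytes, encoding: str) -> tuple:
    """Return (first byte as signed char, code point) for one encoded character.

    Hypothetical helper for illustration; 'encoding' plays the role of the
    locale's charset.
    """
    first = raw[0]
    # Reinterpret the first byte the way a signed 8-bit char would see it.
    signed_first = first - 256 if first >= 128 else first
    text = raw.decode(encoding)
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    # ord() gives the Unicode code point, assumed equal to the wchar_t value.
    return signed_first, ord(text)


if __name__ == "__main__":
    samples = [
        ("À", "utf-8"),      # two bytes, 0xC3 0x80
        ("À", "latin-1"),    # one byte, 0xC0
        ("Ł", "iso8859_2"),  # one byte, 0xA3, in a non-Latin-1 legacy charset
    ]
    for ch, enc in samples:
        signed_first, code_point = char_values(ch.encode(enc), enc)
        print(f"{ch!r} in {enc}: first byte {signed_first}, code point {code_point}")
```

For À this gives a code point of 192 in both encodings, while the
signed first byte is -61 in UTF-8 and -64 in Latin-1. For Ł in
ISO-8859-2 the byte (0xA3) and the code point (321) do not agree even
when the byte is read as unsigned, which is the legacy-locale case I
mean.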

Rich
