Re: builtin printf behaves incorrectly with "c and 'c character-value arguments

2007-11-05 Thread Chet Ramey
Rich Felker wrote:
> $ printf %d\\n \'À
> -61
> (expected 192)
> 
> This should be 192 regardless of locale on any system where wchar_t
> values are ISO-10646/Unicode. Bash is incorrectly reading the first
> byte of the UTF-8 which happens to be -61 when interpreted as signed
> char; on a Latin-1 based locale it will probably give -64 instead.
> 
> Both POSIX and common sense are clear that the numeric values
> resulting from 'c should be the wchar_t value of c and not the value
> of the first byte of the multibyte character; from the SUSv3 printf(1)
> documentation:
> 
>  Note that in a locale with multi-byte characters, the value of a
>  character is intended to be the value of the equivalent of the
>  wchar_t representation of the character as described in the
>  System Interfaces volume of IEEE Std 1003.1-2001.
> 
> Language lawyers could argue that on 'single-byte' locales perhaps the
> byte value should be used; however, strictly speaking a single-byte
> locale is simply a special case of a multi-byte one, and sanity should
> win in any case.

You're correct that the bash printf should understand multibyte characters
in a multibyte locale, but not that returning a multibyte character when
a user hasn't asked for one by setting the locale is more "sane."

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
   Live Strong.  No day but today.
Chet Ramey, ITS, CWRU    [EMAIL PROTECTED]    http://cnswww.cns.cwru.edu/~chet/




Re: try to open file descriptor for input with 'exec' fails

2007-11-05 Thread Chet Ramey
[EMAIL PROTECTED] wrote:

> Bash Version: 3.2
> Patch Level: 25
> Release Status: release
> 
> Description:
> In the following script I try to open a free file
> descriptor for input from a file.
> The script should read lines from a text file and
> write them to stdout.
> This worked fine up to my previous system, SUSE Linux 10.1
> (sorry, I don't know which version of bash it had).
> But now (openSUSE Linux 10.3) the script aborts with the
> following error message:
> 
> ./doit: line 29: exec: 3: not found
> 
> This is the line where I try to open the file descriptor
> for input:
> exec ${fd}<$inf

That form of redirection construct is not parsed the way you
are assuming.  The shell grammar has always required a literal number
immediately before the `<' or `>' to specify a particular file descriptor.
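
A minimal sketch of the failure and one common workaround. Since ${fd}
is expanded only after the line has been parsed, `exec ${fd}<$inf' runs
as `exec 3 <$inf', which matches the "exec: 3: not found" message.
Passing the line through eval makes the shell parse it again after the
expansion. The temporary file, its contents, and the variable names are
assumptions for illustration, not the reporter's script:

```shell
#!/bin/sh
# Hypothetical input file standing in for the reporter's text file.
inf=$(mktemp)
printf 'first line\nsecond line\n' > "$inf"

fd=3

# eval re-parses the command after ${fd} has been expanded, so the
# parser sees the literal text:  exec 3<"$inf"
eval "exec ${fd}<\"\$inf\""

lines=""
# The word after <& is expanded before use, so a variable works here.
while IFS= read -r line <&"$fd"; do
    lines="${lines}[${line}]"
    printf '%s\n' "$line"
done

# Close the descriptor the same way it was opened.
eval "exec ${fd}<&-"
rm -f "$inf"
```

Later bash releases (4.1 and newer) also accept `exec {fd}<"$inf"',
which lets the shell choose a free descriptor and store its number in
fd; that form is not available in the bash 3.2 discussed here.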

Chet





Re: builtin printf behaves incorrectly with "c and 'c character-value arguments

2007-11-05 Thread Rich Felker
On Mon, Nov 05, 2007 at 09:10:29AM -0500, Chet Ramey wrote:
> Rich Felker wrote:
> > $ printf %d\\n \'À
> > -61
> > (expected 192)
> > 
> > This should be 192 regardless of locale on any system where wchar_t
> > values are ISO-10646/Unicode. Bash is incorrectly reading the first
> > byte of the UTF-8 which happens to be -61 when interpreted as signed
> > char; on a Latin-1 based locale it will probably give -64 instead.
> > 
> > Both POSIX and common sense are clear that the numeric values
> > resulting from 'c should be the wchar_t value of c and not the value
> > of the first byte of the multibyte character; from the SUSv3 printf(1)
> > documentation:
> > 
> >  Note that in a locale with multi-byte characters, the value of a
> >  character is intended to be the value of the equivalent of the
> >  wchar_t representation of the character as described in the
> >  System Interfaces volume of IEEE Std 1003.1-2001.
> > 
> > Language lawyers could argue that on 'single-byte' locales perhaps the
> > byte value should be used; however, strictly speaking a single-byte
> > locale is simply a special case of a multi-byte one, and sanity should
> > win in any case.
> 
> You're correct that the bash printf should understand multibyte characters
> in a multibyte locale, but not that returning a multibyte character when
> a user hasn't asked for one by setting the locale is more "sane."

I'm not sure what you mean. For a Latin-1 locale there is no
difference, but if the locale is a different legacy locale, the
wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
defined) needs to be returned. If you're doubtful about the intent of
the standard, why not file a request for interpretation?

Rich




Re: builtin printf behaves incorrectly with "c and 'c character-value arguments

2007-11-05 Thread Chet Ramey
Rich Felker wrote:

> I'm not sure what you mean. For a Latin-1 locale there is no
> difference, but if the locale is a different legacy locale, the
> wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
> defined) needs to be returned. If you're doubtful about the intent of
> the standard, why not file a request for interpretation?

I'm not doubtful about the standard's intent.  When the user has not
chosen to use a locale that contains multibyte characters, not only
should bash not second-guess the user by returning a multibyte
character, functions such as mbrtowc or mblen/mbrlen will not return
"multibyte" values (e.g., mbrlen will return `1' and mbrtowc will return
`-61' -- converted to 195, since it's unsigned -- as its wchar value
while converting 1 character in your example).

Chet




Re: builtin printf behaves incorrectly with "c and 'c character-value arguments

2007-11-05 Thread Rich Felker
On Mon, Nov 05, 2007 at 10:23:43PM -0500, Chet Ramey wrote:
> Rich Felker wrote:
> 
> > I'm not sure what you mean. For a Latin-1 locale there is no
> > difference, but if the locale is a different legacy locale, the
> > wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
> > defined) needs to be returned. If you're doubtful about the intent of
> > the standard, why not file a request for interpretation?
> 
> I'm not doubtful about the standard's intent.  When the user has not
> chosen to use a locale that contains multibyte characters, not only
> should bash not second-guess the user by returning a multibyte
> character, functions such as mbrtowc or mblen/mbrlen will not return
> "multibyte" values (e.g., mbrlen will return `1' and mbrtowc will return
> `-61' -- converted to 195, since it's unsigned -- as its wchar value
> while converting 1 character in your example).

This 195 _is_ its value as a multibyte character in a locale with
ISO-8859-1 as its codeset. In such a locale, it's also the value of
the byte (interpreted as unsigned). So here it doesn't matter which
you use; either is equally correct.
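
The three numbers in this exchange (-61, 195 and 192) all come from the
same two bytes. A small arithmetic sketch, assuming "À" is stored as
the UTF-8 sequence C3 80 and decoding only that two-byte form; the
variable names are illustrative:

```shell
#!/bin/sh
# Leading-quote syntax: the value is that of the character after the quote.
# For an ASCII character the result is the same in every locale.
ascii=$(printf '%d' "'A")

first=0xC3    # first byte of the UTF-8 encoding of U+00C0
second=0x80   # continuation byte

unsigned=$(( $first ))          # the byte read as unsigned char
signed=$(( $first - 256 ))      # the same byte read as signed char

# Two-byte UTF-8: 110xxxxx 10yyyyyy -> xxxxxyyyyyy
codepoint=$(( (($first & 0x1F) << 6) | ($second & 0x3F) ))

printf 'ascii=%d unsigned=%d signed=%d codepoint=%d\n' \
    "$ascii" "$unsigned" "$signed" "$codepoint"
# prints: ascii=65 unsigned=195 signed=-61 codepoint=192
```

A bare byte value identifies a character only when the codeset is
known; the decoded value is what a wchar_t-based reading of 'c would
report on systems where wchar_t holds Unicode values.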

Where something different happens is if your locale has a different
codeset. For instance, in KOI8-R, there is a character "²" which is
placed on a different byte (9D) than in ISO-8859 encodings (B2). But
regardless of your locale,

$ printf %d\\n \'²

should print 178, provided that your system implementation uses the
same values for wchar_t regardless of locale. These semantics are
useful because they actually tell you something about the identity of
the character. But most importantly, it's just illogical for the
function to behave differently based on whether MB_CUR_MAX is 1 or
something greater than 1, rather than being based on the actual locale
encoding. "²" is a "²" in a KOI8-R locale just as much as it is a "²"
in a UTF-8 locale. Bash's printf should not treat the KOI8-R locale
badly just because all characters happen to fit into one byte. The
mbrtowc function will give the correct result for all locales, whether
or not they have characters that take multiple bytes to represent;
special-casing locales that don't just gives illogical (and
non-conformant!) behavior.

Rich


P.S. For my own usage I'd be plenty happy as long as the bug is fixed
in UTF-8 based locales since that's all I ever intend to use. But I
maintain that the current behavior is incorrect and nonconformant in
other locales as well. If you want a compromise, why not make the
correct behavior be dependent on strict POSIX mode?