Re: builtin printf behaves incorrectly with "c and 'c character-value arguments
Rich Felker wrote:

> $ printf %d\\n \'À
> -61
> (expected 192)
>
> This should be 192 regardless of locale on any system where wchar_t
> values are ISO-10646/Unicode. Bash is incorrectly reading the first
> byte of the UTF-8 which happens to be -61 when interpreted as signed
> char; on a Latin-1 based locale it will probably give -63 instead.
>
> Both POSIX and common sense are clear that the numeric values
> resulting from 'c should be the wchar_t value of c and not the value
> of the first byte of the multibyte character; from the SUSv3 printf(1)
> documentation:
>
> Note that in a locale with multi-byte characters, the value of a
> character is intended to be the value of the equivalent of the
> wchar_t representation of the character as described in the
> System Interfaces volume of IEEE Std 1003.1-2001.
>
> Language lawyers could argue that on 'single-byte' locales perhaps the
> byte value should be used; however, strictly speaking a single-byte
> locale is simply a special case of a multi-byte one, and sanity should
> win in any case.

You're correct that the bash printf should understand multibyte characters
in a multibyte locale, but not that returning a multibyte character when
a user hasn't asked for one by setting the locale is more "sane."

Chet

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
Live Strong. No day but today.
Chet Ramey, ITS, CWRU    [EMAIL PROTECTED]    http://cnswww.cns.cwru.edu/~chet/
Re: try to open file descriptor for input with 'exec' fails
[EMAIL PROTECTED] wrote:

> Bash Version: 3.2
> Patch Level: 25
> Release Status: release
>
> Description:
> In the following script i try to open a free file
> descriptor for input from a file.
> The script should read lines out of a textfile,
> output goes to stdout.
> This works fine till my last SUSE Linux 10.1 (sorry
> I don't know the version of the bash).
> But now (opensuseLinux 10.3) the script aborts with
> following error message:
>
> ./doit: line 29: exec: 3: not found
>
> This is the line where i try to open the file descriptor
> for input:
> exec ${fd}<$inf

That form of redirection construct is not parsed the way you are assuming.
The shell grammar has always required a number before the `<' or `>' to
specify a particular file descriptor.

Chet
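
A minimal sketch of one way to open a descriptor whose number is held in a variable, assuming a POSIX-style shell. It is not taken from the thread: the names `fd` and `inf` mirror the report, while the temporary file and its contents are hypothetical. `eval` is used so that the number is already literal text by the time the redirection is parsed.

```shell
# Hypothetical sample input standing in for the reporter's text file.
inf=$(mktemp)
printf 'first\nsecond\n' > "$inf"
fd=3

# `exec ${fd}<$inf` is parsed as the command word "3" with stdin redirected,
# hence "exec: 3: not found". Building the text first gives: exec 3<"$inf"
eval "exec $fd<\"\$inf\""

# Read two lines through the descriptor named by $fd.
IFS= read -r line1 <&"$fd"
IFS= read -r line2 <&"$fd"
printf '%s\n' "$line1" "$line2"

# Close the descriptor and remove the sample file.
eval "exec $fd<&-"
rm -f "$inf"
```

The `eval` form keeps the descriptor number configurable at the cost of an extra round of parsing, so the file name is left as `\$inf` and only expanded inside the evaluated command.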
Re: builtin printf behaves incorrectly with "c and 'c character-value arguments
On Mon, Nov 05, 2007 at 09:10:29AM -0500, Chet Ramey wrote:

> Rich Felker wrote:
> > $ printf %d\\n \'À
> > -61
> > (expected 192)
> >
> > This should be 192 regardless of locale on any system where wchar_t
> > values are ISO-10646/Unicode. Bash is incorrectly reading the first
> > byte of the UTF-8 which happens to be -61 when interpreted as signed
> > char; on a Latin-1 based locale it will probably give -63 instead.
> >
> > Both POSIX and common sense are clear that the numeric values
> > resulting from 'c should be the wchar_t value of c and not the value
> > of the first byte of the multibyte character; from the SUSv3 printf(1)
> > documentation:
> >
> > Note that in a locale with multi-byte characters, the value of a
> > character is intended to be the value of the equivalent of the
> > wchar_t representation of the character as described in the
> > System Interfaces volume of IEEE Std 1003.1-2001.
> >
> > Language lawyers could argue that on 'single-byte' locales perhaps the
> > byte value should be used; however, strictly speaking a single-byte
> > locale is simply a special case of a multi-byte one, and sanity should
> > win in any case.
>
> You're correct that the bash printf should understand multibyte characters
> in a multibyte locale, but not that returning a multibyte character when
> a user hasn't asked for one by setting the locale is more "sane."

I'm not sure what you mean. For a Latin-1 locale there is no
difference, but if the locale is a different legacy locale, the
wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
defined) needs to be returned. If you're doubtful about the intent of
the standard, why not file a request for interpretation?

Rich
Re: builtin printf behaves incorrectly with "c and 'c character-value arguments
Rich Felker wrote:

> I'm not sure what you mean. For a Latin-1 locale there is no
> difference, but if the locale is a different legacy locale, the
> wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
> defined) needs to be returned. If you're doubtful about the intent of
> the standard, why not file a request for interpretation?

I'm not doubtful about the standard's intent. When the user has not
chosen to use a locale that contains multibyte characters, not only
should bash not second-guess the user by returning a multibyte
character, functions such as mbrtowc or mblen/mbrlen will not return
"multibyte" values (e.g., mbrlen will return `1' and mbrtowc will return
`-61' -- converted to 195, since it's unsigned -- as its wchar value
while converting 1 character in your example).

Chet
Re: builtin printf behaves incorrectly with "c and 'c character-value arguments
On Mon, Nov 05, 2007 at 10:23:43PM -0500, Chet Ramey wrote:

> Rich Felker wrote:
>
> > I'm not sure what you mean. For a Latin-1 locale there is no
> > difference, but if the locale is a different legacy locale, the
> > wchar_t value (Unicode scalar value on systems with __STDC_ISO_10646__
> > defined) needs to be returned. If you're doubtful about the intent of
> > the standard, why not file a request for interpretation?
>
> I'm not doubtful about the standard's intent. When the user has not
> chosen to use a locale that contains multibyte characters, not only
> should bash not second-guess the user by returning a multibyte
> character, functions such as mbrtowc or mblen/mbrlen will not return
> "multibyte" values (e.g., mbrlen will return `1' and mbrtowc will return
> `-61' -- converted to 195, since it's unsigned -- as its wchar value
> while converting 1 character in your example).

This 195 _is_ its value as a multibyte character in a locale with
ISO-8859-1 as its codeset. In such a locale, it's also the value of
the byte (interpreted as unsigned). So here it doesn't matter which
you use; either is equally correct.

Where something different happens is if your locale has a different
codeset. For instance, in KOI8-R, there is a character "²" which is
placed on a different byte (9D) than in ISO-8859 encodings (B2). But
regardless of your locale,

$ printf %d\\n \'²

should print 178, provided that your system implementation uses the
same values for wchar_t regardless of locale. These semantics are
useful because they actually tell you something about the identity of
the character.

But most importantly, it's just illogical for the function to behave
differently based on whether MB_CUR_MAX is 1 or something greater than
1, rather than being based on the actual locale encoding. "²" is a "²"
in a KOI8-R locale just as much as it is a "²" in a UTF-8 locale.
Bash's printf should not treat the KOI8-R locale badly just because
all characters happen to fit into one byte.

The mbrtowc function will give the correct result for all locales,
whether or not they have characters that take multiple bytes to
represent; special-casing locales that don't just gives illogical (and
non-conformant!) behavior.

Rich

P.S. For my own usage I'd be plenty happy as long as the bug is fixed
in UTF-8 based locales since that's all I ever intend to use. But I
maintain that the current behavior is incorrect and nonconformant in
other locales as well. If you want a compromise, why not make the
correct behavior be dependent on strict posix mode?
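
For readers who want to probe the behavior discussed in this thread, here is a small sketch of the `'c` argument form. It is an illustration only, not code from either poster. The locale name `en_US.UTF-8` is an assumption and may not be installed on a given system, and the variable names are hypothetical. The number printed for the non-ASCII character depends on the shell version and the locale in effect, which is the point under dispute, so no expected output is given for it.

```shell
# A leading single quote asks printf for the numeric value of the next character.
# For ASCII this is the same in every locale mentioned in the thread.
ascii_value=$(printf '%d' "'A")
printf '%s\n' "$ascii_value"    # prints 65

# The UTF-8 encoding of U+00C0 (the character from the original report),
# written as octal escapes so the script itself stays ASCII.
char=$(printf '\303\200')

# Result varies by shell and locale; en_US.UTF-8 is an assumed locale name.
for loc in C en_US.UTF-8; do
    LC_ALL=$loc printf '%s: %d\n' "$loc" "'$char"
done
```

Comparing the two lines printed by the loop shows whether a given shell reports the first byte or the wide-character value for that locale.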