BUG? RFE? printf lacking unicode support in multiple areas
It appears printf in bash doesn't support unicode characters in a couple
of ways:

1) Use of the \u and \U escape sequences in the format string (16- and
   32-bit Unicode values).

2) It doesn't handle the "%lc" conversion to print out wide characters.
   To demonstrate this I created a wide char for a double exclamation
   mark, U+203C, using a=$'0x3c\0x20' and then tried to print "$a".

From the list of supported formats, %lc should be valid, as in the
sprintf function:

    c   If no l modifier is present, the int argument is converted to
        an unsigned char, and the resulting character is written.  If
        an l modifier is present, the wint_t (wide character) argument
        is converted to a multibyte sequence by a call to the
        wcrtomb(3) function, with a conversion state starting in the
        initial state, and the resulting multibyte string is written.

The gnu version of printf handles the \u and \U escapes, but doesn't
appear to handle the "%lc" format specifier.  I.e. /usr/bin/printf
"\u203c" will print out the double exclamation mark on a tty that is
using a font with it defined (like "Lucida Console").

It's not horribly vital, but I noticed it wasn't supported when looking
at character support in filenames...
Re: BUG? RFE? printf lacking unicode support in multiple areas
On Fri, May 20, 2011 at 10:31 AM, Linda Walsh wrote:
>
> It appears printf in bash doesn't support unicode
> characters in a couple of ways:
>
> 1) use of of the \u and \U escape sequences
> in the format string (16 and 32 bit Unicode values).

$ printf '%s: \u6444\n' $BASH_VERSION
4.2.8(1)-release: 摄
Re: BUG? RFE? printf lacking unicode support in multiple areas
Linda Walsh writes:

> 2) It doesn't handle the "%lc" conversion to print out wide
> characters. To demonstrate this I created a wide char for a
> double exclamation mark U+203C, using a=$'0x3c\0x20' and then

That's not a wide character, that's a four character string.  Since
there is no way to produce a word containing a NUL character it is
impossible to support %lc in any useful way.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5
"And now for something completely different."
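[Archive editor's note: Andreas's point about NUL bytes can be seen directly. For code points below U+0100, the fixed-width wide encoding is mostly NUL bytes, and shell words cannot carry NUL at all. A minimal sketch, assuming bash with GNU od and glibc iconv available:]

```shell
# Why a wchar_t argument can't travel through a shell word: the 4-byte
# (UTF-32LE) wide encoding of 'A' (U+0041) is the byte 0x41 followed by
# three NUL bytes -- and shell words cannot contain NUL bytes.
printf 'A' | iconv -f UTF-8 -t UTF-32LE | od -An -tx1
# -> 41 00 00 00
```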
Re: BUG? RFE? printf lacking unicode support in multiple areas
On Fri, May 20, 2011 at 12:31:31AM -0700, Linda Walsh wrote:
> 1) use of of the \u and \U escape sequences
> in the format string (16 and 32 bit Unicode values).

This isn't even a sentence.  What bash command did you execute, and
what did it do, and what did you expect it to do?

In bash 4.2, on a Debian 6.0 box with a UTF-8 locale, printf '\u203c\n'
prints the !! character (and a newline).  You have not actually stated
what you DID, and how it FAILED.

> 2) It doesn't handle the "%lc" conversion to print out wide
> characters. To demonstrate this I created a wide char for a
> double exclamation mark U+203C, using a=$'0x3c\0x20' and then
> tried to print "$a".

What does a=$'...'; printf '%s\n' "$a" have to do with %lc?

Even if you had correctly used the $'...' syntax, $'\x3c\x20' is NOT
how you encode U+203C.  Nor does it have anything to do with %lc,
whatever that is.  (I don't see it defined in POSIX
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap05.html
for instance.)

According to http://www.fileformat.info/info/unicode/char/203c/index.htm
the UTF-8 encoding of U+203C is E2 80 BC.  Thus:

wooledg@wooledg:/var/tmp/bash/bash-4.2$ a=$'\xe2\x80\xbc'; printf '%s\n' "$a"
?

Here the ? is the !! character being pasted across machines into my vim
window where I'm writing this email.  But trust me, it worked.

> The gnu version of printf handles the \u and \U
> version, but doesn't appear to handle the "%lc" format specifier.

What's that got to do with bash?  What does \u have to do with %lc?

> I.e. /usr/bin/printf "\u203c" will print out the double exclamation mark
> on a tty that is using a font with it defined (like "Lucida Console").

As I said above, bash 4.2's printf *also* handles this correctly.  What
did you do, and how did it fail?
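[Archive editor's note: the byte-level claim here can be checked mechanically. A short sketch, assuming bash (for the $'...' syntax) and od:]

```shell
# U+203C in UTF-8 is the three bytes E2 80 BC; the two bytes 3C 20 are
# just "<" followed by a space.  A byte-wise od dump shows both plainly.
a=$'\xe2\x80\xbc'                # the UTF-8 encoding of U+203C
b=$'\x3c\x20'                    # NOT U+203C -- "<" plus a space
printf '%s' "$a" | od -An -tx1   # -> e2 80 bc
printf '%s' "$b" | od -An -tx1   # -> 3c 20
```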
Re: BUG? RFE? printf lacking unicode support in multiple areas
Greg Wooledge wrote:
> On Fri, May 20, 2011 at 12:31:31AM -0700, Linda Walsh wrote:
>> 1) use of of the \u and \U escape sequences
>> in the format string (16 and 32 bit Unicode values).
>
> This isn't even a sentence.  What bash command did you execute, and
> what did it do, and what did you expect it to do?
>
> In bash 4.2, on a Debian 6.0 box with a UTF-8 locale, printf '\u203c\n'
> prints the !! character (and a newline).  You have not actually stated
> what you DID, and how it FAILED.

I am not Linda but in my setting (4.1.10(1)-release) under linux 64bit
I have

$ /usr/bin/printf "\u203c\n"
‼

but

$ printf "\u203c\n"
\u203c
Re: BUG? RFE? printf lacking unicode support in multiple areas
On 5/20/11 3:31 AM, Linda Walsh wrote:
>
> It appears printf in bash doesn't support unicode
> characters in a couple of ways:
>
> 1) use of of the \u and \U escape sequences
> in the format string (16 and 32 bit Unicode values).

Bash-4.2 added support for the \u and \U format string escapes.
They're still not in Posix, but should go in for the next revision.

> 2) It doesn't handle the "%lc" conversion to print out wide
> characters.

Also not in Posix, and of questionable value at the shell level.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Re: BUG? RFE? printf lacking unicode support in multiple areas
On Fri, May 20, 2011 at 03:29:59PM +0200, Ralf Goertz wrote:
>> In bash 4.2, on a Debian 6.0 box with a UTF-8 locale, printf '\u203c\n'

> I am not Linda but in my setting (4.1.10(1)-release) under linux 64bit I
> have
>
> $ printf "\u203c\n"
> \u203c

It is a bash 4.2 feature; it does not work in bash 4.1.
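[Archive editor's note: since the \u escape arrived in bash 4.2, a script that must also run on older bash can gate on BASH_VERSINFO. A sketch; the fallback hard-codes the UTF-8 bytes of U+203C as octal escapes, which any POSIX printf understands:]

```shell
# Use \u where the builtin printf supports it (bash >= 4.2); otherwise
# emit the UTF-8 bytes of U+203C directly via portable octal escapes.
if (( BASH_VERSINFO[0] > 4 || (BASH_VERSINFO[0] == 4 && BASH_VERSINFO[1] >= 2) )); then
    printf '\u203c\n'
else
    printf '\342\200\274\n'   # octal for E2 80 BC, the UTF-8 form of U+203C
fi
```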
Re: Shell case statements
On 5/19/11 6:09 PM, Eric Blake wrote:
> [adding bug-bash]
>
> On 05/16/2011 07:23 PM, Wayne Pollock wrote:
>> (While cleaning up the standard for case statement, consider that it
>> is currently unspecified what should happen if an error occurs during
>> the expansion of the patterns; as expansions may have side-effects,
>> when an error occurs on one expansion, should the following patterns
>> be expanded anyway?  Does it depend on the error?  It seems
>> reasonable to me that any errors should immediately terminate the
>> case statement.)
>
> Well, that's rather all over the place, but yes, it does seem like bash
> was the buggiest of the lot, compared to other shells.  Interactively,
> I tested:
>
> readonly x=1
> case 1 in $((x++)) ) echo hi1 ;; *) echo hi2; esac
> echo $x.$?
>
> bash 4.1 printed:
> bash: x: readonly variable
> hi1
> 1.0
> which means it matched '1' to $((x++)) before reporting the failure to
> assign to x, and the case statement succeeded.  Changing the first "1"
> to any other string printed hi2 (the * case).

Thanks for the report.  This was an easy fix.  The variable assignment
error was actually handled correctly; the expression evaluation code
just didn't pay enough attention to the result.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Re: Shell case statements
On 05/20/2011 09:33 AM, Chet Ramey wrote:
>> Well, that's rather all over the place, but yes, it does seem like bash
>> was the buggiest of the lot, compared to other shells.  Interactively,
>> I tested:
>>
>> readonly x=1
>> case 1 in $((x++)) ) echo hi1 ;; *) echo hi2; esac
>> echo $x.$?
>>
>> bash 4.1 printed:
>> bash: x: readonly variable
>> hi1
>> 1.0
>> which means it matched '1' to $((x++)) before reporting the failure to
>> assign to x, and the case statement succeeded.  Changing the first "1"
>> to any other string printed hi2 (the * case).
>
> Thanks for the report.  This was an easy fix.  The variable assignment
> error was actually handled correctly, the expression evaluation code
> just didn't pay enough attention to the result.

How about the even simpler:

$ bash -c 'readonly x=5; echo $((x=5))'; echo $?
bash: x: readonly variable
5
0
$

Other shells abort rather than running echo:

$ ksh -c 'readonly x=5; echo $((x=5))'; echo $?
ksh: line 1: x: is read only
1
$ zsh -c 'readonly x=5; echo $((x=5))'; echo $?
zsh:1: read-only variable: x
1
$ dash -c 'readonly x=5; echo $((x=5))'; echo $?
dash: x: is read only
2
$

-- 
Eric Blake   ebl...@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org
Re: Shell case statements
On 5/20/11 12:10 PM, Eric Blake wrote:
> On 05/20/2011 09:33 AM, Chet Ramey wrote:
>>> Well, that's rather all over the place, but yes, it does seem like bash
>>> was the buggiest of the lot, compared to other shells.  Interactively,
>>> I tested:
>>>
>>> readonly x=1
>>> case 1 in $((x++)) ) echo hi1 ;; *) echo hi2; esac
>>> echo $x.$?
>>>
>>> bash 4.1 printed:
>>> bash: x: readonly variable
>>> hi1
>>> 1.0
>>> which means it matched '1' to $((x++)) before reporting the failure to
>>> assign to x, and the case statement succeeded.  Changing the first "1"
>>> to any other string printed hi2 (the * case).
>>
>> Thanks for the report.  This was an easy fix.  The variable assignment
>> error was actually handled correctly, the expression evaluation code
>> just didn't pay enough attention to the result.
>
> How about the even simpler:
>
> $ bash -c 'readonly x=5; echo $((x=5))'; echo $?

That's not simpler, that's exactly the same case: a variable assignment
error during expression evaluation for arithmetic expansion.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
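[Archive editor's note: the behavior under discussion is easy to reproduce from a prompt. A sketch, assuming a bash that includes Chet's fix, in which the expansion error aborts the command so echo never runs and the exit status is nonzero, matching ksh/zsh/dash:]

```shell
# With the fix, assigning to a readonly variable inside $((...)) makes
# the whole command fail; the value is never echoed.
bash -c 'readonly x=5; echo $((x=5))'
echo "exit status: $?"   # nonzero once the expression error is honored
```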
Re: BUG? RFE? printf lacking unicode support in multiple areas
Pierre Gaston wrote:
> On Fri, May 20, 2011 at 10:31 AM, Linda Walsh wrote:
>> It appears printf in bash doesn't support unicode
>> characters in a couple of ways:
>>
>> 1) use of of the \u and \U escape sequences
>> in the format string (16 and 32 bit Unicode values).
>
> $ printf '%s: \u6444\n' $BASH_VERSION
> 4.2.8(1)-release: 摄

Ah, thanks!  My bash (4.0.x) is too old...  Am in process of upgrading
my distro, so that should help...

Thanks for the common sense answer.
Re: BUG? RFE? printf lacking unicode support in multiple areas
Andreas Schwab wrote:
> Linda Walsh writes:
>> 2) It doesn't handle the "%lc" conversion to print out wide
>> characters. To demonstrate this I created a wide char for a
>> double exclamation mark U+203C, using a=$'0x3c\0x20' and then
>
> That's not a wide character, that's a four character string.

I don't know why I typed it in that way, as it wasn't what I used in my
examples.  I often get distracted when typing in summaries and don't
type in my examples as created.  Will have to think about how to
compensate for my distractibility, but inherent in the process is
getting distracted away from using any compensation.  *sigh*

The 16-bit value I generated was done using: $'\x3c\x20'

That generates a 16-bit value:

echo -n $'\x3c\x20' | hexdump
0000000 203c
0000002

(The default for hexdump is the "-x" format, which displays 16-bit
values in hex.)  I.e., it's showing me a 16-bit value, 0x203c, which I
thought would be the wide-char value for the double exclamation mark.
Going from the wchar definition on NT, it is a 16-bit value.  Perhaps
it is different under POSIX?  But 0x203c taken as 32 bits with 2 high
bytes of zeros would seem to specify the same codepoint for the
Dbl-Excl.

> Since there is no way to produce a word containing a NUL character it
> is impossible to support %lc in any useful way.

That's annoying.  How can one print out unicode characters that are
supposed to be 1 char long?

This isn't just a bash problem given how well most of the unix
"character" utils work with unicode -- that's something that really
needs to be solved if those character utils are going to continue to be
_as useful_ in the future.  Sure they will have their current
functionality, which is of use in many ways, but for anyone not
processing ASCII text it becomes a problem, but this isn't really a
bash issue.

That said, it was my impression that a wchar was 16 bits (at least it
is on MS).  Is it different under POSIX?  At 16 bits, 0x203c would fit,
and theoretically could benefit if %lc worked.  I.e.:

b=$'\x3c\x20'
printf "%lc" "$b"

Though without some changes, it wouldn't work for chars with \00 in
them, so would be of questionable use.  Oh well...

Again, thanks to the previous person who pointed out the \u & \U
enhancements...
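[Archive editor's note: the hexdump output above deserves a caution. The apparent value 0x203c is an artifact of hexdump's default 16-bit little-endian word display, not evidence that the string contains U+203C. A sketch, assuming GNU od on a little-endian machine:]

```shell
# The bytes 3C 20 ("<" and space) group into the little-endian 16-bit
# word 0x203C in a word-wise dump, but a byte-wise dump shows the truth.
printf '\074\040' | od -An -tx2   # two bytes read as one LE word: 203c
printf '\074\040' | od -An -tx1   # the actual bytes
# -> 3c 20
```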
Re: BUG? RFE? printf lacking unicode support in multiple areas
On 05/20/2011 02:30 PM, Linda Walsh wrote:
> I.e., it's showing me a 16-bit value, 0x203c, which I thought would be
> the wide-char value for the double exclamation mark.  Going from the
> wchar definition on NT, it is a 16-bit value.  Perhaps it is different
> under POSIX?  But 0x203c taken as 32 bits with 2 high bytes of zeros
> would seem to specify the same codepoint for the Dbl-Excl.

POSIX allows wchar_t to be either 2-byte or 4-byte, although only a
4-byte wchar_t can properly represent all of Unicode (with 2-byte
wchar_t as on Windows or Cygwin, you are inherently restricted from
using any Unicode character larger than 0xffff if you want to maintain
POSIX compliance).

>> Since there is no way to produce a word containing a NUL character it
>> is impossible to support %lc in any useful way.
>
> That's annoying.  How can one print out unicode characters
> that are supposed to be 1 char long?

I think you are misunderstanding the difference between wide characters
(exactly one wchar_t per character) and multi-byte characters (1 or
more char [byte] per character).

Unicode can be represented in two different ways.  One way is with wide
characters (every character represents exactly one Unicode codepoint,
and code points < 0x100 have embedded NUL bytes if you view the memory
containing those wchar_t as an array of bytes).  The other way is with
multi-byte encodings, such as UTF-8 (every character occupies a
variable number of bytes, and the only character that can contain an
embedded NUL byte is the NUL character at codepoint 0).

Bash _only_ uses multi-byte characters for input and output.  %lc only
uses wchar_t.  Since wchar_t output is not useful for a shell that does
not do input in wchar_t, that explains why bash printf need not support
%lc.  POSIX doesn't require it, at any rate, but it also doesn't forbid
it as an extension.

> This isn't just a bash problem given how well most of the unix
> "character" utils work with unicode -- that's something that really
> needs to be solved if those character utils are going to continue to
> be _as useful_ in the future.  Sure they will have their current
> functionality, which is of use in many ways, but for anyone not
> processing ASCII text it becomes a problem, but this isn't really a
> bash issue.

Most utilities that work with Unicode work with UTF-8 (that is, with
multi-byte characters using a variable number of bytes), and NOT with
wide characters (that is, with all characters occupying a fixed width).
But you can switch between encodings using the iconv(1) utility, so it
shouldn't really be a problem in practice to convert from one encoding
type to another.

> That said, it was my impression that a wchar was 16 bits (at least it
> is on MS).  Is it different under POSIX?

POSIX allows 16-bit wchar_t, but if you have a 16-bit wchar_t, you
cannot support all of Unicode.

-- 
Eric Blake   ebl...@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org
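[Archive editor's note: the iconv(1) suggestion covers exactly the conversion %lc would have performed via wcrtomb(3). A sketch, assuming glibc iconv with UTF-16LE support: treat the pair of bytes 3C 20 as one UTF-16LE code unit and convert it to UTF-8.]

```shell
# The 16-bit code unit 0x203C, stored little-endian as the bytes 3C 20,
# converts to the UTF-8 sequence E2 80 BC -- the double exclamation mark.
printf '\074\040' | iconv -f UTF-16LE -t UTF-8 | od -An -tx1
# -> e2 80 bc
```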
Re: BUG? RFE? printf lacking unicode support in multiple areas
Greg Wooledge wrote:
> On Fri, May 20, 2011 at 12:31:31AM -0700, Linda Walsh wrote:
>> 1) use of of the \u and \U escape sequences
>> in the format string (16 and 32 bit Unicode values).
>
> This isn't even a sentence.  What bash command did you execute, and
> what did it do, and what did you expect it to do?

Um... maybe what it does in 4.2?

> Even if you had correctly used the $'...' syntax, $'\x3c\x20' is NOT
> how you encode U+203C.  Nor does it have anything to do with %lc,

Your information is invalid.  %lc uses wide chars, 'wchar_t' or
'wint_t'.  These are 16 bits on Win & Cygwin and 32 bits with glibc.
wchar_t is also defined as 'utf16' (as a type in the include header
files on linux).  That means from the page you so graciously point to:

http://www.fileformat.info/info/unicode/char/203c/index.htm

one would use the UTF-16 value... which is... um... gee, let's see,
0x203c.  Gosh, what'ya know!

> the UTF-8 encoding of U+203C is E2 80 BC.

Which has nothing to do with the data input taken by the %lc format.
If your terminal encoding is set to UTF-8, it SHOULD output UTF-8 -- a
multibyte string is specified as the output.

> wooledg@wooledg:/var/tmp/bash/bash-4.2$ a=$'\xe2\x80\xbc'; printf '%s\n' "$a"
> ?
>
> Here the ? is the !! character being pasted across machines into my
> vim window where I'm writing this email.  But trust me, it worked.
>
>> The gnu version of printf handles the \u and \U
>> version, but doesn't appear to handle the "%lc" format specifier.
>
> What's that got to do with bash?

Gee, I dunno, maybe because it wasn't in my bash, and when I did a man
of printf, it showed me those formats, so I tried them with printf as
my first test?  Normally bash follows the same conventions for its
builtin utils as for the ones that are not builtin... but you think
Bash following such standards is unreasonable?

> What does \u have to do with %lc?

Not much -- except that a wide char of 0x203c output using %lc should
output the same multi-byte char as \u203c.

Did you get out of the wrong side of the bed?  Your response drips with
unnecessary hostility.