BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Linda Walsh


It appears printf in bash doesn't support unicode
characters in a couple of ways:

1) use of the \u and \U escape sequences
in the format string (16 and 32 bit Unicode values).

2) It doesn't handle the "%lc" conversion to print out wide
characters.  To demonstrate this I created a wide char for a
double exclamation mark U+203C, using a=$'0x3c\0x20' and then
tried to print "$a".


From the list of supported formats, %lc should be valid
as in the sprintf function:

  c  If no l modifier is present, the int argument is converted to an
     unsigned char, and the resulting character is written.  If an l
     modifier is present, the wint_t (wide character) argument is
     converted to a multibyte sequence by a call to the wcrtomb(3)
     function, with a conversion state starting in the initial state,
     and the resulting multibyte string is written.


The GNU version of printf handles the \u and \U
escapes, but doesn't appear to handle the "%lc" format specifier.

I.e. /usr/bin/printf "\u203c" will print out the double exclamation mark
on a tty that is using a font with it defined (like "Lucida Console").

It's not horribly vital but I noticed it wasn't supported when looking 
at character support in filenames...







Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Pierre Gaston
On Fri, May 20, 2011 at 10:31 AM, Linda Walsh  wrote:
>
> It appears printf in bash doesn't support unicode
> characters in a couple of ways:
>
> 1) use of the \u and \U escape sequences
> in the format string (16 and 32 bit Unicode values).

$ printf '%s: \u6444\n' $BASH_VERSION
4.2.8(1)-release: 摄



Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Andreas Schwab
Linda Walsh  writes:

> 2) It doesn't handle the "%lc" conversion to print out wide
> characters.  To demonstrate this I created a wide char for a
> double exclamation mark U+203C, using a=$'0x3c\0x20' and then

That's not a wide character, that's a four character string.  Since
there is no way to produce a word containing a NUL character it is
impossible to support %lc in any useful way.

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
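
The point about NUL bytes can be checked from the command line.  The
following is a minimal sketch, assuming bash (the $'...' quoting is a
bash extension) and a POSIX wc; the variable names are illustrative
only and do not come from the thread.

```shell
# A NUL byte written straight to a pipe survives: five bytes arrive.
piped=$(printf 'ab\0cd' | wc -c)
printf 'bytes through a pipe: %s\n' "$piped"

# The same bytes captured into a variable lose the NUL, because shell
# words are C strings.  (Newer bash versions also print a warning here.)
captured=$(printf 'ab\0cd')
printf 'length in a variable: %s\n' "${#captured}"

# In bash, $'...' stops at the first \0, so only "ab" is stored.
quoted=$'ab\0cd'
printf 'length of ANSI-C quoted word: %s\n' "${#quoted}"
```

Since a wide character below 0x100 always has a zero byte somewhere in
its in-memory form, a shell argument cannot carry it as raw bytes.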



Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Greg Wooledge
On Fri, May 20, 2011 at 12:31:31AM -0700, Linda Walsh wrote:
> 1) use of the \u and \U escape sequences
> in the format string (16 and 32 bit Unicode values).

This isn't even a sentence.  What bash command did you execute, and
what did it do, and what did you expect it to do?

In bash 4.2, on a Debian 6.0 box with a UTF-8 locale, printf '\u203c\n'
prints the !! character (and a newline).  You have not actually stated
what you DID, and how it FAILED.

> 2) It doesn't handle the "%lc" conversion to print out wide
> characters.  To demonstrate this I created a wide char for a
> double exclamation mark U+203C, using a=$'0x3c\0x20' and then
> tried to print "$a".

What does   a=$'...'; printf '%s\n' "$a"   have to do with %lc?

Even if you had correctly used the $'...' syntax, $'\x3c\x20' is NOT
how you encode U+203C.  Nor does it have anything to do with %lc,
whatever that is.  (I don't see it defined in POSIX
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap05.html
for instance.)

According to http://www.fileformat.info/info/unicode/char/203c/index.htm
the UTF-8 encoding of U+203C is E2 80 BC.  Thus:

wooledg@wooledg:/var/tmp/bash/bash-4.2$ a=$'\xe2\x80\xbc'; printf '%s\n' "$a"
?

Here the ? is the !! character being pasted across machines into my
vim window where I'm writing this email.  But trust me, it worked.

> The gnu version of printf handles the \u and \U
> version, but doesn't appear to handle the "%lc" format specifier.

What's that got to do with bash?  What does \u have to do with %lc?

> I.e. /usr/bin/printf "\u203c" will print out the double exclamation mark
> on a tty that is using a font with it defined (like "Lucida Console").

As I said above, bash 4.2's printf *also* handles this correctly.  What
did you do, and how did it fail?
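
To see where E2 80 BC comes from, the UTF-8 bytes can be derived from
the code point with plain shell arithmetic.  This is only a sketch under
the standard UTF-8 bit layout; the function name utf8_hex is made up for
this illustration and it does no range checking (surrogates, values
above U+10FFFF).

```shell
# Print the UTF-8 bytes (in hex) for a code point such as 0x203c.
utf8_hex() {
    cp=$(( $1 ))
    if [ "$cp" -lt 128 ]; then
        # 1 byte: 0xxxxxxx
        printf '%02x\n' "$cp"
    elif [ "$cp" -lt 2048 ]; then
        # 2 bytes: 110xxxxx 10xxxxxx
        printf '%02x %02x\n' \
            $(( 0xC0 | (cp >> 6) )) \
            $(( 0x80 | (cp & 0x3F) ))
    elif [ "$cp" -lt 65536 ]; then
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x\n' \
            $(( 0xE0 | (cp >> 12) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    else
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        printf '%02x %02x %02x %02x\n' \
            $(( 0xF0 | (cp >> 18) )) \
            $(( 0x80 | ((cp >> 12) & 0x3F) )) \
            $(( 0x80 | ((cp >> 6) & 0x3F) )) \
            $(( 0x80 | (cp & 0x3F) ))
    fi
}

utf8_hex 0x203c   # e2 80 bc
utf8_hex 0x6444   # e6 91 84
```

The two bytes 3c 20 never appear in that output, which is why
$'\x3c\x20' is not an encoding of U+203C in a UTF-8 locale.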



Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Ralf Goertz
Greg Wooledge wrote:

> On Fri, May 20, 2011 at 12:31:31AM -0700, Linda Walsh wrote:
>> 1) use of the \u and \U escape sequences
>> in the format string (16 and 32 bit Unicode values).
> 
> This isn't even a sentence.  What bash command did you execute, and
> what did it do, and what did you expect it to do?
> 
> In bash 4.2, on a Debian 6.0 box with a UTF-8 locale, printf '\u203c\n'
> prints the !! character (and a newline).  You have not actually stated
> what you DID, and how it FAILED.

I am not Linda but in my setting (4.1.10(1)-release) under linux 64bit I
have

$ /usr/bin/printf "\u203c\n" 
‼

but

$ printf "\u203c\n" 
\u203c






Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Chet Ramey
On 5/20/11 3:31 AM, Linda Walsh wrote:
> 
> It appears printf in bash doesn't support unicode
> characters in a couple of ways:
> 
> 1) use of the \u and \U escape sequences
> in the format string (16 and 32 bit Unicode values).

Bash-4.2 added support for the \u and \U format string escapes.
They're still not in Posix, but should go in for the next revision.

> 2) It doesn't handle the "%lc" conversion to print out wide
> characters.  

Also not in Posix, and of questionable value at the shell level.

Chet

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Greg Wooledge
On Fri, May 20, 2011 at 03:29:59PM +0200, Ralf Goertz wrote:
> > In bash 4.2, on a Debian 6.0 box with a UTF-8 locale, printf '\u203c\n'

> I am not Linda but in my setting (4.1.10(1)-release) under linux 64bit I
> have
> 
> $ printf "\u203c\n" 
> \u203c

It is a bash 4.2 feature; it does not work in bash 4.1.
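
A script that has to run on both 4.1 and 4.2 could probe for the feature
instead of comparing version numbers.  The sketch below is one possible
approach, not something proposed in the thread; the function names are
invented for the example.  When \u is used, what gets printed still
depends on the current locale, whereas the fallback always writes the
UTF-8 bytes.

```shell
# Succeeds if this shell's printf builtin expands \u escapes.
# U+0041 is plain "A", so the check does not depend on the locale.
printf_has_u() {
    [ "$(printf '\u0041' 2>/dev/null)" = "A" ]
}

# Fallback: the three UTF-8 bytes E2 80 BC written as octal escapes,
# which printf has always understood.
double_excl_bytes() {
    printf '\342\200\274\n'
}

# Print U+203C followed by a newline, using \u when it is available.
double_excl() {
    if printf_has_u; then
        printf '\u203c\n'
    else
        double_excl_bytes
    fi
}

double_excl
```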



Re: Shell case statements

2011-05-20 Thread Chet Ramey
On 5/19/11 6:09 PM, Eric Blake wrote:
> [adding bug-bash]
> 
> On 05/16/2011 07:23 PM, Wayne Pollock wrote:
>> (While cleaning up the standard for case statement, consider that it is 
>> currently
>> unspecified what should happen if an error occurs during the expansion of the
>> patterns; as expansions may have side-effects, when an error occurs on one
>> expansion, should the following patterns be expanded anyway?  Does it depend 
>> on
>> the error?  It seems reasonable to me that any errors should immediately 
>> terminate
>> the case statement.)
> 
> Well, that's rather all over the place, but yes, it does seem like bash
> was the buggiest of the lot, compared to other shells.  Interactively, I
> tested:
> 
> readonly x=1
> case 1 in $((x++)) ) echo hi1 ;; *) echo hi2; esac
> echo $x.$?
> 
> bash 4.1 printed:
> bash: x: readonly variable
> hi1
> 1.0
> which means it matched '1' to $((x++)) before reporting the failure to
> assign to x, and the case statement succeeded.  Changing the first "1"
> to any other string printed hi2  (the * case).

Thanks for the report.  This was an easy fix.  The variable assignment
error was actually handled correctly, the expression evaluation code
just didn't pay enough attention to the result.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: Shell case statements

2011-05-20 Thread Eric Blake
On 05/20/2011 09:33 AM, Chet Ramey wrote:
>> Well, that's rather all over the place, but yes, it does seem like bash
>> was the buggiest of the lot, compared to other shells.  Interactively, I
>> tested:
>>
>> readonly x=1
>> case 1 in $((x++)) ) echo hi1 ;; *) echo hi2; esac
>> echo $x.$?
>>
>> bash 4.1 printed:
>> bash: x: readonly variable
>> hi1
>> 1.0
>> which means it matched '1' to $((x++)) before reporting the failure to
>> assign to x, and the case statement succeeded.  Changing the first "1"
>> to any other string printed hi2  (the * case).
> 
> Thanks for the report.  This was an easy fix.  The variable assignment
> error was actually handled correctly, the expression evaluation code
> just didn't pay enough attention to the result.

How about the even simpler:

$ bash -c 'readonly x=5; echo $((x=5))'; echo $?
bash: x: readonly variable
5
0
$

Other shells abort rather than running echo:

$ ksh -c 'readonly x=5; echo $((x=5))'; echo $?
ksh: line 1: x: is read only
1
$ zsh -c 'readonly x=5; echo $((x=5))'; echo $?
zsh:1: read-only variable: x
1
$ dash -c 'readonly x=5; echo $((x=5))'; echo $?
dash: x: is read only
2
$

-- 
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org





Re: Shell case statements

2011-05-20 Thread Chet Ramey
On 5/20/11 12:10 PM, Eric Blake wrote:
> On 05/20/2011 09:33 AM, Chet Ramey wrote:
>>> Well, that's rather all over the place, but yes, it does seem like bash
>>> was the buggiest of the lot, compared to other shells.  Interactively, I
>>> tested:
>>>
>>> readonly x=1
>>> case 1 in $((x++)) ) echo hi1 ;; *) echo hi2; esac
>>> echo $x.$?
>>>
>>> bash 4.1 printed:
>>> bash: x: readonly variable
>>> hi1
>>> 1.0
>>> which means it matched '1' to $((x++)) before reporting the failure to
>>> assign to x, and the case statement succeeded.  Changing the first "1"
>>> to any other string printed hi2  (the * case).
>>
>> Thanks for the report.  This was an easy fix.  The variable assignment
>> error was actually handled correctly, the expression evaluation code
>> just didn't pay enough attention to the result.
> 
> How about the even simpler:
> 
> $ bash -c 'readonly x=5; echo $((x=5))'; echo $?

That's not simpler, that's exactly the same case: a variable assignment
error during expression evaluation for arithmetic expansion.

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Linda Walsh



Pierre Gaston wrote:

> On Fri, May 20, 2011 at 10:31 AM, Linda Walsh  wrote:
>> It appears printf in bash doesn't support unicode
>> characters in a couple of ways:
>>
>> 1) use of the \u and \U escape sequences
>> in the format string (16 and 32 bit Unicode values).
>
> $ printf '%s: \u6444\n' $BASH_VERSION
> 4.2.8(1)-release: 摄



Ah, thanks!   My bash (4.0.x) is too old...

Am in process of upgrading my distro, so that should help...

Thanks for the common sense answer.



Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Linda Walsh



Andreas Schwab wrote:

> Linda Walsh  writes:
>
>> 2) It doesn't handle the "%lc" conversion to print out wide
>> characters.  To demonstrate this I created a wide char for a
>> double exclamation mark U+203C, using a=$'0x3c\0x20' and then
>
> That's not a wide character, that's a four character string.


I don't know why I typed it in that way as it wasn't what I used in
my examples.   I often get distracted when typing in summaries and
don't type in my examples as created.   Will have to think about how
to compensate for my distractibility, but inherent in the process is
getting distracted away from using any compensation.  *sigh*

The 16-bit value I generated was done using:
   $'\x3c\x20'

That generates a 16-bit value:
echo -n $'\x3c\x20'|hexdump 
000 203c   
002


(default for hexdump is the "-x" param, which displays 16-bit values in hex.)

i.e. it's showing me a 16-bit value: 0x203c, which I thought would be the
wide-char value for the double-exclamation.  Going from the wchar definition
on NT, it is a 16-bit value.  Perhaps it is different under POSIX? but
0x203c taken as 32 bits with 2 high bytes of zeros would seem to specify
the same codepoint for the Dbl-EXcl.


> Since there is no way to produce a word containing a NUL character it is
> impossible to support %lc in any useful way.


That's annoying.   How can one print out unicode characters
that are supposed to be 1 char long? 


This isn't just a bash problem, given how well most of the unix "character"
utils work with unicode -- that's something that really needs to be solved
if those character utils are going to continue to be _as useful_ in the future.
Sure, they will keep their current functionality, which is of use in many ways,
but for anyone not processing ASCII text it becomes a problem.  Still, this
isn't really a bash issue.

That said, it was my impression that a wchar was 16 bits (at least it
is on MS).  Is it different under POSIX?  At 16 bits, 0x203c would fit, and
theoretically could benefit if %lc worked.  I.e.:


b=$'\x3c\x20'
printf "%lc" "$b"


Though without some changes, it wouldn't work for chars with \00 in them, 
so would be of questionable use.


Oh well...

Again, thanks to the previous person who pointed out the \u & \U enhancements...
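
The hexdump reading discussed above can be separated from the bytes
themselves with od.  This is a sketch assuming a POSIX od; the octal
escapes \074 and \040 are the same two bytes as $'\x3c\x20', written in
the form every printf accepts.

```shell
# One byte per column: the two bytes really are 3c followed by 20.
printf '\074\040' | od -An -tx1

# Two-byte words in host byte order: on a little-endian machine the
# same two bytes are displayed as the single word 203c.
printf '\074\040' | od -An -tx2

# Seen as text, those bytes are just "<" followed by a space.
printf '[\074\040]\n'
```

So the 203c in the dump is a property of how the dump groups bytes, and
a program reading the string as text sees "< " rather than U+203C.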







Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Eric Blake
On 05/20/2011 02:30 PM, Linda Walsh wrote:
> i.e. it's showing me a 16-bit value: 0x203c, which I thought would be the
> wide-char value for the double-exclamation.  Going from the wchar
> definition
> on NT, it is a 16-bit value.  Perhaps it is different under POSIX? but
> 0x203c taken as 32 bits with 2 high bytes of zeros would seem to specify
> the same codepoint for the Dbl-EXcl.

POSIX allows wchar_t to be either 2-byte or 4-byte, although only a
4-byte wchar_t can properly represent all of Unicode (with 2-byte
wchar_t as on windows or Cygwin, you are inherently restricted from
using any Unicode character larger than 0x if you want to maintain
POSIX compliance).

> 
>> Since there is no way to produce a word containing a NUL character it is
>> impossible to support %lc in any useful way.
> 
> That's annoying.   How can one print out unicode characters
> that are supposed to be 1 char long?

I think you are misunderstanding the difference between wide characters
(exactly one wchar_t per character) and multi-byte characters (1 or more
char [byte] per character).

Unicode can be represented in two different ways.  One way is with wide
characters (every character represents exactly one Unicode codepoint,
and code points < 0x100 have embedded NUL bytes if you view the memory
containing those wchar_t as an array of bytes).  The other way is with
multi-byte encodings, such as UTF-8 (every character occupies a variable
number of bytes, and the only character that can contain an embedded NUL
byte is the NUL character at codepoint 0).

Bash _only_ uses multi-byte characters for input and output.  %lc only
uses wchar_t.  Since wchar_t output is not useful for a shell that does
not do input in wchar_t, that explains why bash printf need not support
%lc.  POSIX doesn't require it, at any rate, but it also doesn't forbid
it as an extension.

> This isn't just a bash problem given how well most of the unix "character"
> utils work with unicode -- that's something that really needs to be solved
> if those character utils are going to continue to be _as useful_ in the
> future.
> Sure they will have their current functionality which is of use in many
> ways, but
> for anyone not processing ASCII text it becomes a problem, but this
> isn't really
> a bash is.

Most utilities that work with Unicode work with UTF-8 (that is, with
multi-byte-characters using variable number of bytes), and NOT with wide
characters (that is, with all characters occupying a fixed width).  But
you can switch between encodings using the iconv(1) utility, so it
shouldn't really be a problem in practice in converting from one
encoding type to another.

> That said, it was my impression that a wchar was 16-bits (at least it
> is on MS.  Is it different under POSIX?

POSIX allows 16-bit wchar_t, but if you have a 16-bit wchar_t, you
cannot support all of Unicode.

-- 
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org





Re: BUG? RFE? printf lacking unicode support in multiple areas

2011-05-20 Thread Linda Walsh


Greg Wooledge wrote:

> On Fri, May 20, 2011 at 12:31:31AM -0700, Linda Walsh wrote:
>> 1) use of the \u and \U escape sequences
>> in the format string (16 and 32 bit Unicode values).
>
> This isn't even a sentence.  What bash command did you execute, and
> what did it do, and what did you expect it to do?

---
Um...maybe what it does in 4.2?

> Even if you had correctly used the $'...' syntax, $'\x3c\x20' is NOT
> how you encode U+203C.
> Nor does it have anything to do with %lc,

---
Your information is invalid.

%lc uses wide chars ('wchar_t' or 'wint_t').  These are 16 bits on
Windows and Cygwin, and 32 bits with glibc.


wchar_t is also defined as 'utf16' (as a type in the include header files
on linux).   That means from the page you so graciously point to:

http://www.fileformat.info/info/unicode/char/203c/index.htm

one would use the UTF-16 value...which is..um...gee, let's see
0x203c.  Gosh, what'ya know!


> the UTF-8 encoding of U+203C is E2 80 BC.


Which has nothing to do with the data input taken by the %lc format.

If your terminal encoding is set to UTF8, it SHOULD output UTF-8 -- 
a multibyte string is specified as the output.




> wooledg@wooledg:/var/tmp/bash/bash-4.2$ a=$'\xe2\x80\xbc'; printf '%s\n' "$a"
> ?
>
> Here the ? is the !! character being pasted across machines into my
> vim window where I'm writing this email.  But trust me, it worked.


>> The gnu version of printf handles the \u and \U
>> version, but doesn't appear to handle the "%lc" format specifier.
>
> What's that got to do with bash?


Gee, I dunno maybe because it wasn't in my bash and when I did a man of
printf, it showed me those formats so I tried them with printf as my
first test?   Normally bash follows the same conventions for its builtin
utils as the ones that are not builtin...but you think Bash following
such standards is unreasonable?



> What does \u have to do with %lc?

---
Not much -- except that a wide char of 0x203c output using %lc
should output the same multi-byte char as \u203c.

Did you get out of the wrong side of the bed?  Your response drips with 
unnecessary hostility.