[Bug binutils/27551] The default encoding of the strings utility does not conform to POSIX: should honor the current locale.

vincent-srcware at vinc17 dot net Fri, 09 Apr 2021 09:52:43 -0700

https://sourceware.org/bugzilla/show_bug.cgi?id=27551


--- Comment #13 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Nick Clifton from comment #12)
> (In reply to Vincent Lefèvre from comment #10)
> Hi Vincent,
> 
> > The bug is that:
> > 
> >   if (encoding == 's')
> >     buf[0] = c & 0x7f;
> > 
> > So the byte 0xc0 gets changed to 0x40, which is printable.
> 
> No - this is the correct behaviour.  The 's' encoding says that the
> characters in the file being examined are 7-bits long, not 8-bits.  Hence
> when a byte is read only the bottom 7 bits should be considered when
> deciding if the character is printable.

Then the 's' encoding must not be the default in locales whose character
encoding is not ASCII.
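
To make the disagreement concrete, here is a minimal sketch of the two ways a
byte can be classified. Both helper names are hypothetical and are not taken
from the strings source; the printable range 0x20..0x7e is an assumption that
stands in for whatever check strings actually applies.

```c
#include <stddef.h>

/* Hypothetical helper: mimics the quoted 's' path, which masks the byte to
   7 bits before deciding whether it is printable. */
static int printable_after_mask(unsigned char c)
{
    unsigned char masked = c & 0x7f;    /* 0xc0 becomes 0x40, i.e. '@' */
    return masked >= 0x20 && masked <= 0x7e;
}

/* Hypothetical helper: classifies the raw byte as ASCII without masking,
   so any byte with the high bit set is rejected. */
static int printable_raw_ascii(unsigned char c)
{
    return c >= 0x20 && c <= 0x7e;
}

/* Counts how many bytes of buf pass the given classifier. */
static size_t count_printable(const unsigned char *buf, size_t len,
                              int (*classify)(unsigned char))
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++)
        if (classify(buf[i]))
            n++;
    return n;
}
```

For the four bytes "\300\300\300\300", count_printable() gives 4 with the
masking classifier and 0 with the raw one, which is why the masked run is
long enough to be reported as a string.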

> > % printf "\300\300\300\300" | ./strings | iconv
> > iconv: illegal input sequence at position 0
> 
> But if we use your original test case and the patched strings:
> 
>   % printf "abcdéfghi" | ./strings | iconv 
>   abcdiconv: illegal input sequence at position 4
> 
>   % echo $LC_CTYPE
>   C.UTF-8

With the patched strings, I get the following on Debian/unstable:

zira% printf "abcdéfghi" | ./strings | iconv
abcdéfghi
zira% echo $LC_CTYPE
C.UTF-8

Perhaps your system doesn't support the C.UTF-8 locale.
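
One way to check this from C, as a rough sketch: setlocale() returns NULL
when the requested locale cannot be selected, and nl_langinfo(CODESET)
reports the character encoding of the locale that is in effect. The helper
name below is hypothetical. Note that some C libraries accept unknown locale
names instead of failing, so a non-NULL result alone is not proof that the
locale is really installed.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: tries to select the named locale for LC_CTYPE and
   returns its codeset name, or NULL if setlocale() rejects the name. */
static const char *codeset_for_locale(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return NULL;                /* locale not available on this system */
    return nl_langinfo(CODESET);    /* e.g. "UTF-8" in a UTF-8 locale */
}
```

Calling codeset_for_locale("C.UTF-8") and comparing the result against
"UTF-8" would show whether the test above really ran in a UTF-8 locale.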

> Are you saying that the length parameter passed to mbtowc() should include
> the first NUL byte ?

No, mbtowc() needs the whole multibyte sequence. For "é" in UTF-8, that's the
two bytes "c3 a9". This means that mbtowc() should be given up to MB_CUR_MAX
bytes to be sure to support all printable characters (4 is sufficient for
Unicode characters encoded in UTF-8).
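
A minimal sketch of that call pattern follows. The helper names and the list
of locale names tried are assumptions for illustration; this is not the code
of the patch under discussion.

```c
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: tries two commonly available UTF-8 locale names.
   Returns 1 on success, 0 if neither can be selected. */
static int try_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}

/* Hypothetical helper: decodes one character from buf, offering mbtowc()
   at most MB_CUR_MAX bytes (or fewer if fewer are available).
   Returns the number of bytes consumed, 0 for a NUL character, or -1 if
   the bytes offered do not form a complete valid character. */
static int decode_one(const char *buf, size_t avail, wchar_t *out)
{
    size_t limit = (size_t) MB_CUR_MAX;
    size_t n = avail < limit ? avail : limit;

    mbtowc(NULL, NULL, 0);          /* reset any shift state */
    return mbtowc(out, buf, n);
}
```

In a UTF-8 locale, decode_one("\xc3\xa9", 2, &wc) consumes both bytes of "é",
whereas offering only the first byte fails because the sequence is
incomplete.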

