https://sourceware.org/bugzilla/show_bug.cgi?id=27551
--- Comment #13 from Vincent Lefèvre <vincent-srcware at vinc17 dot net> ---
(In reply to Nick Clifton from comment #12)
> (In reply to Vincent Lefèvre from comment #10)
> Hi Vincent,
>
> > The bug is that:
> >
> >       if (encoding == 's')
> >         buf[0] = c & 0x7f;
> >
> > So the byte 0xc0 gets changed to 0x40, which is printable.
>
> No - this is the correct behaviour.  The 's' encoding says that the
> characters in the file being examined are 7-bits long, not 8-bits.  Hence
> when a byte is read only the bottom 7 bits should be considered when
> deciding if the character is printable.

Then the 's' encoding must not be the default for non-ASCII encodings.

> >   % printf "\300\300\300\300" | ./strings | iconv
> >   iconv: illegal input sequence at position 0
>
> But if we use your original test case and the patched strings:
>
>   % printf "abcdéfghi" | ./strings | iconv
>   abcdiconv: illegal input sequence at position 4
>
>   % echo $LC_CTYPE
>   C.UTF-8

With the patched strings, I get under Debian/unstable:

zira% printf "abcdéfghi" | ./strings | iconv
abcdéfghi
zira% echo $LC_CTYPE
C.UTF-8

Perhaps your system doesn't support the C.UTF-8 locale.

> Are you saying that the length parameter passed to mbtowc() should include
> the first NUL byte ?

No, mbtowc() needs the whole UTF-8 sequence. For "é", that's "c3 a9". This
means that mbtowc() should get MB_CUR_MAX bytes to be sure to support all
printable characters (4 is sufficient for Unicode characters encoded in UTF-8).
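
The two points above (the 7-bit mask turning 0xc0 into a printable byte, and mbtowc() needing the whole sequence) can be checked with a small standalone sketch. This is not the binutils code: mask_7bit(), decode_one() and use_utf8_locale() are hypothetical helpers written for illustration, and the locale names tried are assumptions about what the system provides.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>
#include <wchar.h>

/* Same operation as the quoted 's' (7-bit) branch: keep the low 7 bits.
   0xc0 becomes 0x40, which is '@' in ASCII. */
static unsigned char mask_7bit(unsigned char c)
{
    return c & 0x7f;
}

/* Hypothetical helper: decode one character from buf, allowing mbtowc()
   to inspect at most `avail` bytes. Returns mbtowc()'s own result: the
   number of bytes consumed, 0 for a NUL character, or -1 when the bytes
   it was allowed to see do not form a valid character. */
static int decode_one(const char *buf, size_t avail, wchar_t *out)
{
    assert(buf != NULL);
    mbtowc(NULL, NULL, 0);          /* reset mbtowc()'s internal state */
    return mbtowc(out, buf, avail);
}

/* Assumption: one of these UTF-8 locale names exists on the system.
   Returns nonzero on success. */
static int use_utf8_locale(void)
{
    return setlocale(LC_CTYPE, "C.UTF-8") != NULL
        || setlocale(LC_CTYPE, "en_US.UTF-8") != NULL;
}
```

Once use_utf8_locale() has succeeded, calling decode_one() on a buffer starting with the bytes c3 a9 and a length of 1 should fail with -1, while a length of MB_CUR_MAX should consume 2 bytes. If neither locale is available, setlocale() returns NULL and the program stays in the "C" locale, which would match the "Perhaps your system doesn't support the C.UTF-8 locale" remark.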