Hi Ambrose, and as usual many thanks for reporting bugs!

 On Thursday, December 7, 2006 at 12:01:58 -0500, Ambrose Li wrote:

> In an allegedly-GB2312 string, any high byte following pure ASCII
> should be treated as the lead byte of a presumed double-byte
> character, even if it is invalid GB2312. The sequences "?v", "?F",
> "?L", and "?h" are all meaningless and should all be simply "??"
> (because they are "unknown kanji", not pairs of "unknown 8-bit
> character followed by valid ASCII").

    Mutt is probably not at fault here, as the iconv command does the
same, in any locale. Minimal test case:

| $ printf "\xd6\x76\n" | iconv -c -f gb2312
| v

    I understand your point of view about "??" versus "?v". But I
suppose one can argue that it also makes sense to restart as soon as
possible after something invalid, i.e. to print the valid ASCII "v".
I vaguely recall having run into such cases before. Anyway, for this
part of the problem, only the iconv people can have the last word.
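
    To make the two points of view concrete, here is a minimal sketch
in Python. It is only an illustration: decode_gb2312 and its
resync_early flag are hypothetical names, it has nothing to do with the
actual iconv or Mutt code, and Python's GB2312 table is assumed to agree
with iconv's for these bytes.

```python
def decode_gb2312(data: bytes, resync_early: bool) -> str:
    """Decode GB2312 bytes, printing '?' for anything undecodable.

    resync_early=True  : drop only the bad lead byte, then restart
                         (the "?v" behaviour).
    resync_early=False : treat lead byte + next byte as one unknown
                         double-byte character (the "??" behaviour).
    """
    out = []
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            # Plain ASCII byte, copied through as-is.
            out.append(chr(data[i]))
            i += 1
            continue
        pair = data[i:i + 2]
        try:
            # Strict decode of exactly one candidate double-byte character.
            out.append(pair.decode("gb2312"))
            i += 2
        except UnicodeDecodeError:
            if resync_early or len(pair) < 2:
                out.append("?")
                i += 1
            else:
                out.append("??")
                i += 2
    return "".join(out)


print(decode_gb2312(b"\xd6\x76", resync_early=True))   # ?v
print(decode_gb2312(b"\xd6\x76", resync_early=False))  # ??
```

    Both results are defensible; the flag only decides how many bytes
one invalid lead byte is allowed to swallow.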


    Note that "?h" near the end of the string is special: it's invalid
GB2312; once seen as GB18030 it becomes valid, but even then it's still
unconvertible to BIG5.

| $ printf ">\xad\xf8<\n" | iconv -c -f gb2312
| ><
| $ printf ">\xad\xf8<\n" | iconv -c -f gb18030
| ><
| $ printf ">\xad\xf8<\n" | iconv -f gb18030 -t utf-8
| ><


> The output "祿" [b8 53] and "櫻" [c4 e5] cannot be explained; they
> don't seem to be related to the original gb18030 in any way.

    I don't have the same display here: where you see "?祿??櫻?", I see
"??活??,", which seems to be correct given the string is "毬活動," in
GB18030. Strangely, however, iconv doesn't agree:

| $ printf ">\x9a\xc2\xbb\xee\x84\xd3\xa3\xac<\n" | iconv -c -f gb2312
| ><

    ...no output at all. While GB18030 is of course OK:

| $ printf ">\x9a\xc2\xbb\xee\x84\xd3\xa3\xac<\n" | iconv -f gb18030
| >毬活動,<

    Explaining this brokenness is not so hard: in those 8 bytes, 5 of
the 7 possible pairs are either invalid GB2312 or unconvertible to
BIG5. Only 2 pairs are OK.
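
    The pair count can be checked with a short Python sketch that walks
over the 7 adjacent byte pairs and classifies each one. The classify
helper is a hypothetical name, and Python's codec tables are assumed to
be close enough to iconv's here; they may differ in corner cases.

```python
data = bytes.fromhex("9ac2bbee84d3a3ac")


def classify(pair: bytes) -> str:
    """Say whether a 2-byte pair survives GB2312 -> BIG5 conversion."""
    try:
        char = pair.decode("gb2312")
    except UnicodeDecodeError:
        return "invalid GB2312"
    try:
        char.encode("big5")
    except UnicodeEncodeError:
        return "valid GB2312, not in BIG5"
    return "ok"


# Every pair of neighbouring bytes, at even and odd offsets alike.
pairs = [data[i:i + 2] for i in range(len(data) - 1)]
for pair in pairs:
    print(pair.hex(), classify(pair))
```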


> This *might* be the same bug as or a related bug of #249626

    I'd say two closely related multi-byte resync problems, but with
different causes: #249626 is only a Mutt bug, while this #402035 seems
to be an iconv problem, partly confused by a layer of #249626.

    To give the correct "??活??,", iconv should consider bytes in
GB2312 only by even pairs, even when the second high-bit byte could be
the first of a valid character (an odd pair)... It's obvious that this
would be the thing to do in your specific example. But it could break
other situations. I'm unsure. A problem, right, but maybe not a bug?
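
    As an illustration of the "even pairs only" idea, a minimal Python
sketch (decode_even_pairs is a hypothetical name, not what iconv does,
and it assumes the input is one unbroken run of double-byte characters):

```python
def decode_even_pairs(data: bytes) -> str:
    """Decode strictly at offsets 0, 2, 4, ... and never resync on an
    odd offset. An undecodable pair becomes one '?' per byte."""
    out = []
    for i in range(0, len(data), 2):
        pair = data[i:i + 2]
        try:
            out.append(pair.decode("gb2312"))
        except UnicodeDecodeError:
            out.append("?" * len(pair))
    return "".join(out)


# The 8 bytes from the report: only bb ee and a3 ac decode.
print(decode_even_pairs(bytes.fromhex("9ac2bbee84d3a3ac")))
```

    The fixed alignment is also the weak spot: a single ASCII byte
between two double-byte characters shifts every later pair by one, which
is the kind of "other situation" this could break.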


Bye!    Alain.
-- 
Software should be written to deal with every conceivable error
        RFC 1122 / Robustness Principle
