Re: Improper UTF-8 combining character handling

2007-06-12 Thread Sean Burke
I've retried with 3.2-17 with the same results. Notably, the issue isn't
(and has not been) that all multibyte characters are handled improperly.
Instead, sequences that contain combining characters are handled
inconsistently. For example, the precomposed character LATIN CAPITAL
LETTER D WITH DOT ABOVE, U+1E0A, is handled properly. However, the
equivalent sequence U+0044 + U+0307, consisting of D and COMBINING DOT
ABOVE, is not. A single backspace removes both characters from the
input line, but only the COMBINING DOT ABOVE glyph is removed from the
display.
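
To make the equivalence of the two forms concrete, here is a minimal
sketch. It assumes Python 3.8+ and its standard unicodedata module,
neither of which appears in this thread, and the helper name describe()
is purely illustrative. It lists the code points and UTF-8 bytes of
both forms and checks that they normalize to each other.

```python
import unicodedata

precomposed = "\u1E0A"       # LATIN CAPITAL LETTER D WITH DOT ABOVE
decomposed = "\u0044\u0307"  # D followed by COMBINING DOT ABOVE


def describe(text):
    """Return (code point, Unicode name) pairs for each character.

    Hypothetical helper for illustration only.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# One code point versus two code points for the same visible letter.
print(describe(precomposed))
print(describe(decomposed))

# The UTF-8 byte sequences differ even though the rendering should not.
print(precomposed.encode("utf-8").hex(" "))
print(decomposed.encode("utf-8").hex(" "))

# Normalization Form C composes the pair; Form D decomposes it again.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == decomposed)
```

Both forms happen to take three bytes in UTF-8, so byte counting alone
cannot tell them apart; the difference is in the number of code points.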

Most likely, bash is treating the sequence as a single character, either
because its semantics say that a combining sequence is a single
character, or because the sequence is handled as its Normalization Form
C equivalent, the single D WITH DOT ABOVE character. Either way,
however, the glyphs are being treated separately and deleted one at a
time. The best resolution, if this can be reproduced, seems to be to
treat each character and glyph in the combining sequence separately
unless specifically told to normalize (such as when the argument is a
filename).
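
The two backspace policies discussed above can be sketched as follows.
This is only an illustration in Python 3, not how bash or readline is
implemented. Both function names are hypothetical, and using a nonzero
canonical combining class to detect combining marks is a simplification
of full grapheme cluster segmentation.

```python
import unicodedata


def delete_last_code_point(line):
    """Policy 1: a backspace removes exactly one code point."""
    return line[:-1]


def delete_last_cluster(line):
    """Policy 2: a backspace removes a base character plus its marks.

    Assumption: any character with a nonzero canonical combining class
    belongs to the preceding base character.
    """
    end = len(line)
    # Skip backwards over trailing combining marks.
    while end > 0 and unicodedata.combining(line[end - 1]) != 0:
        end -= 1
    # Then drop the base character itself, if there is one.
    if end > 0:
        end -= 1
    return line[:end]


line = "a" + "\u0044\u0307"  # "a", then D + COMBINING DOT ABOVE
print(len(delete_last_code_point(line)))  # D is still in the buffer
print(len(delete_last_cluster(line)))     # only "a" is left
```

Under the first policy it takes two backspaces to remove the decomposed
sequence; under the second it takes one, matching the precomposed form.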


Sean Burke

Benno Schulenberg wrote:
> Sean Burke wrote:
>   
>> The Unicode normalization test data at
>> http://www.unicode.org/Public/UNIDATA/NormalizationTest.txt 
>> contains many sequences of this sort. 
>> The first character sequence, LATIN CAPITAL LETTER D WITH DOT
>> ABOVE, does produce this problem.
>> Paste it into the command line, then backspace through it. The
>> problem should be reproduced immediately.
>> 
>
> Cannot reproduce it with bash-3.2-17.  Please retry with patch level 
> 17.  Patch 16 specifically addresses multibyte characters.
>
> Benno
>
>   



___
Bug-bash mailing list
Bug-bash@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-bash


Re: Improper UTF-8 combining character handling

2007-06-12 Thread Andreas Schwab
Sean Burke <[EMAIL PROTECTED]> writes:

> I've retried with 3.2-17 with the same results. Notably, the issue isn't
> (and has not been) that all multibyte characters are handled improperly.
> Instead, sequences that contain combining characters are handled
> inconsistently. For example, the precomposed character LATIN CAPITAL
> LETTER D WITH DOT ABOVE, U+1E0A, is handled properly. However, the
> equivalent sequence U+0044 + U+0307, consisting of D and COMBINING DOT
> ABOVE, is not. A single backspace removes both characters from the
> input line, but only the COMBINING DOT ABOVE glyph is removed from the
> display.

That looks like a bug in your terminal emulator.  The sequence U+0044
U+0307 should occupy exactly one screen column, because the second
character combines with the first one, i.e. it should render
identically to U+1E0A.  This works correctly with current versions of
xterm or konsole.
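
The one-column expectation can be approximated in a few lines.  This is
a rough sketch in Python 3, assuming only the standard unicodedata
module; display_columns() is a hypothetical helper that imitates what a
wcwidth()-style function computes, and it is not the code that xterm,
konsole or bash actually use.  Control characters are not handled.

```python
import unicodedata


def display_columns(text):
    """Estimate how many terminal columns a string should occupy.

    Assumptions: nonspacing marks, enclosing marks and format
    characters take zero columns, East Asian wide and fullwidth
    characters take two, and everything else takes one.
    """
    columns = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cf"):
            continue  # combines with the previous character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2
        else:
            columns += 1
    return columns


print(display_columns("\u1E0A"))        # precomposed form
print(display_columns("\u0044\u0307"))  # D + COMBINING DOT ABOVE
```

Both calls should report the same width; a terminal that advances the
cursor by two columns for the second string would show the symptom
described above.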

Andreas.

-- 
Andreas Schwab, SuSE Labs, [EMAIL PROTECTED]
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."

