Re: accents

2011-05-15 Thread Chet Ramey
On 5/10/11 9:17 AM, Greg Wooledge wrote:

>>> Is the accented character
>>> a single-byte character, or a multi-byte character, in your locale?
>>
>> a multi-byte character, i think
>> How to confirm that ?

(Keep in mind as you read my answers that I know very little more than
anyone else about Unicode combining characters and character composition.)

> 
>> $ echo /Users/thomas/Downloads/réz | h
>> + echo $'/Users/thomas/Downloads/re?\201z'
>> + hexdump -C
>> 00000000  2f 55 73 65 72 73 2f 74  68 6f 6d 61 73 2f 44 6f  |/Users/thomas/Do|
>> 00000010  77 6e 6c 6f 61 64 73 2f  72 65 cc 81 7a 0a        |wnloads/re..z.|
>> 0000001e
> 
> Oh... now this is interesting.  In my locale (not the one I'm writing this
> email from, but the one I tested in), an é is 0xc3 0xa9 which is the UTF-8
> encoding of the Unicode character U+00E9, LATIN SMALL LETTER E WITH ACUTE.
> 
> In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
> LETTER E followed by U+0301 COMBINING ACUTE ACCENT.

That's not valid UTF-8, since UTF-8 requires that the shortest sequence
be used to encode a character.  The general problem with combining
characters still exists (the one in the message I referenced in an
earlier reply), but this case has more to do with Mac OS X and its use
of both precomposed and decomposed UTF-8 than anything.

> I'm not intimately familiar with this stuff myself, but it looks like
> a real bastard to me... I thought the point of UTF-8 was that you could
> read it a byte at a time, and know when you encountered a byte that
> signified the start of a multi-byte character.  But apparently not!
> If I'm interpreting this COMBINING ACUTE ACCENT thing properly, the
> only indicator that you are in a multi-byte character comes with the
> *second* byte, so you have to backtrack.  What idiot thought this up?

It's a way to provide a general mechanism for combining characters.  Most
locales have Unicode/UTF-8 characters defined for the most common
accented characters (e.g., U+00E9), and the U+0301 stuff is a way to add
accents to less common characters without using up a code point.  It is
going to be a bitch to handle.

> With that in mind, let's see if I can reproduce some of this problem.
> Please bear in mind that as I paste this from the test environment
> terminal into the email-writing terminal, I have to make some manual
> adjustments to preserve the observed output.

I doubt you would be able to reproduce this on any system but Mac OS X.
Mac OS X keeps filenames in decomposed Unicode and keyboard input in
precomposed Unicode.  Dragging and dropping filenames doesn't do the
decomposed-precomposed conversion.

> wooledg@wooledg:~$ touch $'re\xcc\x81z'
> wooledg@wooledg:~$ echo r?z
> r?z
> wooledg@wooledg:~$ echo r*z
> réz
> wooledg@wooledg:~$ ls -b r*z
> réz
> 
> The terminal, when presented with the string of bytes that is the filename,
> renders it as réz.  However, Bash's globbing does NOT recognize this as
> a three-character filename beginning with 'r' and ending with 'z', as
> the r?z glob was not expanded.  ls -b also doesn't think there is anything
> particularly noteworthy about this filename, which is slightly annoying.
> 
> (Bash's failure to glob this might be a second bug, or possibly another
> manifestation of the same bug you're pursuing.)

It's not a bug; that really is two characters. Just because U+00E9 and
the two-character combination U+0065 U+0301 look the same (I think the
term is identical graphemes) doesn't mean they are identical.
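
The "two characters" point can be checked outside the shell.  The following
is a minimal Python sketch, not anything bash itself runs: it assumes that
Python's fnmatch module is a fair stand-in for glob matching, in that `?`
matches exactly one code point.  The variable names are illustrative only.

```python
import fnmatch
import unicodedata

# The filename from the transcript: 'r', 'e', U+0301 COMBINING ACUTE ACCENT, 'z'.
decomposed_name = "re\u0301z"
# The same visible text with the accent folded into U+00E9.
precomposed_name = unicodedata.normalize("NFC", decomposed_name)

print(len(decomposed_name))    # 4 code points
print(len(precomposed_name))   # 3 code points

# '?' matches exactly one code point, so r?z misses the decomposed name.
print(fnmatch.fnmatchcase(decomposed_name, "r?z"))    # False
print(fnmatch.fnmatchcase(decomposed_name, "r??z"))   # True
print(fnmatch.fnmatchcase(decomposed_name, "r*z"))    # True
print(fnmatch.fnmatchcase(precomposed_name, "r?z"))   # True
```

Under that assumption the behavior matches the transcript: r*z finds the
file and r?z does not, because the pattern is counted in code points and
not in what the terminal draws.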

On RHEL 5 and Debian 9, at least, the file system stores filenames using
the same characters as used to create them.  You were able to recreate how
Mac OS X stores filenames, but:

> When I double-click and then middle-click to select and paste the filename
> as rendered by the terminal back into the terminal, however, I do not
> get re\xcc\x81z any more; rather, I get r\xc3\xa9z.  So my attempts
> to reproduce your reported problem in this way fail.

Because something does the decomposed-precomposed conversion.

> The next obvious way to reproduce the problem would be to get bash to
> produce the filename itself through tab completion, rather than pasting.
> With that in mind, I'll try to move the file to a different name that
> will be tab-completable.

The other difference is that drag-and-drop on Mac OS X (at least dropping
from the Finder) produces full pathnames.  I was able to reproduce display
problems (which I haven't yet investigated) using that, but not using
tab completion in the way you did.

(And Mac OS X does seem to have a problem with wcwidth: wcwidth on U+0301
returns 1 instead of 0).

Chet
-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/



Re: accents

2011-05-15 Thread Andreas Schwab
Chet Ramey writes:

> On 5/10/11 9:17 AM, Greg Wooledge wrote:
>
>> In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
>> LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
>
> That's not valid UTF-8, since UTF-8 requires that the shortest sequence
> be used to encode a character.

0x65 0xcc 0x81 is the correct UTF-8 encoding for the two character
sequence U+0065 U+0301.

> The general problem with combining
> characters still exists (the one in the message I referenced in an
> earlier reply), but this case has more to do with Mac OS X and its use
> of both precomposed and decomposed UTF-8 than anything.

There is no such thing as "precomposed UTF-8" and "decomposed UTF-8".
UTF-8 is an encoding of Unicode, and both NFD and NFC are valid forms of
Unicode.
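
As a rough illustration of this distinction, here is a minimal Python sketch
using only the standard library.  It encodes both forms of the accented
letter and then tries a strict UTF-8 decode.  The helper is_valid_utf8 and
the variable names are made up for this example.  The overlong byte pair at
the end is included as an assumption about what the "shortest sequence"
rule is aimed at: redundant encodings of a single code point.

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE (NFC)
decomposed = "e\u0301"   # U+0065 + U+0301 COMBINING ACUTE ACCENT (NFD)

nfc_bytes = precomposed.encode("utf-8")
nfd_bytes = decomposed.encode("utf-8")
print(nfc_bytes.hex(" "))   # c3 a9
print(nfd_bytes.hex(" "))   # 65 cc 81

# Normalization converts one form into the other.
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True


def is_valid_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Both normalization forms are well-formed UTF-8.
print(is_valid_utf8(nfc_bytes), is_valid_utf8(nfd_bytes))   # True True

# An overlong two-byte encoding of "/" (U+002F) is rejected, because the
# shortest-form rule applies to each code point individually.
overlong_slash = b"\xc0\xaf"
print(is_valid_utf8(overlong_slash))   # False
```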

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."



Re: accents

2011-05-15 Thread Chet Ramey
On 5/15/11 6:38 PM, Andreas Schwab wrote:
> Chet Ramey writes:
> 
>> On 5/10/11 9:17 AM, Greg Wooledge wrote:
>>
>>> In yours, however, it is 0x65 0xcc 0x81 which is U+0065 LATIN SMALL
>>> LETTER E followed by U+0301 COMBINING ACUTE ACCENT.
>>
>> That's not valid UTF-8, since UTF-8 requires that the shortest sequence
>> be used to encode a character.
> 
> 0x65 0xcc 0x81 is the correct UTF-8 encoding for the two character
> sequence U+0065 U+0301.

That's a non sequitur.  My point is that, as I read it, UTF-8 requires the
use of the shortest sequence that can represent a particular character.
In this case, that means that U+00E9 must be used to represent e with acute
instead of e plus U+0301.

This doesn't mean that Mac OS X and maybe bash don't have problems with
certain combining characters.

> 
>> The general problem with combining
>> characters still exists (the one in the message I referenced in an
>> earlier reply), but this case has more to do with Mac OS X and its use
>> of both precomposed and decomposed UTF-8 than anything.
> 
> There is no such thing as "precomposed UTF-8" and "decomposed UTF-8".

Sorry, I meant to write "Unicode".

> UTF-8 is an encoding of Unicode, and both NFD and NFC are valid forms of
> Unicode.

Sure, nobody's arguing that.  The point is that the UTF-8 encodings
of precomposed and decomposed Unicode are different, so you're not going
to see the same byte sequence from the keyboard as from the file system on
Mac OS X.  Applications have to work around that.
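
One possible shape for such a workaround, sketched in Python under the
assumption that the application is free to normalize both sides before
comparing.  This is not how bash or readline handle it; nfc and find_entry
are hypothetical helpers written for this example.

```python
import unicodedata
from typing import Iterable, Optional


def nfc(name: str) -> str:
    """Return the name in Unicode Normalization Form C (precomposed)."""
    return unicodedata.normalize("NFC", name)


def find_entry(typed: str, entries: Iterable[str]) -> Optional[str]:
    """Return the stored name that equals `typed` after normalization."""
    wanted = nfc(typed)
    for entry in entries:
        if nfc(entry) == wanted:
            return entry   # hand back the name exactly as stored
    return None


# Names as a file system might store them (decomposed) ...
on_disk = ["notes.txt", "re\u0301z"]
# ... and the same name as a keyboard might deliver it (precomposed).
typed = "r\u00e9z"

print(typed in on_disk)                           # False
print(find_entry(typed, on_disk) == on_disk[1])   # True
```

Returning the stored spelling, not the typed one, keeps later system calls
on the byte sequence the file system actually holds.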

Chet



Re: accents

2011-05-15 Thread Chet Ramey
On 5/15/11 6:16 PM, Chet Ramey wrote:

> The other difference is that drag-and-drop on Mac OS X (at least dropping
> from the Finder) produces full pathnames.  I was able to reproduce display
> problems (which I haven't yet investigated) using that, but not using
> tab completion in the way you did.
> 
> (And Mac OS X does seem to have a problem with wcwidth: wcwidth on U+0301
> returns 1 instead of 0).

I did a little more investigating, and it looks like that's the problem.
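
To show why a width of 1 for U+0301 throws the display off, here is a small
Python sketch.  It does not call wcwidth; display_width is a simplified,
hypothetical stand-in that assumes every character with a nonzero canonical
combining class takes zero columns and everything else takes one (so it
ignores wide and control characters).

```python
import unicodedata


def display_width(text: str) -> int:
    """Approximate column count: combining marks take 0 columns, others 1."""
    return sum(0 if unicodedata.combining(ch) else 1 for ch in text)


name = "re\u0301z"   # r, e, U+0301 COMBINING ACUTE ACCENT, z

# U+0301 has a nonzero canonical combining class, so it is a combining mark.
print(unicodedata.combining("\u0301") != 0)   # True

# Columns the terminal actually uses for the name.
print(display_width(name))   # 3

# Columns computed if the combining mark is counted as width 1.
print(len(name))             # 4
```

A line editor that believes the name occupies four columns while the
terminal draws three will place the cursor one column too far to the right
for each such mark.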

Chet