David Byron:
> I've read http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode and
> http://cygwin.com/cygwin-ug-net/setup-locale.html but I'm still stumped.
>
> My cygwin.bat now contains:
>
> @echo off
>
> C:
> chdir C:\utils\cygwin\bin
> set LANG=en_US.UTF-8
> bash --login -i
>
> And my ~/.inputrc contains:
>
> set meta-flag on
> set convert-meta off
> set input-meta on
> set output-meta on

Makes plenty of sense. But note that meta-flag is a synonym for
input-meta, so you can remove one of them.

> $ echo $LC_ALL
> en_US

Hang on, where did that come from? LC_ALL overrides any other locale
variables, including LANG. Specifying a locale without a charset means
that Cygwin 1.7.1 looks up your ANSI codepage. Assuming you're on a US
system, this means you're getting CP1252, not UTF-8.

(As an aside: Cygwin 1.7.2 changes to a Linux-compatible scheme for
locales without an explicit charset, where you'd get ISO-8859-1
instead.)

> $ echo $LANG
> en_US.UTF-8
>
> For the rest of this post, assume <special_filename> is "foo" with
> U+00E9 (e with acute accent) at the end.
>
> $ test -f <special_filename>; echo $?
>
> prints 1 when <special_filename> really does exist... depending on how
> I try to represent U+00E9 on the command line.
>
> $ ls foo<tab>
>
> adds the actual accented character to the command line (whether set
> show-all-if-ambiguous on is in ~/.inputrc or not). Then I press return
> and ls prints the filename. Then if I go through command history and
> change "ls" to "test -f" and add the "; echo $?" I get the right
> answer from test. So far so good.
>
> But, if I try to do what
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> says, the test command always fails, and ls doesn't print the
> filename. I'm not really sure how to get hex code 0x18 through bash
> and to ls/test/whatever properly.
>
> This is what I tried:
>
> $ ls "foo\x18<tab>"
> $ ls "foo\x18\xc3\xa9<tab>"
> $ ls "foo\x18\xc3\xa9*"
>
> Note that 0xC3A9 is the UTF-8 encoding of U+00E9.

There's a bunch of things wrong here. Due to the LC_ALL setting above,
U+00E9 is encoded as \xE9, not \xC3\xA9. The \x18 scheme is only used
for codepoints that cannot be represented in the selected character
set, yet U+00E9 can be represented in CP1252. And by definition, any
Unicode codepoint can be represented in UTF-8, so the \x18 scheme is
never used when that charset is selected.

Also, bash does not interpret \x specially when it appears in double
quotes (or single quotes, or unquoted):

$ echo "\x18"
\x18

To enable C-style backslash interpretation, you need to use $'...'
quoting.
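For example, a quick sketch under your current settings (where LC_ALL
gives you CP1252, so your é is the single byte \xE9; the od pipe is
only there to show that $'...' really emits the raw byte):

$ echo $'\xe9' | od -c
0000000 351  \n
0000002
$ test -f $'foo\xe9'; echo $?
0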
Finally, it would appear that bash does not complete partial UTF-8
sequences, which makes sense, as it's probably dealing with wide
characters internally.

> But they all get me nothing. Replacing "ls" with "test -f" gives me
> the same nothing. Replacing \x with \X doesn't change anything either.
>
> Perhaps interesting is that if I pipe the ls command built with tab
> completion that actually prints the filename to "od -c" I see
>
> Then for kicks I tried:
>
> $ touch "\x18"; echo $?
> 0

Have a look in your root directory. There should be a file called x18
there: bash leaves \x alone inside double quotes, and Cygwin accepts
the backslash as a path separator.

> Can someone give me a hand coming up with a command line where I can
> build up filenames that contain characters that have the high bit set
> (as well as any non-ascii character really)?

Just type them in. The 'US International' keyboard layout might be
useful here. See
http://en.wikipedia.org/wiki/Keyboard_layout#US-International.

Otherwise, use $'...', and lose the unnecessary \x18s.

Andy
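PS: Once UTF-8 is actually in effect (e.g. remove the LC_ALL setting
and start a fresh bash, so that LANG=en_US.UTF-8 takes over), the same
$'...' trick works with the two-byte UTF-8 encoding of U+00E9. A
sketch, assuming the fooé file from above exists:

$ test -f $'foo\xc3\xa9'; echo $?
0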