David Byron:
>> > And my ~/.inputrc contains:
>> >
>> > set meta-flag on
>> > set convert-meta off
>> > set input-meta on
>> > set output-meta on
>>
>> Makes plenty of sense. But note that meta-flag is a synonym for
>> input-meta, so you can remove one of them.
>
> I was just following the instructions at
> http://cygwin.com/faq/faq-nochunks.html#faq.using.unicode

I see. FAQ maintainers, can we have the meta-flag removed?

[time passes]

Actually, it appears that bash/readline automatically sets those flags as
shown if the locale is anything but "C". So since the default locale is
"C.UTF-8" and non-ASCII stuff can't be expected to work in the "C" locale
anyway, I think the whole FAQ entry could just be removed. Similarly, the
commented-out settings of those flags in /etc/skel/.inputrc could go.

>> > $ echo $LC_ALL
>> > en_US
>>
>> Hang on, where did that come from?
>
> When my cygwin.bat has set LANG=en_US.UTF-8, I get LANG=en_US.UTF-8 and
> LC_ALL=en_US in bash. When my cygwin.bat doesn't set LANG, I get
> LC_ALL=en_US and LANG isn't set.

So where does LC_ALL get set? In the system-wide environment (in
Computer->Properties->Advanced->Environment Variables)? Or in one of the
bash startup files?

> I unset LC_ALL and...

Where? I'm asking because if it's set to 'en_US' at the point bash is
invoked, but unset afterwards, then bash will be using CP1252 while
programs invoked by it will use UTF-8, which of course is bound to cause
trouble ...

> Now ls foo<tab> adds the actual accented character to the command line, but
> when I press return I get:
>
> ls: cannot access foo<a gray box>: No such file or directory

... like that ...

> when I pipe the error message to od -c, the gray box is octal 351 or 0xE9.
>
> I still get the right answer from test -f, when using the shell builtin.
> /usr/bin/test tells me the file doesn't exist.

... and that.

>> The \x18 scheme is only used for codepoints that cannot be
>> represented in the selected character set, yet U+00E9 can be
>> represented in CP1252. By definition, any Unicode codepoint can be
>> represented in UTF-8, so the \x18 scheme is never used when that is
>> selected.
>>
>> To enable C-style backslash interpretation, you need to use
>> $'...' quoting.
>
> I now see the bash man page explains this. Must have missed it the first
> time.
> The above paragraphs with some examples (where \x18 is needed and
> where it isn't) added to
> http://cygwin.com/cygwin-ug-net/using-specialnames.html#pathnames-unusual
> would have gotten me farther before posting.

But what I said is explained there already:

"If you don't want or can't use UTF-8 as character set for whatever
reason, you will nevertheless be able to access the file. How does that
work? When Cygwin converts the filename from UTF-16 to your character
set, it recognizes characters which can't be converted. If that occurs,
Cygwin replaces the non-convertible character with a special character
sequence. The sequence starts with an ASCII CAN character (hex code 0x18,
equivalent Control-X), followed by the UTF-8 representation of the
character. The result is a filename containing some ugly looking
characters. While it doesn't look nice, it is nice, because Cygwin knows
how to convert this filename back to UTF-16. The filename will be
converted using your usual character set. However, when Cygwin recognizes
an ASCII CAN character, it skips over the ASCII CAN and handles the
following bytes as a UTF-8 character. Thus, the filename is symmetrically
converted back to UTF-16 and you can access the file."

Best to use UTF-8, though, and forget that you've ever heard about the ^X
scheme. You're certainly not expected to have to enter \x18 on the command
line to access non-ASCII filenames.

>> Have a look in your root directory. There should be a file
>> called x18 there.
>
> I don't see anything in my cygwin root (/) but I do see x18 in the root of
> my C drive. Thanks.

Ah yes, '\x18' is interpreted as a DOS path, so you get the root of your
system drive rather than the Cygwin root.

> And finally here are the steps that illustrate what's going on.
>
> $ touch $'\x18'; echo $?
> 0
>
> ls shows a file named up-arrow (0x18):

What do you mean by up-arrow? I'm getting a question mark, because that's
what ls prints for non-printable characters by default.
You can choose various quoting styles using the --quoting-style option.

> $ ls<tab>
> ^X
>
> which seems inconsistent.

Yep, but that's a bash vs ls issue rather than a Cygwin one. You'd get the
same on Linux. But if you use control characters in filenames, you'd better
know what you're doing anyway. Some argue that they shouldn't be allowed in
the first place, e.g.
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html

> $ mkshortcut -n shortcut$'\xC3\xA9' plain; echo $?
> $ readshortcut shortcut$'\xE9'

I'm afraid these aren't yet Unicode-ready, i.e. they still use Windows
"ANSI" APIs.

Andy
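
P.S. For anyone who wants to poke at the ^X scheme quoted from the User's
Guide above, here is a rough Python model of it. It is only a sketch of the
behaviour as that passage describes it, not Cygwin's actual code: the
function names are made up for illustration, the decoder assumes a
single-byte character set such as CP1252, and literal ^X characters in a
filename are not handled.

```python
CAN = b"\x18"  # ASCII CAN (Control-X), the escape byte named in the User's Guide


def encode_filename(name: str, charset: str) -> bytes:
    """Convert a filename to `charset`, escaping characters it cannot hold.

    Hypothetical helper: models the description in the User's Guide only.
    """
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(charset)
        except UnicodeEncodeError:
            # Not representable: emit CAN followed by the character's UTF-8 bytes.
            out += CAN + ch.encode("utf-8")
    return bytes(out)


def utf8_len(lead: int) -> int:
    """Length of a UTF-8 sequence, judged from its first byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte: 0x%02x" % lead)


def decode_filename(raw: bytes, charset: str) -> str:
    """Reverse encode_filename(); assumes `charset` is single-byte."""
    out = []
    i = 0
    while i < len(raw):
        if raw[i:i + 1] == CAN and i + 1 < len(raw):
            # Skip the CAN and read the following bytes as one UTF-8 character.
            n = utf8_len(raw[i + 1])
            out.append(raw[i + 1:i + 1 + n].decode("utf-8"))
            i += 1 + n
        else:
            out.append(raw[i:i + 1].decode(charset))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(encode_filename("foo\u00e9", "cp1252"))  # b'foo\xe9'
    print(encode_filename("foo\u00e9", "utf-8"))   # b'foo\xc3\xa9'
    print(encode_filename("foo\u00e9", "ascii"))   # b'foo\x18\xc3\xa9'
```

The first two results are the mismatch from this thread: a shell working in
CP1252 hands over the single byte 0xE9 (octal 351), while a program working
in UTF-8 expects 0xC3 0xA9. The ^X escape only shows up in the third case,
where the character set cannot represent U+00E9 at all.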