Hello, I have filenames on my system that are in latin1; these are installed as part of my distribution. However, I have my environment set up for UTF-8.
This appears to bring about a situation where gnulib's fnmatch() function fails to match some characters with '?' and '*'. The problem appears to affect current gnulib and also coreutils 5.2.1, but not bash 3.00.16(1).

Here's an example with coreutils:-

$ ls | od -c
0000000   c   a   r   r 351   .   l   o   g   o  \n   e   n   r   o   u
0000020   l 351   .   l   o   g   o  \n   e   x   e   m   p   l   e   1
0000040   .   l   o   g   o  \n   t   r   i   a   n   g   l   e   .   l
0000060   o   g   o  \n
0000064
$ ls -1 --ignore='*'
carr?.logo
enroul?.logo
$ ls -1
carr?.logo
enroul?.logo
exemple1.logo
triangle.logo

In the above example, you will see that the filenames containing byte 0351 (octal), "LATIN SMALL LETTER E WITH ACUTE" in latin1, don't match the glob character '*': they are still listed even though --ignore='*' should have hidden every name.

Here's an example with current gnulib:-

$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./exemple1.logo
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find .
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo

However, bash does not seem to be affected:-

$ ls -1 *
carr?.logo
enroul?.logo
exemple1.logo
triangle.logo
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

Perhaps bash either isn't sensitive to whatever configuration error I have made, or it uses glob() or similar, instead of gnulib's fnmatch().

If I switch back to the C locale, the problem does not occur:-

$ unset LANG ; locale
LANG=POSIX
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo

At this point it dawns on me that 0351 is a valid Latin-1 character, and indeed is a valid Unicode code point (representing the same glyph). However, on its own it is not a valid UTF-8 sequence. The value 0351 is 11101001 in binary, and in UTF-8 this is the lead byte of a three-byte sequence:-

0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

The filenames above don't have a 10xxxxxx byte following the accented E, and so I suppose the explanation is that in my locale, those filenames have invalid multibyte character sequences in them.

The current "locate" of findutils is also affected because it also uses fnmatch(); however, if I recode the input of fnmatch() to be in UTF-8 instead of Latin-1 (by using iconv on find's output before feeding it to frcode), then the glob characters do match the filenames.

The same problem appears not to affect the gnulib regex module:-

$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./exemple1.logo
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -regex '.*'
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo

Any ideas/suggestions? Is this problem unavoidable?

Regards,
James Youngman.
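
P.S. To make the byte-level reasoning above concrete, here is a minimal Python sketch. It is not code from gnulib or findutils, and the helper names (utf8_sequence_length, is_valid_utf8, latin1_to_utf8) are hypothetical, chosen only for illustration. It checks that byte 0351 announces a three-byte UTF-8 sequence, that the Latin-1 filename is therefore not valid UTF-8, and that recoding it (as iconv would) produces a valid name.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length announced by a UTF-8 lead byte, or 0 if it cannot start a sequence."""
    if lead < 0x80:
        return 1          # 0xxxxxxx: plain ASCII
    if 0xC2 <= lead <= 0xDF:
        return 2          # 110xxxxx 10xxxxxx
    if 0xE0 <= lead <= 0xEF:
        return 3          # 1110xxxx 10xxxxxx 10xxxxxx
    if 0xF0 <= lead <= 0xF4:
        return 4          # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return 0              # continuation byte or a value never used as a lead


def is_valid_utf8(name: bytes) -> bool:
    """True if the byte string decodes cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def latin1_to_utf8(name: bytes) -> bytes:
    """Recode a Latin-1 byte string to UTF-8, as `iconv -f latin1 -t utf-8` would."""
    return name.decode("latin-1").encode("utf-8")


# The filename from the od -c dump: 'carr', byte 0351 (0xE9), '.logo'.
latin1_name = b"carr\xe9.logo"

print(utf8_sequence_length(0o351))       # sequence length announced by 0351
print(is_valid_utf8(latin1_name))        # is the raw Latin-1 name valid UTF-8?
print(latin1_to_utf8(latin1_name))       # the recoded name
```

In this sketch the raw name is rejected because the byte after 0351 is '.', not a 10xxxxxx continuation byte. After recoding, the accented E becomes the two-byte sequence 0xC3 0xA9, and the name decodes cleanly.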