[bug-gnulib] Handling of invalid multibyte character sequences in fnmatch()

James Youngman Sun, 05 Jun 2005 22:46:45 -0700

Hello,

I have filenames on my system that are in latin1; these are installed
as part of my distribution.  However, I have my environment set up for
UTF-8.


This appears to bring about a situation where gnulib's fnmatch()
function fails to match some characters with '?' and '*'.  The problem
appears to affect both current gnulib and coreutils 5.2.1, but not
bash 3.00.16(1).

Here's an example with coreutils:-

$ ls | od -c
0000000   c   a   r   r 351   .   l   o   g   o  \n   e   n   r   o   u
0000020   l 351   .   l   o   g   o  \n   e   x   e   m   p   l   e   1
0000040   .   l   o   g   o  \n   t   r   i   a   n   g   l   e   .   l
0000060   o   g   o  \n
0000064
$ ls -1 --ignore='*'
carr?.logo
enroul?.logo
$ ls -1
carr?.logo
enroul?.logo
exemple1.logo
triangle.logo

In the above example, you will see that the filenames containing byte
0351 (octal), "LATIN SMALL LETTER E WITH ACUTE" in latin1, don't match
the glob character '*'.  Here's an example with current gnulib:-

$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*' .
./triangle.logo
./exemple1.logo
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find .
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo

However, bash does not seem to be affected:-

$ ls -1 *
carr?.logo
enroul?.logo
exemple1.logo
triangle.logo
$ locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

Perhaps bash either isn't sensitive to whatever configuration error I
have made, or it uses glob() or similar, instead of gnulib's
fnmatch().

If I switch back to the C locale, the problem does not occur:-

$ unset LANG ; locale
LANG=POSIX
LC_CTYPE="POSIX"
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo

At this point it dawns on me that 0351 is a valid Latin-1 character,
and indeed is a valid Unicode code point (representing the same glyph).
However, on its own it is not a valid UTF-8 sequence.  The value 0351
is 11101001 in binary, which in UTF-8 is the lead byte of a three-byte
sequence:-

       0x00000800 - 0x0000FFFF:
           1110xxxx 10xxxxxx 10xxxxxx
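
To make the bit pattern concrete, here is a minimal Python sketch that
reads the expected sequence length off a lead byte.  Python is used
purely for illustration, and utf8_sequence_length() is a hypothetical
helper, not anything from gnulib.  It looks only at the leading bits
and does not reject lead bytes that UTF-8 forbids for other reasons.

```python
def utf8_sequence_length(lead_byte):
    """Return the UTF-8 sequence length implied by the leading bits of a
    byte, or 0 if the byte cannot start a sequence.

    Only the bit pattern is examined; lead bytes that are forbidden for
    other reasons (such as 0xC0) are not rejected here.
    """
    if lead_byte & 0b10000000 == 0b00000000:
        return 1  # 0xxxxxxx: plain ASCII
    if lead_byte & 0b11100000 == 0b11000000:
        return 2  # 110xxxxx
    if lead_byte & 0b11110000 == 0b11100000:
        return 3  # 1110xxxx
    if lead_byte & 0b11111000 == 0b11110000:
        return 4  # 11110xxx
    return 0  # 10xxxxxx continuation byte, or 11111xxx


if __name__ == "__main__":
    byte = 0o351  # the byte shown as 351 in the od output
    print(format(byte, "08b"), utf8_sequence_length(byte))
```

For byte 0351 this reports a three-byte sequence, so two continuation
bytes would have to follow it.
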

The filenames above don't have a 10xxxxxx byte following the accented
E, and so I suppose the explanation is that in my locale, those
filenames have invalid multibyte character sequences in them.  The
current "locate" of findutils is also affected because it also uses
fnmatch(); however, if I recode the input of fnmatch() to be in UTF-8
instead of Latin-1 (by using iconv on find's output before feeding it
to frcode), then the glob characters now match the filenames.
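
The two observations in this paragraph can be checked with a short
Python sketch.  The helper names are hypothetical, and the recoding
function only assumes the names really are Latin-1; it is meant to be
comparable to running iconv from ISO-8859-1 to UTF-8, not a copy of
what find or frcode do.

```python
def first_invalid_utf8_offset(name):
    """Return the offset of the first byte that breaks UTF-8 decoding,
    or None if the whole byte string is valid UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


def latin1_to_utf8(name):
    """Recode a byte string from Latin-1 to UTF-8.

    Assumes the input really is Latin-1; every byte value is a valid
    Latin-1 character, so this never fails.
    """
    return name.decode("latin-1").encode("utf-8")


if __name__ == "__main__":
    for name in (b"carr\xe9.logo", b"enroul\xe9.logo", b"exemple1.logo"):
        recoded = latin1_to_utf8(name)
        # Show the name, where it stops being valid UTF-8 (if anywhere),
        # and the same two things for the recoded name.
        print(name, first_invalid_utf8_offset(name),
              recoded, first_invalid_utf8_offset(recoded))
```

The two accented names are flagged at the offset of byte 0351, and
after recoding all three names decode cleanly.
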

The same problem appears not to affect the gnulib regex module:-

$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -name '*'
.
./triangle.logo
./exemple1.logo
$ ~/source/GNU/findutils/cvs/fixbug/compile/find/find . -regex '.*'
.
./triangle.logo
./carr?.logo
./enroul?.logo
./exemple1.logo

Any ideas/suggestions?  Is this problem unavoidable?  
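
One possible direction, sketched here in Python only to illustrate the
idea, is to treat each undecodable byte as a single character so that
'?' and '*' can still match it.  Both functions below are hypothetical.
match_strict() is just a model of the behaviour reported above (an
assumption about the cause, not gnulib's code), and Python's fnmatch
module follows slightly different pattern rules from POSIX fnmatch().

```python
import fnmatch


def match_strict(name, pattern):
    """Model of the reported behaviour: a name that is not valid UTF-8
    matches nothing.  This is an assumption, not gnulib's actual logic."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return fnmatch.fnmatchcase(text, pattern)


def match_lenient(name, pattern):
    """Map each undecodable byte to one placeholder character, so that
    glob wildcards can still match it."""
    text = name.decode("utf-8", errors="surrogateescape")
    return fnmatch.fnmatchcase(text, pattern)


if __name__ == "__main__":
    name = b"carr\xe9.logo"
    print(match_strict(name, "*"), match_lenient(name, "*"))
    print(match_lenient(name, "carr?.logo"))
```

Whether such a fallback would be appropriate for fnmatch() itself is
exactly the open question.
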

Regards,
James Youngman.


