On 12/18/2017 04:06 PM, Piotr Gackiewicz wrote:
Hello,

I have spotted a bizarre bug in GNU find.
In some circumstances, the result of a '-regex' search depends heavily
on the locale settings.

I have attached a zip file with an example file tree. It contains two
directories, one whose name is encoded in UTF-8 and the other in
ISO-8859-2.
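
For readers without the attachment, a comparable tree can be recreated
along these lines. This is only a sketch: the file and directory names
are taken from the find output in this message, while the scratch
directory and the two-byte file contents are assumptions.

```sh
#!/bin/sh
# Sketch: build a tree with one ISO-8859-2 and one UTF-8 directory name.
set -eu

tmp=$(mktemp -d)                          # scratch location (assumption)
iso_dir=$(printf 'Zielona G\363ra')       # "o-acute" as the single byte 0xF3
utf_dir=$(printf 'Zielona G\303\263ra')   # "o-acute" as the bytes 0xC3 0xB3

mkdir -p "$tmp/htdocs/$iso_dir" "$tmp/htdocs/$utf_dir"

# Placeholder contents; only the names matter for the regex test.
printf 'x\n' > "$tmp/htdocs/$iso_dir/hidden_malware.exe"
printf 'x\n' > "$tmp/htdocs/$utf_dir/malware.exe"

# Count the files without any regex, so the locale plays no role here.
count=$(find "$tmp/htdocs" -type f | wc -l | tr -d ' ')
echo "created $count files under $tmp/htdocs"
```

Running the find commands from this message inside "$tmp" should then
exercise the same kind of file names, provided the file system accepts
names that are not valid UTF-8.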

Now we run find, trying to find files matching the regex '.*\.exe$':

$ LANG=pl_PL.iso-8859-2 find htdocs -type f -regex '.*\.exe$' -ls
  12845718     12 -rw-rw-r--   1 gacek    gacek           2 Dec 18 15:00 htdocs/Zielona\ G\363ra/hidden_malware.exe
  12845721     12 -rw-rw-r--   1 gacek    gacek           2 Dec 18 15:00 htdocs/Zielona\ G\303\263ra/malware.exe

Never mind the output encoding; it's expected. Luckily, we have found both
.exe files.

But now, let's try changing the locale to something more modern:
$ LANG=pl_PL.utf-8 find htdocs -type f -regex '.*\.exe$' -ls
  12845721     12 -rw-rw-r--   1 gacek    gacek           2 gru 18 15:00 htdocs/Zielona\ G\303\263ra/malware.exe

We have found only one of these files. The one with the ISO-encoded
filename is hidden!

The test case in your attachment is a bit different, but also shows
the problem.  It seems that gnulib's regex does not find a match for
the pattern '.*\.exe$' for the files in the following directory:

  $ LC_ALL=C /usr/bin/ls -log htdocs
  ...
  drwxr-xr-x 2 4096 Dec 18 20:45 'Zielona G'$'\363''ra'
  ...

I'm not an expert on UTF and regex, but it seems that the $'\363'
byte, which is not a valid UTF-8 sequence on its own, is not matched
by the dot '.' metacharacter in your UTF-8 locale.
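
A rough way to check this idea is sketched below. It assumes iconv and
a find with -regex support are installed; the scratch directory and the
file contents are made up for illustration. The first step asks iconv
whether the directory name is valid UTF-8, the second runs the same
regex under LC_ALL=C, where each byte is treated as one character.

```sh
#!/bin/sh
# Sketch: test the "invalid byte is not matched by '.'" hypothesis.
set -eu

name=$(printf 'Zielona G\363ra')   # directory name with the raw byte 0xF3

# iconv exits non-zero when the input is not valid in the source encoding.
if printf '%s' "$name" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
    valid_utf8=yes
else
    valid_utf8=no
fi
echo "name is valid UTF-8: $valid_utf8"

# Recreate just the problematic directory in a scratch location.
tmp=$(mktemp -d)
mkdir -p "$tmp/htdocs/$name"
printf 'x\n' > "$tmp/htdocs/$name/hidden_malware.exe"

# In the C locale the regex engine works byte by byte.
matches=$(LC_ALL=C find "$tmp/htdocs" -type f -regex '.*\.exe$' | wc -l | tr -d ' ')
echo "matches under LC_ALL=C: $matches"

rm -rf "$tmp"
```

If the name is reported as invalid UTF-8 and the file is still found
under LC_ALL=C, that is consistent with the explanation above, although
it does not by itself prove where in the regex code the match fails.
Running find with LC_ALL=C may also serve as a workaround when file
names in mixed encodings have to be searched.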

Have a nice day,
Berny
