On 12/18/2017 04:06 PM, Piotr Gackiewicz wrote:
Hello,
I have spotted a bizarre bug in GNU find.
In some circumstances, the result of a '-regex' search depends heavily
on the locale settings.
I have attached a zip file with an example file tree. It contains two
directories: one whose name is encoded in UTF-8 and another whose name
is encoded in ISO-8859-2.
Now we run find, trying to find files matching the regex '.*\.exe$':
$ LANG=pl_PL.iso-8859-2 find htdocs -type f -regex '.*\.exe$' -ls
12845718 12 -rw-rw-r-- 1 gacek gacek 2 Dec 18 15:00
htdocs/Zielona\ G\363ra/hidden_malware.exe
12845721 12 -rw-rw-r-- 1 gacek gacek 2 Dec 18 15:00
htdocs/Zielona\ G\303\263ra/malware.exe
Never mind the output encoding, it's expected. Luckily, we have found
both .exe files.
But now, let's try changing the locale to something more modern:
$ LANG=pl_PL.utf-8 find htdocs -type f -regex '.*\.exe$' -ls
12845721 12 -rw-rw-r-- 1 gacek gacek 2 gru 18 15:00
htdocs/Zielona\ G\303\263ra/malware.exe
We have found only one of these files. The one with the ISO-encoded
filename is hidden!
The test case in your attachment is a bit different, but also shows
the problem. It seems that gnulib's regex does not find a match for
the pattern '.*\.exe$' for the files in the following directory:
$ LC_ALL=C /usr/bin/ls -log htdocs
...
drwxr-xr-x 2 4096 Dec 18 20:45 'Zielona G'$'\363''ra'
...
I'm not an expert on UTF-8 and regex, but it seems that the $'\363'
byte, which does not form a valid character on its own in a UTF-8
locale, is not matched by the dot '.' metacharacter in your locale.
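
A possible workaround, sketched under the assumption that the C locale
treats every byte as a single character (so '.' can match $'\363'), is to
run find with LC_ALL=C. The scratch directory and the variable names below
are made up for illustration; only the file names mirror the ones in the
report. Creating the ISO-8859-2 name assumes a file system that accepts
arbitrary bytes in file names, as is typical on Linux.

```shell
#!/bin/sh
# Hypothetical scratch tree; the directory names mirror the report.
tmp=$(mktemp -d)
latin2_dir="$tmp/$(printf 'Zielona G\363ra')"      # ISO-8859-2 encoded name
utf8_dir="$tmp/$(printf 'Zielona G\303\263ra')"    # UTF-8 encoded name
mkdir "$latin2_dir" "$utf8_dir"
: > "$latin2_dir/hidden_malware.exe"
: > "$utf8_dir/malware.exe"

# Assumption: in the C locale the regex engine works byte by byte,
# so the byte \363 is matched by '.' like any other character.
count=$(LC_ALL=C find "$tmp" -type f -regex '.*\.exe$' | wc -l | tr -d ' ')
echo "files matched with LC_ALL=C: $count"

rm -rf "$tmp"
```

If the assumption holds, both .exe files should be counted regardless of
how the directory names are encoded.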
Have a nice day,
Berny