I think PCRE does not have this problem. What about implementing '-regextype pcre'? Do you think it's a good idea?
Regards,

2017-12-19 23:01 GMT+01:00 Piotr Gackiewicz <p.gackiew...@gmail.com>:

> So, if it's by-design, users should be explicitly warned in man page:
>
> Do not rely on -regex, you could miss some badly encoded filenames.
>
> Or perhaps find could be enhanced with another regextype, non-posix and
> matching those ;-).
>
> Regards,
>
>
> 2017-12-19 22:48 GMT+01:00 Eric Blake <ebl...@redhat.com>:
>
>> On 12/19/2017 03:31 PM, Bernhard Voelker wrote:
>>
>>> The test case in your attachment is a bit different, but also shows
>>> the problem. It seems that gnulib's regex does not find a match for
>>> the pattern '.*\.exe$' for the files in the following directory:
>>>
>>> $ LC_ALL=C /usr/bin/ls -log htdocs
>>> ...
>>> drwxr-xr-x 2 4096 Dec 18 20:45 'Zielona G'$'\363''ra'
>>> ...
>>>
>>> I'm not an expert on UTF and regex, but it seems that the $'\363'
>>> character is not matched by the dot '.' meta character in your
>>> locale.
>>>
>>
>> POSIX says that regex only has to match characters (in particular, the
>> glob '.' matches characters, not encoding errors). If you pick a locale
>> with multibyte characters that are subject to encoding errors when
>> processing random bytes (as is the case when using a UTF-8 locale to
>> process single-byte ISO filenames), then POSIX says regex behavior is
>> undefined. So while it is indeed annoying that find can't match files with
>> encoding errors, it is somewhat expected behavior, because there's no sane
>> way to make regex well-specified on encoding errors.
>>
>> --
>> Eric Blake, Principal Software Engineer
>> Red Hat, Inc. +1-919-301-3266
>> Virtualization: qemu.org | libvirt.org
>>
>
> --
> Piotr Gackiewicz
>

--
Piotr Gackiewicz
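
P.S. To illustrate the point in the quoted text about characters versus encoding errors, here is a small Python sketch. It is only an illustration: Python's re module is not gnulib's regex, and the file name "setup.exe" is made up for the example. It shows that the directory name from the ls output is not valid UTF-8, while a byte-oriented match (roughly what I would expect from a single-byte locale) still finds the '.exe' suffix.

```python
import re

# Directory name from the thread, as raw bytes. 0xF3 (octal \363) is 'ó' in
# ISO-8859-2, but on its own it is not a valid UTF-8 sequence.
# The file name "setup.exe" is hypothetical, added only for this example.
path = b"htdocs/Zielona G\xf3ra/setup.exe"


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A bytes pattern works on bytes, so '.' matches 0xF3 like any other byte.
byte_pattern = re.compile(rb".*\.exe$")

utf8_ok = is_valid_utf8(path)                      # encoding error in UTF-8
byte_match = byte_pattern.match(path) is not None  # byte-wise match succeeds
latin2_text = path.decode("iso-8859-2")            # decodes fine as ISO-8859-2

print(utf8_ok, byte_match, latin2_text)
```

So the name is only "badly encoded" from the point of view of a UTF-8 locale; treated as bytes, or as ISO-8859-2, there is nothing for the dot to trip over.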