On Tue, Nov 05, 2013 at 08:17:15AM -0800, Jim Meyering wrote:
...
>
> Hi Dave,
>
> I agree, and so does pcregrep.  There are a few other problems with
> grep's PCRE driver code: for example, a problem (no matter how serious)
> in one file should not cause the entire grep run to exit; grep should
> continue processing remaining files.  And when grep reports the problem,
> it should include at least the file name in the diagnostic.
>
> I will fix those before the upcoming snapshot.
>
> Thanks,
> Jim
>
Hi there,

This bug was also reported in Debian (http://bugs.debian.org/730472).

Having looked at it, I think the most suitable solution for the moment
is to set PCRE_NO_UTF8_CHECK instead of PCRE_UTF8, so that PCRE does not
check whether the input is valid UTF-8.  The resulting behavior is
similar to that of grep before 2.15 (see
15758-PCRE-no-check-UTF8.patch):

  $ grep -Pr "DEFINE" /usr/lib/linux-kbuild-3.2/
  /usr/lib/linux-kbuild-3.2/scripts/kernel-doc: if ($prototype =~ m/DEFINE_SINGLE_EVENT\((.*?),/) {
  /usr/lib/linux-kbuild-3.2/scripts/kernel-doc: if ($prototype =~ m/DEFINE_EVENT\((.*?),(.*?),/) {
  /usr/lib/linux-kbuild-3.2/scripts/kernel-doc:## if ($prototype =~ m/SYSCALL_DEFINE0\s*\(\s*(a-zA-Z0-9_)*\s*\)/) {
  ...

I have also tested printing a message when a file contains invalid
UTF-8 (15758-PCRE-no-exit-UTF8.patch), but the results can be annoying,
since a warning is shown even for files that do not match:

  $ grep -Pr "DEFINE" /usr/lib/linux-kbuild-3.2/
  grep: invalid UTF-8 byte sequence in input
  grep: invalid UTF-8 byte sequence in input
  grep: invalid UTF-8 byte sequence in input
  grep: invalid UTF-8 byte sequence in input
  grep: invalid UTF-8 byte sequence in input
  grep: invalid UTF-8 byte sequence in input
  ...
  /usr/lib/linux-kbuild-3.2/scripts/kernel-doc: if ($prototype =~ m/DEFINE_SINGLE_EVENT\((.*?),/) {
  /usr/lib/linux-kbuild-3.2/scripts/kernel-doc: if ($prototype =~ m/DEFINE_EVENT\((.*?),(.*?),/) {
  /usr/lib/linux-kbuild-3.2/scripts/kernel-doc:## if ($prototype =~ m/SYSCALL_DEFINE0\s*\(\s*(a-zA-Z0-9_)*\s*\)/) {
  ...

I propose 15758-PCRE-no-check-UTF8.patch as the solution, at least as a
temporary one.

Regards,

Santiago

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 9ba1227..939e8d6 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -62,7 +62,7 @@ Pcompile (char const *pattern, size_t size)
 
 #if defined HAVE_LANGINFO_CODESET
   if (STREQ (nl_langinfo (CODESET), "UTF-8"))
-    flags |= PCRE_UTF8;
+    flags |= PCRE_NO_UTF8_CHECK;
 #endif
 
   /* FIXME: Remove these restrictions.  */
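
For readers wondering what kind of input triggers the "invalid UTF-8
byte sequence" error, here is a minimal sketch of a well-formedness
check in the sense of RFC 3629.  It is not PCRE's implementation and is
not part of either patch; utf8_valid is a hypothetical helper written
only to illustrate which byte sequences are rejected (stray continuation
bytes, truncated sequences, overlong forms, surrogates, and code points
above U+10FFFF).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not taken from grep or PCRE: return true if the
   SIZE bytes starting at BUF are well-formed UTF-8 (RFC 3629).  */
bool
utf8_valid (char const *buf, size_t size)
{
  unsigned char const *p = (unsigned char const *) buf;
  unsigned char const *end = p + size;

  while (p < end)
    {
      unsigned char c = *p++;
      int trailing;       /* continuation bytes expected after C */
      unsigned long min;  /* smallest code point allowed at this length */
      unsigned long cp;   /* code point being decoded */

      if (c < 0x80)
        continue;         /* plain ASCII byte */
      else if ((c & 0xE0) == 0xC0)
        { trailing = 1; min = 0x80; cp = c & 0x1F; }
      else if ((c & 0xF0) == 0xE0)
        { trailing = 2; min = 0x800; cp = c & 0x0F; }
      else if ((c & 0xF8) == 0xF0)
        { trailing = 3; min = 0x10000; cp = c & 0x07; }
      else
        return false;     /* stray continuation byte or invalid lead byte */

      if (end - p < trailing)
        return false;     /* sequence truncated at end of buffer */

      while (trailing-- > 0)
        {
          if ((*p & 0xC0) != 0x80)
            return false; /* expected a continuation byte */
          cp = (cp << 6) | (*p++ & 0x3F);
        }

      /* Reject overlong encodings, surrogates, and out-of-range values.  */
      if (cp < min || cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF))
        return false;
    }

  return true;
}
```

A binary file or a Latin-1 text file will typically fail such a check
on its first byte at or above 0x80 that does not start a valid
sequence, which is why a recursive search over a mixed tree hits the
error so easily.
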
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 9ba1227..8002507 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -186,8 +186,9 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
                  _("exceeded PCRE's backtracking limit"));
 
         case PCRE_ERROR_BADUTF8:
-          error (EXIT_TROUBLE, 0,
+          error (0, 0,
                  _("invalid UTF-8 byte sequence in input"));
+          break;
 
         default:
           /* For now, we lump all remaining PCRE failures into this basket.
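
One possible way to make the non-fatal variant less noisy, and to
include the file name as Jim suggests, would be to report the problem at
most once per file.  The following is only a sketch under stated
assumptions: struct file_status and note_bad_utf8 are hypothetical names
that do not exist in grep, and grep's real driver code keeps its
per-file state differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-file state; not a structure from grep's sources.  */
struct file_status
{
  char const *name;  /* name of the file currently being searched */
  bool warned;       /* true once a diagnostic was issued for this file */
};

/* Hypothetical helper: on the first invalid-UTF-8 error for the file
   described by ST, format a diagnostic naming the file into MSG (at
   most MSGSIZE bytes) and return true.  On later errors for the same
   file, leave MSG alone and return false so the caller stays quiet.  */
bool
note_bad_utf8 (struct file_status *st, char *msg, size_t msgsize)
{
  if (st->warned)
    return false;

  st->warned = true;
  snprintf (msg, msgsize,
            "grep: %s: invalid UTF-8 byte sequence in input", st->name);
  return true;
}
```

A caller would reset the warned field when it opens the next file, print
MSG to stderr whenever the helper returns true, and then continue with
the remaining files instead of exiting.
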