On Tue, Nov 05, 2013 at 08:17:15AM -0800, Jim Meyering wrote:
...
> 
> Hi Dave,
> 
> I agree, and so does pcregrep.  There are a few other problems with
> grep's PCRE driver code: for example, a problem (no matter how serious)
> in one file should not cause the entire grep run to exit; grep should
> continue processing remaining files. And when grep reports the problem,
> it should include at least the file name in the diagnostic.
> 
> I will fix those before the upcoming snapshot.
> 
> Thanks,
> Jim

Hi there,

This bug was also reported in Debian ( http://bugs.debian.org/730472 ).

Having looked at it, I think the most suitable solution for the moment
is to set the PCRE_NO_UTF8_CHECK flag instead of PCRE_UTF8, so that
PCRE does not check whether the input is valid UTF-8. The resulting
behavior is similar to that of grep before 2.15 (see
15758-PCRE-no-check-UTF8.patch).

$ grep -Pr "DEFINE" /usr/lib/linux-kbuild-3.2/
/usr/lib/linux-kbuild-3.2/scripts/kernel-doc:   if ($prototype =~ m/DEFINE_SINGLE_EVENT\((.*?),/) {
/usr/lib/linux-kbuild-3.2/scripts/kernel-doc:   if ($prototype =~ m/DEFINE_EVENT\((.*?),(.*?),/) {
/usr/lib/linux-kbuild-3.2/scripts/kernel-doc:## if ($prototype =~ m/SYSCALL_DEFINE0\s*\(\s*(a-zA-Z0-9_)*\s*\)/) {
...
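
For readers unfamiliar with it, the validity check under discussion is
roughly the following test on the input bytes. This is only an
illustrative sketch in plain C, not PCRE's actual implementation; the
function name utf8_is_valid is made up for this example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper (not part of grep or PCRE): return true if
   BUF[0..SIZE) is well-formed UTF-8.  */
static bool
utf8_is_valid (char const *buf, size_t size)
{
  unsigned char const *p = (unsigned char const *) buf;
  size_t i = 0;

  while (i < size)
    {
      unsigned char c = p[i];
      size_t len;          /* total bytes in this sequence */
      unsigned long min;   /* smallest code point allowed for this length */
      unsigned long cp;    /* decoded code point */

      if (c < 0x80)
        {
          i++;             /* plain ASCII byte */
          continue;
        }
      else if ((c & 0xE0) == 0xC0)
        { len = 2; min = 0x80; cp = c & 0x1F; }
      else if ((c & 0xF0) == 0xE0)
        { len = 3; min = 0x800; cp = c & 0x0F; }
      else if ((c & 0xF8) == 0xF0)
        { len = 4; min = 0x10000; cp = c & 0x07; }
      else
        return false;      /* stray continuation or invalid lead byte */

      if (size - i < len)
        return false;      /* sequence truncated at end of buffer */

      for (size_t k = 1; k < len; k++)
        {
          if ((p[i + k] & 0xC0) != 0x80)
            return false;  /* expected a continuation byte */
          cp = (cp << 6) | (p[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates and out-of-range values.  */
      if (cp < min || cp > 0x10FFFF || (0xD800 <= cp && cp <= 0xDFFF))
        return false;

      i += len;
    }
  return true;
}
```

A lone Latin-1 byte such as 0xE9 fails this test, which is the kind of
input that can trigger the error in a UTF-8 locale.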


I have also tested printing a warning instead of exiting when a file
contains invalid UTF-8 (15758-PCRE-no-exit-UTF8.patch), but the result can
be annoying, since the warning is shown even for files that do not match:

$ grep -Pr "DEFINE" /usr/lib/linux-kbuild-3.2/
grep: invalid UTF-8 byte sequence in input
grep: invalid UTF-8 byte sequence in input
grep: invalid UTF-8 byte sequence in input
grep: invalid UTF-8 byte sequence in input
grep: invalid UTF-8 byte sequence in input
grep: invalid UTF-8 byte sequence in input
...
/usr/lib/linux-kbuild-3.2/scripts/kernel-doc:   if ($prototype =~ m/DEFINE_SINGLE_EVENT\((.*?),/) {
/usr/lib/linux-kbuild-3.2/scripts/kernel-doc:   if ($prototype =~ m/DEFINE_EVENT\((.*?),(.*?),/) {
/usr/lib/linux-kbuild-3.2/scripts/kernel-doc:## if ($prototype =~ m/SYSCALL_DEFINE0\s*\(\s*(a-zA-Z0-9_)*\s*\)/) {
...
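
A possible refinement of the warning approach, along the lines Jim
suggested, would be to name the file in the diagnostic and to print it at
most once per file. The following is a minimal sketch only: struct
file_state and warn_bad_utf8_once are hypothetical names, and grep's real
code tracks the current file differently.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical per-file bookkeeping, for illustration only.  */
struct file_state
{
  char const *name;   /* file currently being searched */
  bool warned;        /* true once the diagnostic has been printed */
};

/* Print the invalid-UTF-8 diagnostic for F on OUT at most once,
   including the file name.  Return true if a message was printed.  */
static bool
warn_bad_utf8_once (struct file_state *f, FILE *out)
{
  if (f->warned)
    return false;
  f->warned = true;
  fprintf (out, "grep: %s: invalid UTF-8 byte sequence in input\n",
           f->name);
  return true;
}
```

This does not avoid warnings for files that would not have matched, but
it would at least make each message attributable to a file.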

I propose 15758-PCRE-no-check-UTF8.patch as the solution, at least
temporarily.

Regards,

Santiago

diff --git a/src/pcresearch.c b/src/pcresearch.c
index 9ba1227..939e8d6 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -62,7 +62,7 @@ Pcompile (char const *pattern, size_t size)
 
 #if defined HAVE_LANGINFO_CODESET
   if (STREQ (nl_langinfo (CODESET), "UTF-8"))
-    flags |= PCRE_UTF8;
+    flags |= PCRE_NO_UTF8_CHECK;
 #endif
 
   /* FIXME: Remove these restrictions.  */
diff --git a/src/pcresearch.c b/src/pcresearch.c
index 9ba1227..8002507 100644
--- a/src/pcresearch.c
+++ b/src/pcresearch.c
@@ -186,8 +186,9 @@ Pexecute (char const *buf, size_t size, size_t *match_size,
                  _("exceeded PCRE's backtracking limit"));
 
         case PCRE_ERROR_BADUTF8:
-          error (EXIT_TROUBLE, 0,
+          error (0, 0,
                  _("invalid UTF-8 byte sequence in input"));
+          break;
 
         default:
           /* For now, we lump all remaining PCRE failures into this basket.
