Re: [PATCH 2/2] regex: test for buffer overrun

Dmitry V. Levin Wed, 10 Apr 2013 21:48:18 -0700

On Sun, Mar 31, 2013 at 04:11:41PM +0100, Nix wrote:
> On 30 Jan 2013, Paul Eggert spake thusly:
> 
> > +          /* This test is from glibc bug 15078.
> > +             The test case is from Andreas Schwab in
> > +             
> > <http://www.sourceware.org/ml/libc-alpha/2013-01/msg00967.html>.
> > +          */
> > +          static char const pat[] = "[^x]x";
> > +          static char const data[] =
> > +            "\xe1\x80\x80\xe1\x80\xbb\xe1\x80\xbd\xe1\x80\x94\xe1\x80"
> > +            "\xba\xe1\x80\xaf\xe1\x80\x95\xe1\x80\xbax";
> > +          re_set_syntax (0);
> > +          memset (&regex, 0, sizeof regex);
> > +          s = re_compile_pattern (pat, sizeof pat - 1, &regex);
> > +          if (s)
> > +            result |= 1;
> > +          else if (re_search (&regex, data, sizeof data - 1,
> > +                              0, sizeof data - 1, 0)
> > +                   != 21)
> > +            result |= 1;
> 
> I note that a glibc 2.17 with 7e2f0d2d77e4bc273fe00f99d970605d8e38d4d6
> and a445af0bc722d620afed7683cd320c0e4c7c6059 (Andreas's fix) applied
> does not crash on this test -- but does not appear to work as it expects
> (or as I'd expect) either, returning 0.


The data used for this test is not an arbitrary 25-byte sequence, it is
actually an 8-character collating element
<U1000><U103B><U103D><U1014><U103A><U102F><U1015><U103A>
defined in localedata/locales/iso14651_t1_common as <MM_FirstPerson>,
followed by character 'x'.

According to POSIX, "It is unspecified whether a non-matching list
expression matches a multi-character collating element that is not matched
by any of the expressions."

The difference between glibc and gnulib implementations of re_search() is
that glibc's re_search() matches multi-character collating elements in
that case while gnulib's re_search() doesn't.

For details see e.g. re_string_elem_size_at().

I think gnulib's tests/test-regex.c should allow glibc's re_search()
behavior so that various utilities built --without-included-regex
wouldn't be penalized by test-regex:

diff --git a/tests/test-regex.c b/tests/test-regex.c
index 5a94c14..15542f4 100644
--- a/tests/test-regex.c
+++ b/tests/test-regex.c
@@ -79,17 +79,28 @@ main (void)
           */
           static char const pat[] = "[^x]x";
           static char const data[] =
-            "\xe1\x80\x80\xe1\x80\xbb\xe1\x80\xbd\xe1\x80\x94\xe1\x80"
-            "\xba\xe1\x80\xaf\xe1\x80\x95\xe1\x80\xbax";
+            /* <U1000><U103B><U103D><U1014><U103A><U102F><U1015><U103A> */
+            "\xe1\x80\x80"
+            "\xe1\x80\xbb"
+            "\xe1\x80\xbd"
+            "\xe1\x80\x94"
+            "\xe1\x80\xba"
+            "\xe1\x80\xaf"
+            "\xe1\x80\x95"
+            "\xe1\x80\xba"
+            "x";
           re_set_syntax (0);
           memset (&regex, 0, sizeof regex);
           s = re_compile_pattern (pat, sizeof pat - 1, &regex);
           if (s)
             result |= 1;
-          else if (re_search (&regex, data, sizeof data - 1,
-                              0, sizeof data - 1, 0)
-                   != 21)
-            result |= 1;
+          else
+            {
+              i = re_search (&regex, data, sizeof data - 1,
+                             0, sizeof data - 1, 0);
+              if (i != 0 && i != 21)
+                result |= 1;
+            }
         }
 
       if (! setlocale (LC_ALL, "C"))

If the change is OK, I'll add a commit message.

> Introducing a single spurious
> character after the first byte of 'data', like so:
> 
>   static char const data[] =
>     "\xe1""a\x80\x80\xe1\x80\xbb\xe1\x80\xbd\xe1\x80\x94\xe1\x80"
>     "\xba\xe1\x80\xaf\xe1\x80\x95\xe1\x80\xbax";
> 
> changes the return value of re_search() to 16, still not right.
> (This is with locale set to en_US.UTF-8, just as in the original glibc
> testcase, which *does* pass.)

Your "single spurious character" broke that long collating element.  The
last 3 characters of it (<U102F><U1015><U103A>) appeared to be another
collating element called <MMV102FK1015A>, so glibc's re_search() is
perfectly correct to return 1+5*3 in that case.


-- 
ldv

pgpkgEAGnb2jg.pgp
Description: PGP signature

Re: [PATCH 2/2] regex: test for buffer overrun

Reply via email to