On Sun, Mar 31, 2013 at 04:11:41PM +0100, Nix wrote: > On 30 Jan 2013, Paul Eggert spake thusly: > > > + /* This test is from glibc bug 15078. > > + The test case is from Andreas Schwab in > > + > > <http://www.sourceware.org/ml/libc-alpha/2013-01/msg00967.html>. > > + */ > > + static char const pat[] = "[^x]x"; > > + static char const data[] = > > + "\xe1\x80\x80\xe1\x80\xbb\xe1\x80\xbd\xe1\x80\x94\xe1\x80" > > + "\xba\xe1\x80\xaf\xe1\x80\x95\xe1\x80\xbax"; > > + re_set_syntax (0); > > + memset (®ex, 0, sizeof regex); > > + s = re_compile_pattern (pat, sizeof pat - 1, ®ex); > > + if (s) > > + result |= 1; > > + else if (re_search (®ex, data, sizeof data - 1, > > + 0, sizeof data - 1, 0) > > + != 21) > > + result |= 1; > > I note that a glibc 2.17 with 7e2f0d2d77e4bc273fe00f99d970605d8e38d4d6 > and a445af0bc722d620afed7683cd320c0e4c7c6059 (Andreas's fix) applied > does not crash on this test -- but does not appear to work as it expects > (or as I'd expect) either, returning 0.
The data used for this test is not an arbitrary 25-byte sequence, it is actually an 8-character collating element <U1000><U103B><U103D><U1014><U103A><U102F><U1015><U103A> defined in localedata/locales/iso14651_t1_common as <MM_FirstPerson>, followed by character 'x'. According to POSIX, "It is unspecified whether a non-matching list expression matches a multi-character collating element that is not matched by any of the expressions." The difference between glibc and gnulib implementations of re_search() is that glibc's re_search() matches multi-character collating elements in that case while gnulib's re_search() doesn't. For details see e.g. re_string_elem_size_at(). I think gnulib's tests/test-regex.c should allow glibc's re_search() behavior so that various utilities built --without-included-regex wouldn't be penalized by test-regex: diff --git a/tests/test-regex.c b/tests/test-regex.c index 5a94c14..15542f4 100644 --- a/tests/test-regex.c +++ b/tests/test-regex.c @@ -79,17 +79,28 @@ main (void) */ static char const pat[] = "[^x]x"; static char const data[] = - "\xe1\x80\x80\xe1\x80\xbb\xe1\x80\xbd\xe1\x80\x94\xe1\x80" - "\xba\xe1\x80\xaf\xe1\x80\x95\xe1\x80\xbax"; + /* <U1000><U103B><U103D><U1014><U103A><U102F><U1015><U103A> */ + "\xe1\x80\x80" + "\xe1\x80\xbb" + "\xe1\x80\xbd" + "\xe1\x80\x94" + "\xe1\x80\xba" + "\xe1\x80\xaf" + "\xe1\x80\x95" + "\xe1\x80\xba" + "x"; re_set_syntax (0); memset (®ex, 0, sizeof regex); s = re_compile_pattern (pat, sizeof pat - 1, ®ex); if (s) result |= 1; - else if (re_search (®ex, data, sizeof data - 1, - 0, sizeof data - 1, 0) - != 21) - result |= 1; + else + { + i = re_search (®ex, data, sizeof data - 1, + 0, sizeof data - 1, 0); + if (i != 0 && i != 21) + result |= 1; + } } if (! setlocale (LC_ALL, "C")) If the change is OK, I'll add a commit message. > Introducing a single spurious > character after the first byte of 'data', like so: > > static char const data[] = > "\xe1""a\x80\x80\xe1\x80\xbb\xe1\x80\xbd\xe1\x80\x94\xe1\x80" > "\xba\xe1\x80\xaf\xe1\x80\x95\xe1\x80\xbax"; > > changes the return value of re_search() to 16, still not right. > (This is with locale set to en_US.UTF-8, just as in the original glibc > testcase, which *does* pass.) Your "single spurious character" broke that long collating element. The last 3 characters of it (<U102F><U1015><U103A>) appeared to be another collating element called <MMV102FK1015A>, so glibc's re_search() is perfectly correct to return 1+5*3 in that case. -- ldv
pgpkgEAGnb2jg.pgp
Description: PGP signature