Bug#777556: perl: regexp performance regression since 5.18

Niko Tyni Fri, 27 Feb 2015 12:16:12 -0800

On Tue, Feb 10, 2015 at 09:48:21PM +0100, Niels Thykier wrote:
> On Mon, 9 Feb 2015 21:59:22 +0200 Niko Tyni <nt...@debian.org> wrote:
> > Package: perl
> > Version:: 5.20.1-5
> > Severity: serious
> > Tags: upstream
> > Forwarded: https://rt.perl.org/Ticket/Display.html?id=123743
> > 
> > This should possibly be considered a release critical performance
> > regression. Marking as 'serious' for now.
> > 
> >   (wheezy)$ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.14.2 -ne 
> > '/.*:\s*ab/i'
> >   0.02user 0.00system 0:00.02elapsed 96%CPU (0avgtext+0avgdata 
> > 3260maxresident)k
> >   0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >   
> >   (sid)$ seq -f "%01000.0f: 0c" 1000 | /usr/bin/time perl5.20.1 -ne 
> > '/.*:\s*ab/i'
> >   7.41user 0.00system 0:07.41elapsed 99%CPU (0avgtext+0avgdata 
> > 4100maxresident)k
> >   0inputs+0outputs (0major+204minor)pagefaults 0swaps


> David Mitchell pushed a patch[1] for this issue.

> [1]
> http://perl5.git.perl.org/perl.git/commitdiff/0fa70a06a98fc8fa9840d4dbaa31fc2d3b28b99b

Thanks. This needs some backporting; as t/re/speed.t doesn't exist in
maint-5.20, I think we're best off dropping that part. The regexec.c
hunks apply with just a slight comment change:

--    /*  [unless only anchor is MBOL - implying multiline is set] */
+-    /*  [unless only anchor is BOL and multiline is set] */

I've built a test 5.20.1 based package successfully with the attached
version of the patch and verified it fixes the slowdown above.
So I suppose it should go in the next upload.
-- 
Niko Tyni   nt...@debian.org

>From 241d34b5e95859ddd63233a6c7c8acd34e759d4c Mon Sep 17 00:00:00 2001
From: David Mitchell <da...@iabyn.com>
Date: Tue, 10 Feb 2015 12:17:51 +0000
Subject: simpify and speed up /.*.../ handling

See RT ##123743.

A pattern that starts /.*/ has a fake MBOL or SBOL flag added, along
with PREGf_IMPLICIT.

The idea is that, with /.*.../s, if the NFA don't match when started at
pos 0, then it's not going to match if started at any other position
either; while /.*.../ won't match at any other start position up until
the next \n.

However, the branch in regexec() that implemented this was a bit a mess
(like much in the perl core, it had gradually accreted), and caused
intuit-enabled /.*.../ and /.*...patterns to go quadratic.

The branch looked roughly like:

    if (anchored) {
        if (regtry(s)) goto success;
        if (can_intuit) {
            while (s < end) {
                s = intuit(s+1);
                if (!s) goto fail;
                if (regtry(s)) goto success;
            }
        }
        else {
            while (s < end) {
                s = skip_to_next_newline(s);
                if (regtry(s)) goto success;
            }
        }
    }

The problem is that in the presence of a .* at the start of the pattern,
intuit() will always return either NULL on failure, or the start position,
rather than any later position. So the can_intuit branch above calls
regtry() on every character position.

This commit fixes this by changing the structure of the code to be like
this, where it only tries things on newline boundaries:

    if (anchored) {
        if (regtry(s)) goto success;
        while (1) {
            s = skip_to_next_newline(s);
            if (can_intuit) {
                s = intuit(s+1);
                if (!s) goto fail;
            }
            if (regtry(s)) goto success;
        }
    }

This makes the code a lot simpler, and mostly avoids quadratic behaviour
(you can still get it with a string consisting mainly of newlines).

(backported for 5.20.1 by Niko Tyni <nt...@debian.org>)

Bug: https://rt.perl.org/Public/Bug/Display.html?id=123743
Bug-Debian: https://bugs.debian.org/777556
Origin: http://perl5.git.perl.org/perl.git/commitdiff/0fa70a06a98fc8fa9840d4dbaa31fc2d3b28b99b
Patch-Name: fixes/regexp-performance.diff
---
 regexec.c | 124 +++++++++++++++++++++++---------------------------------------
 1 file changed, 45 insertions(+), 79 deletions(-)

diff --git a/regexec.c b/regexec.c
index 362390b..4b66a94 100644
--- a/regexec.c
+++ b/regexec.c
@@ -755,7 +755,7 @@ Perl_re_intuit_start(pTHX_
 
         /* ml_anch: check after \n?
          *
-         * A note about IMPLICIT: on an un-anchored pattern beginning
+         * A note about PREGf_IMPLICIT: on an un-anchored pattern beginning
          * with /.*.../, these flags will have been added by the
          * compiler:
          *   /.*abc/, /.*abc/m:  PREGf_IMPLICIT | PREGf_ANCH_MBOL
@@ -2682,86 +2682,52 @@ Perl_regexec_flags(pTHX_ REGEXP * const rx, char *stringarg, char *strend,
 	));
     }
 
-    /* Simplest case:  anchored match need be tried only once. */
-    /*  [unless only anchor is BOL and multiline is set] */
+    /* Simplest case: anchored match need be tried only once, or with
+     * MBOL, only at the beginning of each line.
+     *
+     * Note that /.*.../ sets PREGf_IMPLICIT|MBOL, while /.*.../s sets
+     * PREGf_IMPLICIT|SBOL. The idea is that with /.*.../s, if it doesn't
+     * match at the start of the string then it won't match anywhere else
+     * either; while with /.*.../, if it doesn't match at the beginning,
+     * the earliest it could match is at the start of the next line */
+
     if (prog->intflags & (PREGf_ANCH & ~PREGf_ANCH_GPOS)) {
-	if (s == startpos && regtry(reginfo, &s))
+        char *end;
+
+	if (regtry(reginfo, &s))
 	    goto got_it;
-        else if (multiline || (prog->intflags & (PREGf_IMPLICIT | PREGf_ANCH_MBOL))) /* XXXX SBOL? */
-	{
-	    char *end;
-
-	    if (minlen)
-		dontbother = minlen - 1;
-	    end = HOP3c(strend, -dontbother, strbeg) - 1;
-	    /* for multiline we only have to try after newlines */
-	    if (prog->check_substr || prog->check_utf8) {
-                /* because of the goto we can not easily reuse the macros for bifurcating the
-                   unicode/non-unicode match modes here like we do elsewhere - demerphq */
-                if (utf8_target) {
-                    if (s == startpos)
-                        goto after_try_utf8;
-                    while (1) {
-                        if (regtry(reginfo, &s)) {
-                            goto got_it;
-                        }
-                      after_try_utf8:
-                        if (s > end) {
-                            goto phooey;
-                        }
-                        if (prog->extflags & RXf_USE_INTUIT) {
-                            s = re_intuit_start(rx, sv, strbeg,
-                                    s + UTF8SKIP(s), strend, flags, NULL);
-                            if (!s) {
-                                goto phooey;
-                            }
-                        }
-                        else {
-                            s += UTF8SKIP(s);
-                        }
-                    }
-                } /* end search for check string in unicode */
-                else {
-                    if (s == startpos) {
-                        goto after_try_latin;
-                    }
-                    while (1) {
-                        if (regtry(reginfo, &s)) {
-                            goto got_it;
-                        }
-                      after_try_latin:
-                        if (s > end) {
-                            goto phooey;
-                        }
-                        if (prog->extflags & RXf_USE_INTUIT) {
-                            s = re_intuit_start(rx, sv, strbeg,
-                                        s + 1, strend, flags, NULL);
-                            if (!s) {
-                                goto phooey;
-                            }
-                        }
-                        else {
-                            s++;
-                        }
-                    }
-                } /* end search for check string in latin*/
-	    } /* end search for check string */
-	    else { /* search for newline */
-		if (s > startpos) {
-                    /*XXX: The s-- is almost definitely wrong here under unicode - demeprhq*/
-		    s--;
-		}
-		/* We can use a more efficient search as newlines are the same in unicode as they are in latin */
-		while (s <= end) { /* note it could be possible to match at the end of the string */
-		    if (*s++ == '\n') {	/* don't need PL_utf8skip here */
-			if (regtry(reginfo, &s))
-			    goto got_it;
-		    }
-		}
-	    } /* end search for newline */
-	} /* end anchored/multiline check string search */
-	goto phooey;
-    } else if (prog->intflags & PREGf_ANCH_GPOS)
+
+        if (!(prog->intflags & PREGf_ANCH_MBOL))
+            goto phooey;
+
+        /* didn't match at start, try at other newline positions */
+
+        if (minlen)
+            dontbother = minlen - 1;
+        end = HOP3c(strend, -dontbother, strbeg) - 1;
+
+        /* skip to next newline */
+
+        while (s <= end) { /* note it could be possible to match at the end of the string */
+            /* NB: newlines are the same in unicode as they are in latin */
+            if (*s++ != '\n')
+                continue;
+            if (prog->check_substr || prog->check_utf8) {
+            /* note that with PREGf_IMPLICIT, intuit can only fail
+             * or return the start position, so it's of limited utility.
+             * Nevertheless, I made the decision that the potential for
+             * quick fail was still worth it - DAPM */
+                s = re_intuit_start(rx, sv, strbeg, s, strend, flags, NULL);
+                if (!s)
+                    goto phooey;
+            }
+            if (regtry(reginfo, &s))
+                goto got_it;
+        }
+        goto phooey;
+    } /* end anchored search */
+
+    if (prog->intflags & PREGf_ANCH_GPOS)
     {
         /* PREGf_ANCH_GPOS should never be true if PREGf_GPOS_SEEN is not true */
         assert(prog->intflags & PREGf_GPOS_SEEN);

Bug#777556: perl: regexp performance regression since 5.18

Reply via email to