Thank you so much Geoff for that very useful knowledge dump! Good callout on the .*; I realized I had carried them over too when I copy-pasted the regex from the pure VCL example (where they're needed) to the vmod one.
And so, just to be clear about it:
- vmod-re is based on libpcre2
- vmod-re2 is based on libre2

Correct? I see no way I'm going to misremember that, at all :-D

--
Guillaume Quintard

On Fri, Sep 1, 2023 at 7:47 AM Geoff Simmons <[email protected]> wrote:

> Sorry, I get nerdy about this subject and can't help following up.
>
> I said:
>
> > - pcre2 regex matching is generally faster than re2 matching. The point
> > of re2 regexen is that matches won't go into catastrophic backtracking
> > on pathological cases.
>
> Should have mentioned that pcre2 is even better at subexpression
> capture, which is what the OP's question is all about.
>
> > sub vcl_init {
> > new query_pattern = re.regex(".*(q=)(.*?)(\&|$).*");
> > }
>
> OMG no. Like this please:
>
> new query_pattern = re.regex("\b(q=)(.*?)(?:\&|$)");
>
> I have sent an example of a pcre regex with .* (two of them!) to a
> public mailing list, for which I will burn in hell.
>
> To match a name-value pair in a cookie, use a regex with \b for 'word
> boundary' in front of the name. That way it will match either at the
> beginning of the Cookie value, or following an ampersand.
>
> And ?: tells pcre not to bother capturing the last expression in
> parentheses (they're just for grouping).
>
> Avoid .* in pcre regexen if you possibly can. You can, almost always.
>
> With .* at the beginning, the pcre matcher searches all the way to the
> end of the string, and then backtracks all the way back, looking for the
> first letter to match. In this case 'q', and it will stop and search and
> backtrack at any other 'q' that it may find while working backwards.
>
> pcre2 fortunately has an optimization that ignores a trailing .* if it
> has found a match up until there, so that it doesn't busily match the
> dot against every character left in the string. So this time .* does no
> harm, but it's superfluous, and violates the golden rule of pcre: avoid
> .* if at all possible.
>
> Incidentally, this is an area where re2 does have an advantage over
> pcre2. The efficiency of pcre2 matching depends crucially on how you
> write the regex, because details like \b instead of .* give it hints for
> pruning the search. While re2 matching usually isn't as fast as pcre2
> matching against well-written patterns, re2 doesn't depend so much on
> that sort of thing.
>
>
> OK I can chill now,
> Geoff
> --
> ** * * UPLEX - Nils Goroll Systemoptimierung
>
> Scheffelstraße 32
> 22301 Hamburg
>
> Tel +49 40 2880 5731
> Mob +49 176 636 90917
> Fax +49 40 42949753
>
> http://uplex.de
>
> _______________________________________________
> varnish-misc mailing list
> [email protected]
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
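
PS: for anyone who wants to see what the two patterns quoted above actually capture, here is a minimal sketch. It is an assumption-laden stand-in, not the vmod: it uses Python's re module (a backtracking engine, but not pcre2), so it only illustrates the capture groups and says nothing about pcre2's speed or optimizations. The helper name extract_q and the sample strings are made up for illustration.

```python
import re

# Patterns copied verbatim from the thread. Python's `re` is only a stand-in
# for pcre2 here: it shows what the groups capture, not how fast pcre2 is.
OLD_PATTERN = re.compile(r".*(q=)(.*?)(\&|$).*")
NEW_PATTERN = re.compile(r"\b(q=)(.*?)(?:\&|$)")


def extract_q(pattern, text):
    """Return group 2 (the value following 'q='), or None if nothing matches.

    Hypothetical helper, not part of vmod-re or vmod-re2.
    """
    match = pattern.search(text)
    return match.group(2) if match else None


if __name__ == "__main__":
    # Hypothetical sample inputs; the last one has a second name ending in 'q'.
    for sample in ("q=varnish&lang=en", "lang=en&q=varnish", "q=1&seq=2"):
        print(sample, extract_q(OLD_PATTERN, sample), extract_q(NEW_PATTERN, sample))
```

On the first two samples both patterns capture "varnish". On "q=1&seq=2" they differ: the leading greedy .* makes the old pattern latch onto the last "q=" (the one inside "seq="), while \b anchors the new pattern to a name that really starts with q. The new pattern also has two capture groups instead of three, because of ?:.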
