Re: \{n\} are not recognized as repetition counter in regular expressions.

Zack Weinberg Mon, 07 Apr 2025 08:28:35 -0700

On Fri, Apr 4, 2025, at 3:56 PM, Eric Blake wrote:
…
> tl;dr: if I do add intervals to m4 regex, would you rather it be \{\}
> (BRE and emacs style) or {} (ERE style)?  And how to avoid breaking
> existing m4 scripts?


(Note: replying to chunks of your message out of original order.)

With my autoconf hat on: The *safest* thing to do, I think, would be
to leave the existing regex syntax strictly alone until a mechanism
for specifying the regex syntax on a per-regex basis is available.

If any changes are made to the existing syntax, I would strongly
prefer that it be harmonized with POSIX BREs (i.e. ( + ? | { are
all literals, \( \+ \? \| \{ are all operators) rather than with Emacs…

> For some contrast, in BRE (POSIX basic regular expression), all of (,
> +, ?, |, and { are literals, while \( is grouping, \{ is intervals, and
> \+, \?, and \| are up to the implementation on whether they are
> literal or meta.  In ERE (POSIX extended regular expression), all of
> (, +, ?, |, and { are operators, while \(, \+, \?, \| and \{ are
> literals (as seen in 'grep' vs. 'grep -E').  The all-or-none factor is
> a convenience to remember; anything else feels like a disservice to
> users, if the practice is not already long-standing.

…because of this.  I have used BREs, EREs, *and* emacs regexps for
going on thirty years now, and emacs regexps are the worst of the
three, because they don’t have the all-or-none characteristic.  I wind
up avoiding + and ? altogether in emacs regexps, both as literals and
as operators, because I find it too difficult to remember which way
around they work.

Also, in Autoconf source code, where m4 is being used to generate a
shell script, m4 regexps may well appear right next to regexps
intended for use by grep, sed, awk, etc.  Therefore it is desirable
for m4 regexps to match the regexp syntax used by those tools, which
is usually either POSIX BRE or ERE.
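
To make the all-or-none relationship concrete: under the reading
above, a BRE and its ERE counterpart differ only in whether
( ) { } + ? | carry a backslash, so converting between them is
mechanical.  The following is a minimal sketch in Python, not anything
m4 or Autoconf provides.  `toggle_bre_ere` is a hypothetical helper
that ignores bracket expressions and context-dependent rules, and
Python's `re` module stands in for an ERE engine only because it
treats these characters as operators when unescaped.

```python
import re

# Characters whose escaped/unescaped meaning flips between the two
# syntaxes, under the "all-or-none" reading discussed above.
TOGGLED = set("(){}+?|")


def toggle_bre_ere(pattern: str) -> str:
    """Swap the escaped and unescaped forms of ( ) { } + ? |.

    Hypothetical helper for illustration only: bracket expressions and
    context-dependent rules (such as a leading *) are not handled.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( becomes (, while other escapes such as \. pass through.
            out.append(nxt if nxt in TOGGLED else ch + nxt)
            i += 2
        elif ch in TOGGLED:
            # A bare ( becomes \(, flipping its meaning.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


bre = r"^\(ab\)\{2\}$"          # BRE spelling: group, then interval
ere = toggle_bre_ere(bre)       # the same regex in ERE spelling
literal = toggle_bre_ere("a{2}")  # braces that are literal in BRE

print(ere)
print(re.search(ere, "abab") is not None)
print(re.search(literal, "a{2}") is not None)
```

Applying the function twice returns the original pattern, which is
the practical payoff of the all-or-none property: neither spelling
needs a per-character lookup table.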

> One of the ideas on the m4 2.0 branch (which is nowhere near usable)
> was to let users control the regex syntax at runtime, rather than
> hard-coding to a single syntax, and then expanding the manual to
> document the different syntax choices possible.  In addition to
> choosing which characters are meta, there are per-match tweaks that
> can be useful, such as the ability to choose whether newline or NUL
> match ".", or whether the match is case-insensitive.

Autoconf’s M4sugar layer anticipates this change; that’s why its
prefixed names for the `regexp` and `patsubst` builtins are
`m4_bregexp` and `m4_bpatsubst`.  Long term I think it is desirable,
but not as much as some of the other changes that have been stacked
up for over a decade on various M4 development branches, like the
linear-time $@ recursion work.

My initial reaction to your message was that backward compatibility
would require the hypothetical M4 2.0 to give us _new_ builtins for
EREs, perhaps `eregexp` and `epatsubst`.  While writing _this_ message
it occurred to me that one could instead use flags embedded at the
beginning of the regex itself, like Perl does with (?i) for case
insensitive, (?x) for “expanded” notation (unescaped whitespace is not
significant), etc.  (Stealing a subset of the Perl (?…) extensions
would be worthwhile _anyway_.)  But the trouble with that idea is,
right now “regexp(haystack, `(?i)NEEDLE')” does a literal match.
We can’t break that either!  So maybe we _should_ have `eregexp`
and `epatsubst` as the first stepping stone away from the old syntax.
In ERE with no extensions ‘(?’ is a syntax error, so it’s a safe
extension point.
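
For readers who have not used the Perl-style embedded flags, here is
a small sketch of how they behave and of the compatibility hazard
described above.  It uses Python's `re` module, which supports (?i)
and (?x) at the start of a pattern, purely as a stand-in; nothing
here reflects current or planned m4 behavior, and the sample strings
are made up.

```python
import re

haystack = "find the needle here"

# (?i) at the start makes the whole pattern case-insensitive.
flagged = re.search(r"(?i)NEEDLE", haystack)

# Without the flag, the same pattern is case-sensitive and fails.
plain = re.search(r"NEEDLE", haystack)

# (?x) makes unescaped whitespace in the pattern insignificant.
expanded = re.search(r"(?x) n e e d l e", haystack)

# The hazard: a script that means the literal text "(?i)NEEDLE".
text = "the string (?i)NEEDLE appears verbatim"
as_literal = re.search(re.escape("(?i)NEEDLE"), text)  # literal reading
as_flag = re.search(r"(?i)NEEDLE", text)               # flag reading

print(flagged.group(), plain, expanded.group())
print(as_literal.group(), as_flag.group())
```

The last two searches find different substrings in the same text,
which is exactly the silent change in meaning an existing script
would see if the flag syntax were switched on under the old builtin
names.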

(I would not be sad if I got PCRE regexps in M4 2.0 ;-)

>> > BTW, the fixed m4 passes 247 tests, skips 20 tests, and fails no tests
>> > on my system. Original m4 shows exactly the same results.
>
> That only means the testsuite does not cover both spellings of { and
> \{ to the current behavior of being literals.  It doesn't tell you how
> many other scripts might break.  So I at least tried to find possible
> problems.

Thanks for looking.  It’s a risky change even with all the issues in
Autotools proper addressed, because there is _so much_ badly written
third-party autoconf scripting out there.  Once we have a coherent
set of patches for m4+autotools I think we will need to ask for a
test archive rebuild on one of the big Linux distributions, and
based on how that goes we might need to scrap the idea.

If you send a fully worked out patch for M4 1.4.x to the
autoconf-patches mailing list I will undertake to run the Autoconf and
Automake test suites with that patch applied (and something done about
the issues you already found).  Libtool will also need testing but I
do not remember whether it has a very good testsuite itself.

> At BEST, all I could do for m4 1.4.20 is to add a new tri-state
> command-line switch; default is existing behavior (both spellings are
> silently literals), a warning mode (add a warning if the \{ spelling
> is encountered, but still treat it as a literal), or enabled (\{ works
> for intervals, as desired).  Autoconf would need to opt-in to the new
> warning asap, and downstream distros would have to scrub where the
> warning appears.  Realistically, it would probably take several years
> before we could turn on intervals by default.  (Sadly, once software
> is stable, it becomes glacial to improve it).  But being able to
> opt-in to intervals in the short-term would be nice.

If the behavior of regexp and patsubst is changed at all, even with a
command line option that’s off by default, there needs to be a
documented way to probe what syntax a script is getting, from *inside*
that script.  Autoconf is supposed to *not* need to be rebuilt for a
new version of M4, and it doesn’t look easy to make either of the
troublesome regexes you mentioned

> lib/autoconf/general.m4:[m4_if(m4_bregexp([$1],
> [#\|\\\|`\|\(\$\|@S|@\)\((|{|@{:@\)]), [-1],

> lib/m4sugar/m4sugar.m4:
> [@\(\(<:\|:>\|S|\|%:\|\{:\|:\}\)\(@\)\|&t@\)],

be *indifferent* to whether { or \{ is a literal or an operator.
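
One way such a probe could work is to match a fixed string against a
pattern whose meaning differs between the candidate syntaxes and
branch on the result.  The sketch below shows the idea in Python.
`probe_interval_syntax` and its `matches` adapter are hypothetical
names, Python's `re` is only a stand-in engine, and no documented
probe of this kind exists in m4 today.

```python
import re
from typing import Callable


def probe_interval_syntax(matches: Callable[[str, str], bool]) -> str:
    """Classify a regex engine by how it treats the two brace spellings.

    `matches(pattern, text)` is a hypothetical adapter around whatever
    engine is being probed; it returns True when the pattern matches.
    """
    if matches(r"^a\{2\}$", "aa"):
        return "backslash-brace intervals"  # BRE-style operator
    if matches(r"^a{2}$", "aa"):
        return "bare-brace intervals"       # ERE-style operator
    return "no intervals"                   # both spellings are literal


def python_re(pattern: str, text: str) -> bool:
    """Adapter for Python's re module, used here only as an example engine."""
    return re.search(pattern, text) is not None


result = probe_interval_syntax(python_re)
print(result)
```

An engine in which both spellings are literals, which is the existing
behavior described in the quoted proposal, would fall through to the
last branch.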

> [*]At any rate, I think that's a bug in autoconf (hence why I added
> them in cc).  If the _intent_ was to match literal "#", "\", "`", and
> the sequences "$(" and "${" (either as literals or via quadrigraphs),
> it isn't quite doing that.

I don’t know what the intent of this was and figuring it out is going
to take more digging than I have time for today.  Would you mind
filing a bug report in Savannah about it so we don’t forget?

zw
