Re: \{n\} are not recognized as repetition counter in regular expressions.

Eric Blake Mon, 07 Apr 2025 13:09:41 -0700

On Mon, Apr 07, 2025 at 10:51:26AM -0400, Zack Weinberg wrote:
> On Fri, Apr 4, 2025, at 3:56 PM, Eric Blake wrote:
> …
> > tl;dr: if I do add intervals to m4 regex, would you rather it be \{\}
> > (BRE and emacs style) or {} (ERE style)?  And how to avoid breaking
> > existing m4 scripts?
> 
> (Note: replying to chunks of your message out of original order.)
> 
> With my autoconf hat on: The *safest* thing to do, I think, would be
> to leave the existing regex syntax strictly alone until a mechanism
> for specifying the regex syntax on a per-regex basis is available.
> 
> If any changes are made to the existing syntax, I would strongly
> prefer that it be harmonized with POSIX BREs (i.e. ( + ? | { are
> all literals, \( \+ \? \| \{ are all operators) rather than with Emacs…
> 
> > For some contrast, in BRE (POSIX basic regular expression), all of (,
> > +, ?, |, and { are literals, while \( is grouping, \{ is intervals, and
> > \+, \?, and \| are up to the implementation on whether they are
> > literal or meta.  In ERE (POSIX extended regular expression), all of
> > (, +, ?, |, and { are operators, while \(, \|, \?, \+ and \{ are
> > literals (as seen in 'grep' vs. 'grep -E').  The all-or-none factor is
> > a convenience to remember; anything else feels like a disservice to
> > users, if the practice is not already long-standing.
> 
> …because of this.  As a user of BREs, EREs, *and* emacs regexps for
> going on thirty years now, emacs regexps are the worst of the three,
> because they don’t have the all-or-none characteristic.  I wind up
> avoiding using + and ? at all in emacs regexps—neither as literals nor
> as operators—because I find it too difficult to remember which way
> around they work.


That resonates with me.  It is always a pain to figure out "was my
failure to match anything because I typed the regex right and nothing
matched, or because I typo'd it wrong by adding or forgetting \ to the
point that the regex looked for something completely different from my
desires"; and then have to write another regex or two until finding
something that DOES match to remind myself of which spelling works,
before rewriting the original intended regex.
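
As an illustration of the all-or-none property discussed above, here
is a minimal Python sketch (not m4 code; the helper name swap_bre_ere
is made up for this example).  It assumes the GNU BRE flavor, where
\+, \? and \| are operators, and it deliberately ignores bracket
expressions and back-references.  Under those assumptions, moving
between the BRE and ERE spelling of a pattern is nothing more than
toggling the backslash on each of ( ) { } + ? |:

```python
import re

# Characters whose meaning flips between BRE and ERE when a backslash is added.
TOGGLED = set("(){}+?|")

def swap_bre_ere(pattern: str) -> str:
    # Toggle the backslash on every character in TOGGLED.
    # Simplification: bracket expressions and back-references are not handled.
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # An escaped toggled char loses its backslash; other escapes pass through.
            out.append(nxt if nxt in TOGGLED else ch + nxt)
            i += 2
        elif ch in TOGGLED:
            out.append("\\" + ch)  # a bare toggled char gains a backslash
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

bre = r"\(ab\)\{2\}c+"           # BRE: group, interval, then a literal "+"
ere = swap_bre_ere(bre)           # the same pattern in ERE spelling
assert ere == r"(ab){2}c\+"
# Python's re uses ERE-style metacharacters, so it can run the ERE form.
assert re.fullmatch(ere, "ababc+")
assert swap_bre_ere(ere) == bre   # the swap is its own inverse
```

A mixed syntax such as the current m4 one (\( for grouping but bare +
as an operator) has no such one-rule conversion, which is one way to
see why it is harder to remember.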

> 
> Also, in Autoconf source code, where m4 is being used to generate a
> shell script, m4 regexps may well appear right next to regexps
> intended for use by grep, sed, awk, etc.  Therefore it is desirable
> for m4 regexps to match the regexp syntax used by those tools, which
> is usually either POSIX BRE or ERE.

Interesting point, and one where diverging from emacs may make the
most sense.  Too bad we already have \( but + in m4 regex (that mix
matches emacs, but like you say is harder to remember).

> 
> > One of the ideas on the m4 2.0 branch (which is nowhere near usable)
> > was to let users control the regex syntax at runtime, rather than
> > hard-coding to a single syntax, and then expanding the manual to
> > document the different syntax choices possible.  In addition to
> > choosing which characters are meta, there are per-match tweaks that
> > can be useful, such as the ability to choose whether newline or NUL
> > match ".", or whether the match is case-insensitive.
> 
> Autoconf’s M4sugar layer anticipates this change; that’s why its
> prefixed names for the `regexp` and `patsubst` builtins are
> `m4_bregexp` and `m4_bpatsubst`.  Long term I think it is desirable,
> but not as much as some of the other changes that have been stacked
> up for over a decade on various M4 development branches, like the
> linear-time $@ recursion work.
> 
> My initial reaction to your message was that backward compatibility
> would require the hypothetical M4 2.0 to give us _new_ builtins for
> EREs, perhaps `eregexp` and `epatsubst`.  While writing _this_ message
> it occurred to me that one could instead use flags embedded at the
> beginning of the regex itself, like Perl does with (?i) for case
> insensitive, (?x) for “expanded” notation (unescaped whitespace is not
> significant), etc.  (Stealing a subset of the Perl (?…) extensions
> would be worthwhile _anyway_.)  But the trouble with that idea is,
> right now “regexp(haystack, `(?i)NEEDLE’)” does a literal match.
> We can’t break that either!  So maybe we _should_ have `eregexp`
> and `epatsubst` as the first stepping stone away from the old syntax.
> In ERE with no extensions ‘(?’ is a syntax error, so it’s a safe
> extension point.

The current tentative plan in the m4 2.0 branch
was to add a new fourth argument, as in:

patsubst(haystack, needle, replacement, syntax)

where syntax (if present) has to be a recognized literal (such as
"emacs", "bre", "ere").  A single builtin like that could power both
m4_bpatsubst (accept three parameters but call the underlying builtin
with the fourth slammed to "bre") and m4_epatsubst (accept three
parameters but call the underlying with the fourth slammed to "ere").

But your idea of a leading sequence IN the needle parameter (rather
than a fourth parameter) is clever.  I agree that in ERE, leading "(?"
is an invalid sequence and therefore a great way to introduce flags
(so borrowing that idea of Perl is good to remember for down the
road).  I also agree that we can't use "(\?" to introduce flags in BRE
(that legitimately means either an optional literal "(", or a literal
"(?", depending on whether \? is literal or meta in that flavor of
BRE); but we could still get away with "\(\?flags\)" for symmetry with
the ERE spelling "(?flags)"; or maybe just go with "\?flag" (with
multiple flags spelled "\?i\?xREAL" if you want both the i and x flag
on BRE REAL).
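
A rough Python sketch of the leading-flags idea, for illustration only
(this is not m4 code and not a proposed implementation; split_flags
is a hypothetical helper).  It shows how an ERE-style "(?flags)"
prefix could be peeled off the needle before the rest is compiled, and
that Python's own re module already accepts the same Perl-style
prefixes:

```python
import re

def split_flags(needle):
    """Split a leading "(?flags)" group off an ERE-style needle.

    Returns (set of flag letters, remaining pattern).  Hypothetical helper.
    """
    m = re.match(r"\(\?([a-z]+)\)", needle)
    if m is None:
        return set(), needle
    return set(m.group(1)), needle[m.end():]

# The prefix is removed before the rest is handed to the regex engine.
assert split_flags("(?ix)NEEDLE") == ({"i", "x"}, "NEEDLE")
assert split_flags("NEEDLE") == (set(), "NEEDLE")

# Python's re understands the same Perl-style prefixes natively:
assert re.search(r"(?i)needle", "hay NEEDLE hay").start() == 4
assert re.fullmatch(r"(?x) a b c", "abc")  # whitespace in the pattern is ignored
```

The sketch only covers the ERE spelling; a BRE spelling such as
"\(\?flags\)" would need its own prefix pattern.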

> 
> (I would not be sad if I got PCRE regexps in M4 2.0 ;-)

And since m4 2.0 also wants to support loadable modules, it would be
possible to load a PCRE module to add known literals accepted for the
fourth argument.

But whether this should be done with one builtin that must be wrapped
into convenient spellings, or with multiple builtins, is a question
for another day (as m4 2.0 is in worse shape than 1.6).

> 
> >> > BTW, the fixed m4 passes 247 tests, skips 20 tests, and fails no tests
> >> > on my system. Original m4 shows exactly the same results.
> >
> > That only means the testsuite does not cover the current behavior of
> > both spellings, { and \{, being literals.  It doesn't tell you how
> > many other scripts might break.  So I at least tried to find possible
> > problems.
> 
> Thanks for looking.  It’s a risky change even with all the issues in
> Autotools proper addressed, because there is _so much_ badly written
> third-party autoconf scripting out there.  Once we have a coherent
> set of patches for m4+autotools I think we will need to ask for a
> test archive rebuild on one of the big Linux distributions, and
> based on how that goes we might need to scrap the idea.

At a minimum, being able to opt in to a warning on whatever syntax we
DON'T want to be literal (while still behaving as a literal for both
syntaxes), and seeing where the warning fires, will go a long way
towards getting a consistent code base, but I don't have any delusions
of it being fast (as you say, there's a LOT of poorly-written
third-party code out there, and they won't all change at the same
speed as autoconf).

> 
> If you send a fully worked out patch for M4 1.4.x to the
> autoconf-patches mailing list I will undertake to run the Autoconf and
> Automake test suites with that patch applied (and something done about
> the issues you already found).  Libtool will also need testing but I
> do not remember whether it has a very good testsuite itself.
> 
> > At BEST, all I could do for m4 1.4.20 is to add a new tri-state
> > command-line switch; default is existing behavior (both spellings are
> > silently literals), a warning mode (add a warning if the \{ spelling
> > is encountered, but still treat it as a literal), or enabled (\{ works
> > for intervals, as desired).  Autoconf would need to opt-in to the new
> > warning asap, and downstream distros would have to scrub where the
> > warning appears.  Realistically, it would probably take several years
> > before we could turn on intervals by default.  (Sadly, once software
> > is stable, it becomes glacial to improve it).  But being able to
> > opt-in to intervals in the short-term would be nice.
> 
> If the behavior of regexp and patsubst is changed at all, even with a
> command line option that’s off by default, there needs to be a
> documented way to probe what syntax a script is getting, from *inside*
> that script.  Autoconf is supposed to *not* need to be rebuilt for a
> new version of M4, and it doesn’t look easy to make either of the
> troublesome regexes you mentioned

Yeah, it becomes imperative that any new feature is fully tested to
work with:

old m4, old autoconf (no awareness of feature, nothing changes)

old m4, new autoconf (autoconf probes that feature not available, only
uses old syntax)

new m4, old autoconf (autoconf doesn't know to opt-in to feature, m4
only uses old behavior)

new m4, new autoconf (autoconf probes that feature is available, opts
in, m4 gives warning and/or exposes alternate syntax)

And relying on a configure-time test of what m4 supports on the
packager's machine is not necessarily going to work when autoconf is
run on a developer's machine with a different version of m4.  Which in
turn implies that it is desirable to be able to probe at runtime what
is supported, rather than being limited to a command-line switch.  But
while it is easy to write a runtime probe on whether "regexp([{],
[\{1\}])" results in -1 (old m4, ergo \{ is a literal) or 0 (m4 that
has enabled repetition operator semantics), it doesn't work if doing
the probe itself triggers a warning when you have only opted in to the
portability diagnosis rather than the new semantics.

Adding a new builtin is easy to probe (ifdef can tell you if the
builtin exists or not), but I've been historically reluctant to add
new builtins to m4 1.4.x for fear of breaking backwards-compatibility.
Then again, even if existing code wrote "define(foo, ...)" (rather
than the safer "define(`foo', ...)"), assuming "foo" was not a builtin,
but newer m4 makes "foo" a builtin with semantics that when invoked
without args it outputs a quoted version of itself (many of GNU m4's
builtins already behave in this manner), you won't notice a change in
behavior (your script will still get your version of "foo" rather than
the new builtin), only that you will confuse anyone reading your
script who expected the new builtin's semantics.  And NEWS says I
_did_ add the "mkstemp" builtin in m4 1.4.8 based on POSIX, so it is
not unheard of for 1.4.x to gain builtins.

Other thoughts on how to make the warning opt-in?  Existing m4 has the
debugmode() builtin that can add or subtract flags at runtime (and
whether to warn about a regex being non-portable certainly _sounds_
like it is worth a debugmode flag).  But how does one probe whether it
supports new flags?  With m4 1.4.19, there is no way for a running
script to learn what the current flags are, but only to change things
to new flags.  What's more, m4 1.4.19 is noisy on any unrecognized
flag, so even if we taught m4 1.4.20 to support "debugmode(+r)" as a
way to turn ON warnings for non-portable regex, m4 1.4.19 would warn
about the unknown flag, defeating the point.

Hmm - I wonder - what would happen if I changed the semantics of
"debugmode(+)" and "debugmode(-)" from their current semantics of
being synonyms for "debugmode(+aeq)" and "debugmode(-aeq)" both with a
side effect but no output, to instead being a way to query the set of
all known flags (+) or the set of currently active flags (-), with
output as a quoted string (can't trigger accidental macro expansion if
the current set of flags happens to match a macro name) and no side
effect?  (I would prefer a new "debugmode(?)" for confessing current
flags - but once again, that warns in m4 1.4.19).  That would at least
be something you can play with - once you know whether a flag "r" is
supported, you can turn warnings about syntax in that regex on or off
as desired around any point of call with a regex.

Autoconf 2.72 starts m4 with "--debug=aflq" (bin/autom4te.in), and, at
runtime, if __m4_version__ is defined (not present in 1.4.19 but will
be present in 1.6, and maybe in 1.4.20 if we think it makes sense),
runs "debugmode(+do)" to turn on two additional new warning categories
already present in branch-1.6 (where I'm thinking of making "r" yet
another category).  Basing a decision on the presence or absence of
"__m4_version__" is brittle (after all, the autoconf philosophy is
"check features, not versions"), but not impossible.  And even if you
DON'T gate things on the existence of "__m4_version__", it might just
work to have the following in m4sugar:

m4_if(m4_debugmode([+]), [], [m4_debugmode([+aq])], [m4_debugmode([+r])])

or even:

m4_if(m4_index(m4_debugmode([+]), [r]), [-1], [m4_debugmode([+aq])],
  [m4_debugmode([+r])])

With m4 1.4.19, that has no output but the side-effect of disabling
aeq (e was already disabled on the command line); and so the third
parameter to m4_if just re-enables aq back to what was wanted, all
without any warnings.  But with the m4 1.4.20 proposal, by outputting
all the flags, m4_index will return a non-negative offset to "r" in that
string as your witness that you can then request to turn on the "r"
flag.
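
The decision made by the m4_index form can be modeled outside m4.  A
minimal Python sketch, assuming (hypothetically) that a future
"debugmode(+)" outputs its known flags as a string while m4 1.4.19
outputs nothing; the function name and the sample flag strings are
illustrative only, not actual m4 output:

```python
def debugmode_request(known_flags: str) -> str:
    """Return the debugmode argument the m4_index-based m4_if would issue."""
    # str.find, like m4's index builtin, yields -1 when the substring is absent.
    if known_flags.find("r") == -1:
        return "+aq"   # no "r" flag reported: restore the flags autoconf wanted
    return "+r"        # "r" is supported: opt in to the regex warning

assert debugmode_request("") == "+aq"       # m4 1.4.19: no output at all
assert debugmode_request("aeq") == "+aq"    # hypothetical m4 that lacks "r"
assert debugmode_request("aeqr") == "+r"    # hypothetical list that includes "r"
assert debugmode_request("r") == "+r"       # offset 0 still counts as found
```

Comparing against -1 rather than testing for a non-zero value matters
in the last case, where "r" would sit at offset 0.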

> 
> > lib/autoconf/general.m4:[m4_if(m4_bregexp([$1],
> > [#\|\\\|`\|\(\$\|@S|@\)\((|{|@{:@\)]), [-1],
> 
> > lib/m4sugar/m4sugar.m4:
> > [@\(\(<:\|:>\|S|\|%:\|\{:\|:\}\)\(@\)\|&t@\)],
> 
> be *indifferent* to whether { or \{ is a literal or an operator.
> 
> > [*]At any rate, I think that's a bug in autoconf (hence why I added
> > them in cc).  If the _intent_ was to match literal "#", "\", "`", and
> > the sequences "$(" and "${" (either as literals or via quadrigraphs),
> > it isn't quite doing that.
> 
> I don’t know what the intent of this was and figuring it out is going
> to take more digging than I have time for today.  Would you mind
> filing a bug report in Savannah about it so we don’t forget?

Done: https://savannah.gnu.org/support/index.php?111221

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
