Autoconf developers: see below for a bug report on _AC_DEFINE_UNQUOTED

Gnulib developers, maybe you have an opinion on why regex.h
documentation disagrees with reality?

tl;dr: if I do add intervals to m4 regex, would you rather it be \{\}
(BRE and emacs style) or {} (ERE style)?  And how to avoid breaking
existing m4 scripts?

On Fri, Apr 04, 2025 at 08:47:20AM -0500, Eric Blake wrote:
> On Fri, Nov 04, 2022 at 04:25:45AM +0300, Van de Bugger wrote:
> > M4 documentation for regular expressions is extremely short:
> > https://www.gnu.org/software/m4/manual/html_node/Regexp.html
> > No regular expression syntax is explained, it just refers to GNU Emacs
> > Manual. In turn, GNU Emacs Manual: 
> > https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html
> > states that \{n\} is repetition counter:
> > 
> > > For example, ‘x\{4\}’ matches the string ‘xxxx’ and nothing else. 
> > 
> > However, m4 recognizes neither \{n\} nor {n}.
> 
> This has bothered me too, over the years.
> 
> Even though it would be a new feature to enable \{\}, I can't see how
> portable GNU m4 programs would have been relying on that matching
> literal left curly brace.  I'm seriously thinking about turning this
> feature on for 1.4.20.

I'm waffling.  The comments in regex.h (from gnulib) clearly say that the
'emacs' mode of re_set_syntax (when re_syntax_options == 0) recognizes
bare + and ? as operators, that RE_INTERVALS==0 means both { and \{
are literals, and that RE_NO_BK_BRACES controls whether { or \{ is
magic but only when RE_INTERVALS is set.
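
To make that reading of regex.h concrete, here is a minimal sketch
against glibc's GNU regex entry points (re_set_syntax,
re_compile_pattern, re_match), which are only declared when
_GNU_SOURCE is defined.  The helper name full_match is made up for
illustration; it is not part of m4 or gnulib, and m4's own call
sequence may well differ.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE            /* exposes the GNU API in glibc's <regex.h> */
#endif
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Illustrative helper (not from m4): return 1 if PATTERN, compiled
   under SYNTAX, matches all of TEXT, 0 if it does not, and -1 if
   PATTERN fails to compile.  */
int
full_match (reg_syntax_t syntax, const char *pattern, const char *text)
{
  assert (pattern != NULL && text != NULL);

  struct re_pattern_buffer buf;
  memset (&buf, 0, sizeof buf);   /* no buffer, fastmap or translate table */

  /* The global syntax is consulted when the pattern is compiled, so
     set it just for this compilation and then restore it.  */
  reg_syntax_t saved = re_set_syntax (syntax);
  const char *err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (err != NULL)
    {
      regfree (&buf);
      return -1;
    }

  /* re_match anchors at offset 0 and returns the matched length.  */
  regoff_t len = (regoff_t) strlen (text);
  int matched = re_match (&buf, text, len, 0, NULL) == len;
  regfree (&buf);
  return matched;
}
```

With syntax 0, full_match (0, "x\\{4\\}", "xxxx") should be 0 and
full_match (0, "x\\{4\\}", "x{4}") should be 1; passing RE_INTERVALS
flips the first to 1, and RE_INTERVALS | RE_NO_BK_BRACES makes the
bare "x{4}" spelling the one that counts.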

But clearly, emacs supports \{ intervals, even though a grep of
emacs.git does not find any hits for RE_INTERVALS outside of
src/regex.[ch].  So either the comments in regex.h about 0 being emacs
syntax are wrong, or I'm totally missing how emacs supports intervals.

For some contrast, in BRE (POSIX basic regular expression), all of (,
+, ?, |, and { are literals, while \( is grouping, \{ is intervals, and
\+, \?, and \| are up to the implementation on whether they are
literal or meta.  In ERE (POSIX extended regular expression), all of
(, +, ?, |, and { are operators, while \(, \{, \?, \+ and \| are
literals (as seen in 'grep' vs. 'grep -E').  The all-or-none factor is
a convenience to remember; anything else feels like a disservice to
users, if the practice is not already long-standing.
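
The 'grep' vs. 'grep -E' contrast for braces can be reproduced with
plain POSIX regcomp(), as in the sketch below.  The helper name
posix_match is invented for this example, and the ERE handling of
"\}" as a literal is what I expect from glibc rather than something
every implementation is required to do.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Illustrative helper: return 1 if PATTERN matches somewhere in TEXT,
   0 if it does not, and -1 if PATTERN fails to compile.  CFLAGS is 0
   for BRE or REG_EXTENDED for ERE.  */
int
posix_match (const char *pattern, int cflags, const char *text)
{
  assert (pattern != NULL && text != NULL);

  regex_t re;
  if (regcomp (&re, pattern, cflags | REG_NOSUB) != 0)
    return -1;

  int matched = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}
```

Here posix_match ("^x\\{4\\}$", 0, "xxxx") is the BRE interval and
posix_match ("^x{4}$", REG_EXTENDED, "xxxx") is the ERE one; swapping
the flags makes each pattern match the literal text "x{4}" instead.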

emacs syntax is closer to BRE (in that grouping uses "\(" rather than
"("), but not quite (in that "+" rather than "\+" is meta).  And since
POSIX does not mandate regex in m4 at all, having GNU M4 be more like
emacs (rather than more like BRE or ERE) is a good goal.  The fact
that emacs uses "\{" for intervals works in our favor - since we
already require \( and \|, \{ feels more like BRE (although we do have
+ rather than \+).

One of the ideas on the m4 2.0 branch (which is nowhere near usable)
was to let users control the regex syntax at runtime, rather than
hard-coding to a single syntax, and then expanding the manual to
document the different syntax choices possible.  In addition to
choosing which characters are meta, there are per-match tweaks that
can be useful, such as the ability to choose whether newline or NUL
match ".", or whether the match is case-insensitive.

> 
> > I was able to build m4 from the sources. I found regcomp.c file with
> > the regular expression compiler. The regular expression syntax is
> > controlled by re_syntax_options file scope variable, which can be set
> > by re_set_syntax function. In my experiments, re_set_syntax is not
> > called, re_syntax_options is always zero, so braces are treated
> > literally.
> > 
> > If I initialize re_syntax_options to RE_INTERVALS, e. g.:
> > 
> > reg_syntax_t re_syntax_options = RE_INTERVALS;

Yes, setting that variable (or calling re_set_syntax()) would be how
to do it.

> > 
> > m4 recognizes \{n\} as repetition counter.
> > 
> > Thus, m4 is able to recognize \{n\} as repetition counter, but (for
> > unknown to me reason) this feature is disabled. I failed to trace it
> > further.
> > 
> > m4 manual does not document if this feature can be enabled or disabled
> > at build or run time, so I assume it should be enabled, as the \{n\}
> > construct is documented in GNU Emacs Manual, referred by GNU m4 Manual.
> > 
> > BTW, the fixed m4 passes 247 tests, skips 20 tests, and fails no tests
> > on my system. Original m4 shows exactly the same results.

That only means the testsuite does not cover either spelling, { or
\{, under the current behavior of being literals.  It doesn't tell you how
many other scripts might break.  So I at least tried to find possible
problems.

A quick grep of autoconf source finds at least:

lib/autoconf/general.m4:[m4_if(m4_bregexp([$1], [#\|\\\|`\|\(\$\|@S|@\)\((|{|@{:@\)]), [-1],

in the definition of _AC_DEFINE_UNQUOTED.  That regex is asking for a
literal "#", "\", "`", or the concatenation of [either "$" or the
quadrigraph "@S|@" (an alias for "$")] with [literal "(|{|"
concatenated with the quadrigraph @{:@ (for "(")].  Despite my HUH[*]?
factor, this appears to WANT to match "{" as a literal.  At any rate,
since it uses bare "{" with the next byte neither a digit nor ",",
it would break if we turn on "{}" for intervals by default.

[*]At any rate, I think that's a bug in autoconf (hence why I added
them in cc).  If the _intent_ was to match literal "#", "\", "`", and
the sequences "$(" and "${" (either as literals or via quadrigraphs),
it isn't quite doing that.  The correct spelling (with added
whitespace for clarity, although the whitespace would not be part of
the regex) would be

[ # \| \\ \| ` \| \( \$ \| @S|@ \) \( ( \| { \| @{:@ \) ]
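
A sketch of how the two spellings differ, again using glibc's GNU
regex API with _GNU_SOURCE.  The names current_re, fixed_re and
first_match are mine, and fixed_re takes the alternatives of the
second group to be "(", "{" and "@{:@", per the intent stated above.
The expectation that the current spelling stops compiling once bare
"{" is magic is based on my reading of glibc's regcomp.c.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE            /* exposes the GNU API in glibc's <regex.h> */
#endif
#include <assert.h>
#include <regex.h>
#include <string.h>

/* The regex as spelled in _AC_DEFINE_UNQUOTED, and the corrected
   spelling with the whitespace removed, both escaped for C.  */
const char *const current_re =
  "#\\|\\\\\\|`\\|\\(\\$\\|@S|@\\)\\((|{|@{:@\\)";
const char *const fixed_re =
  "#\\|\\\\\\|`\\|\\(\\$\\|@S|@\\)\\((\\|{\\|@{:@\\)";

/* Illustrative helper: return the offset of the first match of
   PATTERN in TEXT under SYNTAX, -1 if there is none, and -2 if
   PATTERN fails to compile.  */
int
first_match (reg_syntax_t syntax, const char *pattern, const char *text)
{
  assert (pattern != NULL && text != NULL);

  struct re_pattern_buffer buf;
  memset (&buf, 0, sizeof buf);

  reg_syntax_t saved = re_set_syntax (syntax);
  const char *err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (err != NULL)
    {
      regfree (&buf);
      return -2;
    }

  regoff_t len = (regoff_t) strlen (text);
  regoff_t pos = re_search (&buf, text, len, 0, len, NULL);
  regfree (&buf);
  return pos < 0 ? -1 : (int) pos;
}
```

Under syntax 0, first_match (0, current_re, "${FOO}") finds nothing
while first_match (0, fixed_re, "${FOO}") matches at offset 0.
Compiling current_re with RE_INTERVALS | RE_NO_BK_BRACES is where I
would expect the -2.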

But then there is this:

lib/m4sugar/m4sugar.m4:                        [@\(\(<:\|:>\|S|\|%:\|\{:\|:\}\)\(@\)\|&t@\)],

in the definition of _m4_qlen, which is "@" followed by either [ one
of ( "<:", ":>", "S|", "%:", "{:", ":}" ) followed by "@" ], or [
"&t@" ].  The spelling chosen there is done to differentiate between
most quadrigraphs (which have expanded length 1) and @&t@ (which has
expanded length 0).  At any rate, this clearly uses "\{" to be a
literal rather than an interval, and again, with the next byte neither
a digit nor a ",", it will fail to compile if we turn on \{\} for
intervals by default.
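
The length distinction that regex encodes can be sketched as follows,
compiled under today's syntax 0 where "\{" is a literal.  This only
illustrates the grouping (the third group is set for the
length-1 quadrigraphs and unset for @&t@); it is not a copy of how
_m4_qlen itself consumes the match, and quadrigraph_len is a name I
made up.

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE            /* exposes the GNU API in glibc's <regex.h> */
#endif
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative helper, not part of m4sugar: return 1 if TEXT starts
   with a quadrigraph that expands to one character, 0 if it starts
   with "@&t@", and -1 otherwise (or if the pattern fails to compile).  */
int
quadrigraph_len (const char *text)
{
  /* The _m4_qlen regex quoted above, escaped for a C string.  */
  static const char pattern[] =
    "@\\(\\(<:\\|:>\\|S|\\|%:\\|\\{:\\|:\\}\\)\\(@\\)\\|&t@\\)";

  assert (text != NULL);

  struct re_pattern_buffer buf;
  struct re_registers regs;
  memset (&buf, 0, sizeof buf);
  memset (&regs, 0, sizeof regs);

  reg_syntax_t saved = re_set_syntax (0);   /* braces stay literal */
  const char *err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (err != NULL)
    {
      regfree (&buf);
      return -1;
    }

  int result = -1;
  if (re_match (&buf, text, (regoff_t) strlen (text), 0, &regs) >= 0)
    result = regs.start[3] >= 0 ? 1 : 0;    /* group 3 is the trailing "@" */

  /* re_match allocated the register arrays; they are still NULL if
     nothing matched, and free (NULL) is harmless.  */
  free (regs.start);
  free (regs.end);
  regfree (&buf);
  return result;
}
```

So quadrigraph_len ("@{:@") gives 1 and quadrigraph_len ("@&t@")
gives 0, which is the split between expanded length 1 and 0.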

Thus, Autoconf alone is proof that we cannot enable intervals by
default, and that any use of intervals in autoconf scripts will not be
possible until a future release of Autoconf opts-in to their use.

At BEST, all I could do for m4 1.4.20 is to add a new tri-state
command-line switch; default is existing behavior (both spellings are
silently literals), a warning mode (add a warning if the \{ spelling is
is encountered, but still treat it as a literal), or enabled (\{ works
for intervals, as desired).  Autoconf would need to opt-in to the new
warning asap, and downstream distros would have to scrub where the
warning appears.  Realistically, it would probably take several years
before we could turn on intervals by default.  (Sadly, once software
is stable, it becomes glacial to improve it).  But being able to
opt-in to intervals in the short-term would be nice.
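
Roughly, the tri-state idea could look like the sketch below.
Everything here is hypothetical: the enum, its member names and the
three helper functions do not exist in m4, no option name has been
chosen, and the scan for "\{" is deliberately naive (it does not
treat bracket expressions specially).

```c
#ifndef _GNU_SOURCE
# define _GNU_SOURCE            /* exposes the GNU API in glibc's <regex.h> */
#endif
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical modes for the proposed switch.  */
enum interval_mode { INTERVALS_OFF, INTERVALS_WARN, INTERVALS_ON };

/* Syntax bits to pass to re_set_syntax: only the "on" mode differs
   from today's hard-coded 0.  */
reg_syntax_t
syntax_for_mode (enum interval_mode mode)
{
  return mode == INTERVALS_ON ? RE_INTERVALS : 0;
}

/* Return true if PATTERN contains "\{" where the backslash is not
   itself escaped.  Naive: ignores bracket expressions.  */
bool
uses_bk_brace (const char *pattern)
{
  assert (pattern != NULL);
  for (size_t i = 0; pattern[i] != '\0'; i++)
    if (pattern[i] == '\\')
      {
        if (pattern[i + 1] == '{')
          return true;
        if (pattern[i + 1] != '\0')
          i++;                  /* skip the escaped byte, so "\\{" is no hit */
      }
  return false;
}

/* In warning mode the pattern still compiles with "\{" as a literal;
   this only decides whether a diagnostic is due.  */
bool
should_warn (enum interval_mode mode, const char *pattern)
{
  return mode == INTERVALS_WARN && uses_bk_brace (pattern);
}
```

With that, the _m4_qlen regex would trigger should_warn in the
warning mode and nothing in the default mode, which is the signal
Autoconf and the distros would need to scrub.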

Making it possible to change the syntax at runtime would be even
more invasive than a tri-state command-line option.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org

