Autoconf developers: see below for a bug report on _AC_DEFINE_UNQUOTED

Gnulib developers, maybe you have an opinion on why regex.h
documentation disagrees with reality?
tl;dr: if I do add intervals to m4 regex, would you rather it be \{\}
(BRE and emacs style) or {} (ERE style)? And how to avoid breaking
existing m4 scripts?

On Fri, Apr 04, 2025 at 08:47:20AM -0500, Eric Blake wrote:
> On Fri, Nov 04, 2022 at 04:25:45AM +0300, Van de Bugger wrote:
> > M4 documentation for regular expressions is extremely short:
> > https://www.gnu.org/software/m4/manual/html_node/Regexp.html
> > No regular expression syntax is explained, it just refers to GNU Emacs
> > Manual. In turn, GNU Emacs Manual:
> > https://www.gnu.org/software/emacs/manual/html_node/emacs/Regexps.html
> > states that \{n\} is repetition counter:
> >
> > > For example, ‘x\{4\}’ matches the string ‘xxxx’ and nothing else.
> >
> > However, m4 recognizes neither \{n\} nor {n}.
>
> This has bothered me too, over the years.
>
> Even though it would be a new feature to enable \{\}, I can't see how
> portable GNU m4 programs would have been relying on that matching
> literal left curly brace. I'm seriously thinking about turning this
> feature on for 1.4.20.

I'm waffling. The comments in regex.h (from gnulib) clearly say that
the 'emacs' mode of re_set_syntax (when re_syntax_options == 0)
recognizes bare + and ? as operators, that RE_INTERVALS==0 means both
{ and \{ are literals, and that RE_NO_BK_BRACES controls whether { or
\{ is magic but only when RE_INTERVALS is set. But clearly, emacs
supports \{ intervals, even though a grep of emacs.git does not find
any hits for RE_INTERVALS outside of src/regex.[ch]. So either the
comments in regex.h about 0 being emacs syntax are wrong, or I'm
totally missing how emacs supports intervals.

For some contrast, in BRE (POSIX basic regular expression), all of (,
+, ?, |, and { are literals, while \( is grouping, \{ is intervals,
and \+, \?, and \| are up to the implementation on whether they are
literal or meta.
In ERE (POSIX extended regular expression), all of (, +, ?, |, and {
are operators, while \(, \+, \?, \|, and \{ are literals (as seen in
'grep' vs. 'grep -E'). The all-or-none factor is a convenience to
remember; anything else feels like a disservice to users, if the
practice is not already long-standing.

emacs syntax is closer to BRE (in that grouping uses "\(" rather than
"("), but not quite (in that "+" rather than "\+" is meta). And since
POSIX does not mandate regex in m4 at all, having GNU M4 be more like
emacs (rather than more like BRE or ERE) is a good goal. The fact
that emacs uses "\{" for intervals works in our favor - since we
already require \( and \|, \{ feels more like BRE (although we do have
+ rather than \+).

One of the ideas on the m4 2.0 branch (which is nowhere near usable)
was to let users control the regex syntax at runtime, rather than
hard-coding to a single syntax, and then expanding the manual to
document the different syntax choices possible. In addition to
choosing which characters are meta, there are per-match tweaks that
can be useful, such as the ability to choose whether newline or NUL
match ".", or whether the match is case-insensitive.

>
> > I was able to build m4 from the sources. I found regcomp.c file with
> > the regular expression compiler. The regular expression syntax is
> > controlled by re_syntax_options file scope variable, which can be set
> > by re_set_syntax function. In my experiments, re_set_syntax is not
> > called, re_syntax_options is always zero, so braces are treated
> > literally.
> >
> > If I initialize re_syntax_options to RE_INTERVALS, e. g.:
> >
> > reg_syntax_t re_syntax_options = RE_INTERVALS;

Yes, setting that variable (or calling re_set_syntax()) would be how
to do it.

> >
> > m4 recognizes \{n\} as repetition counter.
> >
> > Thus, m4 is able to recognize \{n\} as repetition counter, but (for
> > unknown to me reason) this feature is disabled. I failed to trace it
> > further.
> >
> > m4 manual does not document if this feature can be enabled or disabled
> > at build or run time, so I assume it should be enabled, as the \{n\}
> > construct is documented in GNU Emacs Manual, referred by GNU m4 Manual.
> >
> > BTW, the fixed m4 passes 247 tests, skips 20 tests, and fails no tests
> > on my system. Original m4 shows exactly the same results.

That only means the testsuite has no coverage of the current behavior,
where both spellings { and \{ are literals. It doesn't tell you how
many other scripts might break.

So I at least tried to find possible problems. A quick grep of
autoconf source finds at least:

lib/autoconf/general.m4:[m4_if(m4_bregexp([$1], [#\|\\\|`\|\(\$\|@S|@\)\((|{|@{:@\)]), [-1],

in the definition of _AC_DEFINE_UNQUOTED. That regex is asking for a
literal "#", "\", "`", or the concatenation of [either "$" or the
quadrigraph "@S|@" (an alias for "$")] with [literal "(|{|"
concatenated with the quadrigraph @{:@ (for "(")]. Despite my HUH?[*]
factor, this appears to WANT to match a "{" as a literal. At any
rate, since it is using bare "{" with the next byte neither a digit
nor ",", it would break if we turn on "{}" for intervals by default.

[*] I think that's a bug in autoconf (hence why I added them in cc).
If the _intent_ was to match literal "#", "\", "`", and the sequences
"$(" and "${" (either as literals or via quadrigraphs), it isn't quite
doing that. The correct spelling (with added whitespace for clarity,
although the whitespace would not be part of the regex) would be

[ # \| \\ \| ` \| \( \$ \| @S|@ \) \( ( \| { \| @{:@ \) ]

But then there is this:

lib/m4sugar/m4sugar.m4: [@\(\(<:\|:>\|S|\|%:\|\{:\|:\}\)\(@\)\|&t@\)],

in the definition of _m4_qlen, which is "@" followed by either [ one
of ( "<:", ":>", "S|", "%:", "{:", ":}" ) followed by "@" ], or
[ "&t@" ]. The spelling chosen there is done to differentiate between
most quadrigraphs (which have expanded length 1) and @&t@ (which has
expanded length 0).
At any rate, this clearly uses "\{" as a literal rather than an
interval, and again, with the next byte neither a digit nor a ",", it
will fail to compile if we turn on \{\} for intervals by default.

Thus, Autoconf alone is proof that we cannot enable intervals by
default, and that any use of intervals in autoconf scripts will not be
possible until a future release of Autoconf opts in to their use.

At BEST, all I could do for m4 1.4.20 is to add a new tri-state
command-line switch: the default is existing behavior (both spellings
are silently literals), a warning mode (add a warning if the \{
spelling is encountered, but still treat it as a literal), or enabled
(\{ works for intervals, as desired). Autoconf would need to opt in
to the new warning asap, and downstream distros would have to scrub
where the warning appears. Realistically, it would probably take
several years before we could turn on intervals by default. (Sadly,
once software is stable, it becomes glacial to improve it.) But being
able to opt in to intervals in the short-term would be nice.

Making it possible to change the syntax at runtime would be even more
invasive than a tri-state command-line option.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org
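
P.S. For anyone who wants to poke at the syntax bits outside of m4,
here is a minimal sketch. It is not m4's code: it assumes glibc's
<regex.h> built with _GNU_SOURCE (so that re_set_syntax, RE_INTERVALS
and RE_NO_BK_BRACES are visible), and the helper name matches_whole is
made up for illustration. It compiles a pattern under a given syntax
value and reports whether the pattern matches an entire string.

```c
#define _GNU_SOURCE   /* exposes re_set_syntax and the RE_* syntax bits */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper (not part of m4 or gnulib): return 1 if PATTERN,
   compiled under SYNTAX, matches all of TEXT; 0 if it does not; and
   -1 if PATTERN fails to compile under SYNTAX.  */
int
matches_whole (reg_syntax_t syntax, const char *pattern, const char *text)
{
  struct re_pattern_buffer buf;
  reg_syntax_t saved;
  const char *err;
  regoff_t len, matched;

  assert (pattern != NULL && text != NULL);
  memset (&buf, 0, sizeof buf);   /* no fastmap, no translate table */

  /* The syntax is global state, so restore it after compiling.  */
  saved = re_set_syntax (syntax);
  err = re_compile_pattern (pattern, strlen (pattern), &buf);
  re_set_syntax (saved);
  if (err != NULL)
    return -1;

  /* re_match anchors at offset 0 and returns the matched length,
     or a negative value when there is no match.  */
  len = (regoff_t) strlen (text);
  matched = re_match (&buf, text, len, 0, NULL);
  regfree (&buf);
  return matched == len;
}
```

Under these assumptions, matches_whole (0, "x\\{4\\}", "xxxx") returns
0 while matches_whole (0, "x\\{4\\}", "x{4}") returns 1 (the braces
are literals), and matches_whole (RE_INTERVALS, "x\\{4\\}", "xxxx")
returns 1. Adding RE_NO_BK_BRACES to RE_INTERVALS moves the interval
meaning to the bare "{" spelling instead.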