On Tue, Apr 15, 2025 at 06:34:35PM -0400, Nikolaos Chatzikonstantinou wrote:
> > Since there's already another long thread on how m4 does not match
> > current emacs regex but why enabling intervals would break at least
> > autoconf 2.72, I'm inclined to update the m4 manual rather than use
> > RE_SYNTAX_EMACS, whether or not this patch is accepted.
> 
> I'm having a bit of issue following, but this is relevant to me, so
> I'd like to ask the following questions:
> 
> 1) <regex.h> has two interfaces, the old glibc one that gnulib
> implements and the POSIX one with regcomp() and regexec(). What I've
> noticed is inconsistency between the two interfaces in syntax:
> 
>     # m4 regexp that matches:
>     regexp(`foo', `[a-z]+')
> 
> This will not match with POSIX:
> 
>     regcomp(&re, "[a-z]+", 0);
>     assert(regexec(&re, "foo", 0, NULL, 0) == REG_NOMATCH);
> 
> The reason is that POSIX BRE wants [a-z]\+ instead. So the question
> is, does this mean the two interfaces have incompatible syntaxes?

m4 uses re_compile_pattern() with syntax 0 (which at one point used to
be RE_SYNTAX_EMACS, but this thread shows that is no longer the case).
regcomp() is a POSIX interface, but it basically forces a syntax of
either RE_SYNTAX_POSIX_EXTENDED or RE_SYNTAX_POSIX_BASIC.

The re_compile_pattern() interface is superior: it offers greater
flexibility to the user, and is a superset of the regcomp() interface
(which can only choose between two syntax levels, rather than the
wider range of re_compile_pattern() syntaxes and individual feature
knobs).
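
To make the difference concrete, here is a minimal sketch of driving the GNU interface with different syntaxes. It assumes glibc (or gnulib's replacement) with _GNU_SOURCE defined; gnu_search() is a hypothetical helper name of mine, not something m4 or gnulib provides, and I pass a literal 0 rather than RE_SYNTAX_EMACS so the sketch keeps meaning "syntax 0" even if that macro changes.

```c
#define _GNU_SOURCE   /* exposes the GNU regex API in glibc's <regex.h> */
#include <assert.h>
#include <regex.h>
#include <string.h>

/* Hypothetical helper: compile PATTERN under SYNTAX and search TEXT.
   Returns the offset of the first match, -1 if there is no match,
   or -2 if the pattern does not compile or matching fails internally. */
int gnu_search(reg_syntax_t syntax, const char *pattern, const char *text)
{
    struct re_pattern_buffer buf;
    reg_syntax_t saved;
    const char *err;
    int pos = -2;

    assert(pattern != NULL && text != NULL);
    memset(&buf, 0, sizeof buf);      /* let the library allocate storage */

    saved = re_set_syntax(syntax);    /* the syntax is a process-wide global */
    err = re_compile_pattern(pattern, strlen(pattern), &buf);
    re_set_syntax(saved);             /* restore whatever was set before */

    if (err == NULL)
        pos = (int) re_search(&buf, text, (regoff_t) strlen(text),
                              0, (regoff_t) strlen(text), NULL);
    regfree(&buf);
    return pos;
}
```

With glibc I would expect gnu_search(0, "[a-z]+", "foo") to return 0 and gnu_search(RE_SYNTAX_POSIX_BASIC, "[a-z]+", "foo") to return -1, mirroring your regcomp() result. Turning on a single knob also works: gnu_search(RE_INTERVALS, "a\\{2\\}", "aa") should find a match, whereas under syntax 0 the same pattern only matches the literal text "a{2}".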

> I don't think that's clarified in either the glibc manual
> <https://www.gnu.org/software/libc/manual/html_node/Regular-Expressions.html>
> and gnulib's
> <https://www.gnu.org/software/gnulib/manual/html_node/The-Backslash-Character.html>.
> Perhaps gnulib should be agnostic of this issue (although worth a
> mention?) but certainly glibc should mention it.

Gnulib does have a way to list ALL of the regex flavors; the
regexprops-generic module creates doc/regexprops-generic.texi as a
drop-in chapter to any larger project's manual that exposes the choice
of syntax to the end user.  And GNU findutils does just that (you can
use 'find --regextype=...' with 'emacs', 'posix-awk', 'posix-basic',
'posix-egrep', 'posix-extended'):
https://www.gnu.org/software/findutils/manual/html_mono/find.html#Regular-Expressions

This thread deals with the fact that 'emacs' syntax has changed over
the years (prior to 2001, it did not have intervals or character
classes; nowadays emacs has those but programs using syntax 0 like m4
do not).

And one thought is that a future m4 may also expose the ability to
choose syntax from this same set.

Meanwhile, I have already patched the upcoming GNU m4 1.4.20 manual to
be a bit more specific about the syntax it does support, without
changing the syntax (1.4.x should remain backwards-compatible; any
changes to syntax or the ability to let the user control syntaxes
rather than a single syntax being hard-coded would be new to 1.6 or
2.0).
https://git.sv.gnu.org/cgit/m4.git/commit/?h=branch-1.4&id=c8a6346c

> 
> 2) Is there going to be a change planned in either gnulib, glibc, or
> m4 in terms of the regex syntax? If m4 breaks backwards compatibility,
> how will all the m4 scripts be fixed? Isn't it nontrivial?

The current discussion is on fixing gnulib so that 'emacs' syntax and
syntax 0 are no longer synonymous (i.e., make 'emacs' syntax actually
match what emacs has done since 2001); this fix is currently
independent of glibc, although glibc will likely be changed soon, and
gnulib will then go back to mirroring glibc.

Changing m4 syntax is not trivial.  That's why m4 1.4.20 will still
use syntax 0 (no change), but will attempt to document the situation
better.  I'm still struggling to find an easy way for m4 to diagnose
scripts that use \{ non-portably, so that users could opt in to
warnings about a regex that may compile differently in the future
(alas, m4's debugmode() builtin macro is not yet easily extensible,
and changing that also risks backwards-compatibility headaches).
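
For illustration only, the kind of check I have in mind could look like the sketch below. find_backslash_brace() is a hypothetical name and is not in m4; it is a rough scanner rather than a real parser, and it assumes syntax 0 rules, where a backslash inside a bracket expression is an ordinary character and there are no [:class:] forms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of m4: return the offset of the first
   backslash-brace in PATTERN that an interval-aware syntax would read
   as the start of an interval, or -1 if there is none. */
long find_backslash_brace(const char *pattern)
{
    size_t i = 0;

    assert(pattern != NULL);
    while (pattern[i] != '\0') {
        if (pattern[i] == '[') {
            /* Skip a bracket expression; backslash is literal inside it. */
            size_t j = i + 1;
            if (pattern[j] == '^')
                j++;
            if (pattern[j] == ']')
                j++;                  /* a leading ] is a literal member */
            while (pattern[j] != '\0' && pattern[j] != ']')
                j++;
            if (pattern[j] == '\0')
                return -1;            /* unterminated bracket: give up */
            i = j + 1;
        } else if (pattern[i] == '\\' && pattern[i + 1] != '\0') {
            if (pattern[i + 1] == '{')
                return (long) i;      /* would become an interval opener */
            i += 2;                   /* skip the escaped character */
        } else {
            i++;
        }
    }
    return -1;
}
```

A caller could warn whenever the result is not -1. Under syntax 0 such a pattern currently matches a literal {, so it is exactly the kind of regex whose meaning would change if intervals were enabled.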

> 
> 3) What syntax does m4 follow after all? Should it be called the Emacs
> syntax or will that passage be changed from the manual?

That passage will be changed for 1.4.20 (see above).  It is the
pre-2001 emacs syntax, aka syntax 0.

If you want a quicker table, I can attempt to provide one (note that
POSIX BREs are not actually required to support \+, \? or \|, but the
glibc and gnulib implementations do):

syntax          0-or-more  1-or-more  1-or-0  grouping/alternation  intervals  charclasses
0 (old emacs)       *          +        ?          \( \| \)            n/a        n/a
emacs               *          +        ?          \( \| \)           \{ \}     [[:...:]]
posix-basic         *         \+       \?          \( \| \)           \{ \}     [[:...:]]
posix-extended      *          +        ?           (  |  )            {  }     [[:...:]]
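
The two POSIX rows can be spot-checked with plain regcomp(), which needs no GNU extensions. This is only a sketch; posix_matches() is a hypothetical helper name, and I stick to constructs that POSIX itself requires (so no \+, \? or \| in the BRE cases).

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: 1 if PATTERN matches somewhere in TEXT, 0 if it
   does not, -1 if PATTERN fails to compile.  Pass 0 in CFLAGS for a
   BRE (posix-basic) or REG_EXTENDED for an ERE (posix-extended). */
int posix_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, posix_matches("^\\(ab\\)\\{2\\}$", "abab", 0) and posix_matches("^(ab|cd){2}$", "abcd", REG_EXTENDED) should both return 1, while posix_matches("^[a-z]+$", "foo", 0) returns 0 because a bare + is an ordinary character in a BRE.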

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org

