On Tue, Apr 15, 2025 at 06:34:35PM -0400, Nikolaos Chatzikonstantinou wrote:
> > Since there's already another long thread on how m4 does not match
> > current emacs regex but why enabling intervals would break at least
> > autoconf 2.72, I'm inclined to update the m4 manual rather than use
> > RE_SYNTAX_EMACS, whether or not this patch is accepted.
>
> I'm having a bit of issue following, but this is relevant to me, so
> I'd like to ask the following questions:
>
> 1) <regex.h> has two interfaces, the old glibc one that gnulib
> implements and the POSIX one with regcomp() and regexec(). What I've
> noticed is inconsistency between the two interfaces in syntax:
>
> # m4 regexp that matches:
> regexp(`foo', `[a-z]+')
>
> This will not match with POSIX:
>
> regcomp(&re, "[a-z]+", 0);
> assert(regexec(&re, "foo", 0, NULL, 0) == REG_NOMATCH);
>
> The reason is that POSIX BRE wants [a-z]\+ instead. So the question
> is, does this mean the two interfaces have incompatible syntaxes?

m4 uses re_compile_pattern() with syntax 0 (which at one point used to
be RE_SYNTAX_EMACS, but this thread shows that is no longer the case).
regcomp() is a POSIX interface, but it basically forces a syntax of
either RE_SYNTAX_POSIX_EXTENDED or RE_SYNTAX_POSIX_BASIC.  The
re_compile_pattern() interface is superior: it offers greater
flexibility to the user, and is a superset of the regcomp() interface
(which can only choose between two syntax levels, rather than the wider
range of re_compile_pattern() syntaxes and individual feature knobs).

> I
> don't think that's clarified in either the glibc manual
> <https://www.gnu.org/software/libc/manual/html_node/Regular-Expressions.html>
> and gnulib's
> <https://www.gnu.org/software/gnulib/manual/html_node/The-Backslash-Character.html>.
> Perhaps
> gnulib should be agnostic of this issue (although worth a mention?)
> but certainly glibc should mention it.

Gnulib does have a way to list ALL of the regex flavors; the
regexprops-generic module creates doc/regexprops-generic.texi as a
drop-in chapter to any larger project's manual that exposes the choice
of syntax to the end user.  And GNU findutils does just that (you can
use 'find --regextype=...' with 'emacs', 'posix-awk', 'posix-basic',
'posix-egrep', 'posix-extended'):

https://www.gnu.org/software/findutils/manual/html_mono/find.html#Regular-Expressions

This thread deals with the fact that 'emacs' syntax has changed over
the years (prior to 2001, it did not have intervals or character
classes; nowadays emacs has those but programs using syntax 0 like m4
do not).  And one thought is that a future m4 may also expose the
ability to choose syntax from this same set.
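
To make the regcomp() side of this concrete, here is a minimal sketch of
how the single cflags argument selects between the two POSIX syntaxes.
The helper name posix_matches() is hypothetical (it is not part of m4,
gnulib, or glibc), and it treats a compile error as "no match" purely to
keep the example short:

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only: compile PATTERN with
   regcomp() using CFLAGS (0 selects POSIX BRE, REG_EXTENDED selects
   POSIX ERE) and report whether it matches anywhere in TEXT.  */
bool
posix_matches (const char *pattern, const char *text, int cflags)
{
  regex_t re;

  assert (pattern != NULL && text != NULL);

  /* REG_NOSUB: only a yes/no answer is needed, not submatch offsets.  */
  if (regcomp (&re, pattern, cflags | REG_NOSUB) != 0)
    return false;   /* assumption: a compile error counts as no match */

  bool matched = regexec (&re, text, 0, NULL, 0) == 0;
  regfree (&re);
  return matched;
}
```

With this helper, posix_matches ("[a-z]+", "foo", 0) is false because a
bare + is an ordinary character in BRE, while the same pattern with
REG_EXTENDED matches.  The portable BRE spelling is [a-z][a-z]*; the
[a-z]\+ form relies on an extension that POSIX does not require.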

Meanwhile, I have already patched the upcoming GNU m4 1.4.20 manual to
be a bit more specific about the syntax it does support, without
changing the syntax (1.4.x should remain backwards-compatible; any
changes to syntax, or the ability to let the user control syntaxes
rather than a single syntax being hard-coded, would be new to 1.6 or
2.0).

https://git.sv.gnu.org/cgit/m4.git/commit/?h=branch-1.4&id=c8a6346c

>
> 2) Is there going to be a change planned in either gnulib, glibc, or
> m4 in terms of the regex syntax? If m4 breaks backwards compatibility,
> how will all the m4 scripts be fixed? Isn't it nontrivial?

The current discussion is on fixing gnulib so that 'emacs' syntax and
syntax 0 are no longer synonymous (i.e., make 'emacs' syntax actually
match what emacs has done since 2001); this fix is currently
independent of glibc, although glibc will likely be changed soon and
gnulib will go back to mirroring glibc.

Changing m4 syntax is not trivial.  That's why m4 1.4.20 will still be
syntax 0 (no change), but will attempt to document the situation
better.  I'm struggling to even figure out how m4 could make it easy to
diagnose scripts that use \{ non-portably, so that it becomes possible
to opt in to warnings about a regex that may compile differently in the
future (alas, m4's debugmode() builtin macro is not yet easily
extensible, and changing that also risks backwards-compatibility
headaches).

>
> 3) What syntax does m4 follow after all? Should it be called the Emacs
> syntax or will that passage be changed from the manual?

That passage will be changed for 1.4.20 (see above).  It is the
pre-2001 emacs syntax, aka syntax 0.  If you want a quicker table, I
can attempt to provide one (note that POSIX BRE do not actually have to
support \+ \? or \|; but the glibc and gnulib implementations do):

feature               0-or-more  1-or-more  1-or-0  grouping/alternation  intervals  charclasses
syntax 0 (old emacs)  *          +          ?       \( \| \)              n/a        n/a
emacs                 *          +          ?       \( \| \)              \{ \}      [[:...:]]
posix-basic           *          \+         \?      \( \| \)              \{ \}      [[:...:]]
posix-extended        *          +          ?       ( | )                 { }        [[:...:]]

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org