On Wed, Apr 16, 2025, 10:22 AM Eric Blake <ebl...@redhat.com> wrote:
> On Tue, Apr 15, 2025 at 06:34:35PM -0400, Nikolaos Chatzikonstantinou wrote:
> > > Since there's already another long thread on how m4 does not match
> > > current emacs regex but why enabling intervals would break at least
> > > autoconf 2.72, I'm inclined to update the m4 manual rather than use
> > > RE_SYNTAX_EMACS, whether or not this patch is accepted.
> >
> > I'm having a bit of issue following, but this is relevant to me, so
> > I'd like to ask the following questions:
> >
> > 1) <regex.h> has two interfaces, the old glibc one that gnulib
> > implements and the POSIX one with regcomp() and regexec(). What I've
> > noticed is inconsistency between the two interfaces in syntax:
> >
> >     # m4 regexp that matches:
> >     regexp(`foo', `[a-z]+')
> >
> > This will not match with POSIX:
> >
> >     regcomp(&re, "[a-z]+", 0);
> >     assert(regexec(&re, "foo", 0, NULL, 0) == REG_NOMATCH);
> >
> > The reason is that POSIX BRE wants [a-z]\+ instead. So the question
> > is, does this mean the two interfaces have incompatible syntaxes?
>
> m4 uses re_compile_pattern() with syntax 0 (which at one point used to
> be RE_SYNTAX_EMACS, but this thread shows that is no longer the case).
> regcomp() is a POSIX interface, but it basically forces a syntax of
> either RE_SYNTAX_POSIX_EXTENDED or RE_SYNTAX_POSIX_BASIC.
>
> The re_compile_pattern() interface offers greater flexibility to the
> user, and is a superset of the regcomp() interface (which can only
> choose between two syntax levels, rather than the wider range of
> re_compile_pattern() syntaxes and individual feature knobs).
>
> > I don't think that's clarified in either the glibc manual
> > <https://www.gnu.org/software/libc/manual/html_node/Regular-Expressions.html>
> > and gnulib's
> > <https://www.gnu.org/software/gnulib/manual/html_node/The-Backslash-Character.html>.
> > Perhaps gnulib should be agnostic of this issue (although worth a
> > mention?) but certainly glibc should mention it.
>
> Gnulib does have a way to list ALL of the regex flavors; the
> regexprops-generic module creates doc/regexprops-generic.texi as a
> drop-in chapter to any larger project's manual that exposes the choice
> of syntax to the end user. And GNU findutils does just that (you can
> use 'find --regextype=...' with 'emacs', 'posix-awk', 'posix-basic',
> 'posix-egrep', 'posix-extended'):
>
> https://www.gnu.org/software/findutils/manual/html_mono/find.html#Regular-Expressions
>
> This thread deals with the fact that 'emacs' syntax has changed over
> the years (prior to 2001, it did not have intervals or character
> classes; nowadays emacs has those but programs using syntax 0 like m4
> do not).
>
> And one thought is that a future m4 may also expose the ability to
> choose syntax from this same set.
>
> Meanwhile, I have already patched the upcoming GNU m4 1.4.20 manual to
> be a bit more specific about the syntax it does support, without
> changing the syntax (1.4.x should remain backwards-compatible; any
> changes to syntax or the ability to let the user control syntaxes
> rather than a single syntax being hard-coded would be new to 1.6 or
> 2.0).
> https://git.sv.gnu.org/cgit/m4.git/commit/?h=branch-1.4&id=c8a6346c
>
> > 2) Is there going to be a change planned in either gnulib, glibc, or
> > m4 in terms of the regex syntax? If m4 breaks backwards compatibility,
> > how will all the m4 scripts be fixed? Isn't it nontrivial?
>
> The current discussion is on fixing gnulib so that 'emacs' syntax and
> syntax 0 are no longer synonymous (i.e., make 'emacs' syntax actually
> match what emacs has done since 2001); this fix is currently
> independent of glibc, although glibc will likely be changed soon and
> gnulib go back to mirroring glibc.
>
> Changing m4 syntax is not trivial. That's why m4 1.4.20 will still be
> syntax 0 (no change), but will attempt to document the situation
> better. I'm struggling to even figure out how to make it easy for m4
> to diagnose scripts that use \{ non-portably, so that it becomes
> possible to opt in to warnings about a regex that may compile
> differently in the future (alas, m4's debugmode() builtin macro is not
> yet easily extensible, and changing that also risks
> backwards-compatibility headaches).
>
> > 3) What syntax does m4 follow after all? Should it be called the Emacs
> > syntax or will that passage be changed from the manual?
>
> That passage will be changed for 1.4.20 (see above). It is the
> pre-2001 emacs syntax, aka syntax 0.
>
> If you want a quicker table, I can attempt to provide one (note that
> POSIX BRE do not actually have to support \+ \? or \|; but glibc and
> gnulib's implementation does):
>
> feature          0-or-more  1-or-more  1-or-0  grouping/alternation  intervals  charclasses
> syntax
> 0 (old emacs)    *          +          ?       \( \| \)              n/a        n/a
> emacs            *          +          ?       \( \| \)              \{ \}      [[:...:]]
> posix-basic      *          \+         \?      \( \| \)              \{ \}      [[:...:]]
> posix-extended   *          +          ?       ( | )                 { }        [[:...:]]

Thank you for the response. I should confess that I'm currently
rewriting GNU m4 in Python, aiming for 100% compatibility with 1.4.19.
I ran into trouble implementing regexp() because Python's regex engine
does not support that syntax. I first wrote a shim for regcomp(), but
then realized its syntax is incompatible. I have now written one for
the GNU interface (the re_* functions).

The story of how I ended up attempting to rewrite m4 is long-winded,
but in any case I was encouraged by seeing the issues that followed
recent compiler build failures. I'm hoping that this is a welcome
effort? I expect it to be done within about a month. I then intend to
use the Python implementation as the basis for a Rust one. Both will
be GPLv3+ licensed.

Regards,
Nikolaos Chatzikonstantinou
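
For illustration only, here is a minimal sketch of how the "0 (old emacs)"
row of the table could be mapped onto Python's re module. It is not the
shim described above. The names syntax0_to_python() and m4_regexp_index()
are hypothetical, and only the operators listed in the table are handled:
\( \| \) become ( | ), while bare ( ) { } | become literals. GNU-specific
escapes such as \< \> \` \' and edge cases such as a leading * are not
translated.

```python
import re


def syntax0_to_python(pattern: str) -> str:
    """Translate a subset of 'syntax 0' regex into Python re syntax.

    Hypothetical helper: covers only grouping, alternation, and the
    characters that are literal in syntax 0 but special in Python.
    """
    out = []
    i, n = 0, len(pattern)
    while i < n:
        c = pattern[i]
        if c == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in "()|":
                out.append(nxt)          # \( \) \| are operators in syntax 0
            else:
                out.append("\\" + nxt)   # pass other escapes through unchanged
            i += 2
        elif c in "(){}|":
            out.append("\\" + c)         # bare ( ) { } | are literals in syntax 0
            i += 1
        elif c == "[":
            # Copy a bracket expression as a unit so its contents are not
            # rewritten; a ']' right after '[' or '[^' is a literal.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1
            while j < n and pattern[j] != "]":
                j += 1
            # Backslash is an ordinary character inside a POSIX-style
            # bracket expression, so escape it for Python.
            out.append(pattern[i:j + 1].replace("\\", "\\\\"))
            i = j + 1
        else:
            out.append(c)
            i += 1
    return "".join(out)


def m4_regexp_index(string: str, pattern: str) -> int:
    """Mimic two-argument regexp(): index of the first match, or -1."""
    m = re.search(syntax0_to_python(pattern), string)
    return m.start() if m else -1
```

With this sketch, m4_regexp_index("foo", "[a-z]+") finds a match at the
start of the string, in line with the m4 example quoted earlier, while a
pattern such as a{2} is treated as the literal text "a{2}" because
syntax 0 has no intervals.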