On Wed, Apr 16, 2025, 10:22 AM Eric Blake <ebl...@redhat.com> wrote:

> On Tue, Apr 15, 2025 at 06:34:35PM -0400, Nikolaos Chatzikonstantinou
> wrote:
> > > Since there's already another long thread on how m4 does not match
> > > current emacs regex but why enabling intervals would break at least
> > > autoconf 2.72, I'm inclined to update the m4 manual rather than use
> > > RE_SYNTAX_EMACS, whether or not this patch is accepted.
> >
> > I'm having a bit of trouble following, but this is relevant to me, so
> > I'd like to ask the following questions:
> >
> > 1) <regex.h> has two interfaces, the old glibc one that gnulib
> > implements and the POSIX one with regcomp() and regexec(). What I've
> > noticed is inconsistency between the two interfaces in syntax:
> >
> >     # m4 regexp that matches:
> >     regexp(`foo', `[a-z]+')
> >
> > This will not match with POSIX:
> >
> >     regex_t re;
> >     regcomp(&re, "[a-z]+", 0);  /* cflags 0 selects POSIX BRE */
> >     assert(regexec(&re, "foo", 0, NULL, 0) == REG_NOMATCH);
> >
> > The reason is that POSIX BRE wants [a-z]\+ instead. So the question
> > is, does this mean the two interfaces have incompatible syntaxes?
>
> m4 uses re_compile_pattern() with syntax 0 (which at one point used to
> be RE_SYNTAX_EMACS, but this thread shows that is no longer the case).
> regcomp() is a POSIX interface, but it basically forces a syntax of
> either RE_SYNTAX_POSIX_EXTENDED or RE_SYNTAX_POSIX_BASIC.
>
> The re_compile_pattern() interface is superior: it offers greater
> flexibility to the user, and is a superset of the regcomp() interface
> (which can only choose between two syntax levels, rather than the
> wider range of re_compile_pattern() syntaxes and individual feature
> knobs).
>
> > I don't think that's clarified in either the glibc manual
> > <https://www.gnu.org/software/libc/manual/html_node/Regular-Expressions.html>
> > or gnulib's
> > <https://www.gnu.org/software/gnulib/manual/html_node/The-Backslash-Character.html>.
> > Perhaps gnulib should be agnostic of this issue (although worth a mention?)
> > but certainly glibc should mention it.
>
> Gnulib does have a way to list ALL of the regex flavors; the
> regexprops-generic module creates doc/regexprops-generic.texi as a
> drop-in chapter to any larger project's manual that exposes the choice
> of syntax to the end user.  And GNU findutils does just that (you can
> use 'find --regextype=...' with 'emacs', 'posix-awk', 'posix-basic',
> 'posix-egrep', 'posix-extended'):
>
> https://www.gnu.org/software/findutils/manual/html_mono/find.html#Regular-Expressions
>
> This thread deals with the fact that 'emacs' syntax has changed over
> the years (prior to 2001, it did not have intervals or character
> classes; nowadays emacs has those but programs using syntax 0 like m4
> do not).
>
> And one thought is that a future m4 may also expose the ability to
> choose syntax from this same set.
>
> Meanwhile, I have already patched the upcoming GNU m4 1.4.20 manual to
> be a bit more specific about the syntax it does support, without
> changing the syntax (1.4.x should remain backwards-compatible; any
> changes to syntax or the ability to let the user control syntaxes
> rather than a single syntax being hard-coded would be new to 1.6 or
> 2.0).
> https://git.sv.gnu.org/cgit/m4.git/commit/?h=branch-1.4&id=c8a6346c
>
> >
> > 2) Is there going to be a change planned in either gnulib, glibc, or
> > m4 in terms of the regex syntax? If m4 breaks backwards compatibility,
> > how will all the m4 scripts be fixed? Isn't it nontrivial?
>
> The current discussion is on fixing gnulib so that 'emacs' syntax and
> syntax 0 are no longer synonymous (i.e., make 'emacs' syntax actually
> match what emacs has done since 2001); this fix is currently
> independent of glibc, although glibc will likely be changed soon and
> gnulib will then go back to mirroring glibc.
>
> Changing m4 syntax is not trivial.  That's why m4 1.4.20 will still be
> syntax 0 (no change), but will attempt to document the situation
> better.  I'm struggling to even figure out how m4 could make it easy
> to diagnose scripts that use \{ non-portably, so that it becomes
> possible to opt in to warnings about a regex that may compile
> differently in the future (alas, m4's debugmode() builtin macro is not
> yet easily extensible, and changing that also risks
> backwards-compatibility headaches).
>
> >
> > 3) What syntax does m4 follow after all? Should it be called the Emacs
> > syntax or will that passage be changed from the manual?
>
> That passage will be changed for 1.4.20 (see above).  It is the
> pre-2001 emacs syntax, aka syntax 0.
>
> If you want a quicker table, I can attempt to provide one (note that
> POSIX BRE does not actually have to support \+, \? or \|, but the glibc
> and gnulib implementations do):
>
> syntax \ feature  0-or-more  1-or-more  1-or-0  grouping/alternation  intervals  charclasses
> 0 (old emacs)     *          +          ?       \( \| \)              n/a        n/a
> emacs             *          +          ?       \( \| \)              \{ \}      [[:...:]]
> posix-basic       *          \+         \?      \( \| \)              \{ \}      [[:...:]]
> posix-extended    *          +          ?       (  |  )               {  }       [[:...:]]
>

Thank you for the response. I should confess that I'm currently rewriting
GNU m4 in Python, aiming for 100% compatibility with 1.4.19. I ran into
trouble implementing regexp() because Python doesn't support that syntax.
I wrote a shim for regcomp() but then realized it's incompatible. I have
now written one for the GNU interface (re_* functions).
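
To make the syntax mismatch concrete, here is a minimal sketch (not the shim itself) of how the operator spellings from the syntax-0 row of the table could be rewritten into Python `re` spelling. The function name `syntax0_to_python` is hypothetical. It assumes that, with intervals disabled, both `{` and `\{` are literal braces, and it copies bracket expressions verbatim. It deliberately ignores everything else where the two engines differ (for example `\<`, `\>`, anchors, and backslashes inside brackets), so it is an illustration rather than a compatible implementation.

```python
import re

# Characters that are ordinary in syntax 0 but special in Python's `re`.
_LITERAL_IN_SYNTAX_0 = set("(){}|")
# Backslash escapes that are operators in syntax 0.
_OPERATOR_ESCAPES = {"(": "(", ")": ")", "|": "|"}


def syntax0_to_python(pattern: str) -> str:
    """Rewrite syntax-0 grouping/alternation/brace spellings for Python `re`.

    Hypothetical helper: only the operators listed in the table are handled.
    """
    out = []
    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\" and i + 1 < n:
            nxt = pattern[i + 1]
            if nxt in _OPERATOR_ESCAPES:
                out.append(_OPERATOR_ESCAPES[nxt])  # \( \) \| become ( ) |
            elif nxt in "{}":
                out.append("\\" + nxt)  # assumed literal: no intervals in syntax 0
            else:
                out.append(ch + nxt)  # other escapes are passed through untouched
            i += 2
        elif ch == "[":
            # Copy the bracket expression verbatim; a leading ']' is literal.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1
            while j < n and pattern[j] != "]":
                j += 1
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch in _LITERAL_IN_SYNTAX_0:
            out.append("\\" + ch)  # bare ( ) { } | are ordinary characters
            i += 1
        else:
            out.append(ch)  # * + ? and plain characters keep their spelling
            i += 1
    return "".join(out)


# Usage: the m4 pattern `\(foo\|bar\)+' becomes `(foo|bar)+' for Python.
compiled = re.compile(syntax0_to_python(r"\(foo\|bar\)+"))
```

A translation layer like this only goes so far, which is one reason a shim around the GNU re_* functions is the safer route for exact compatibility.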

How I ended up attempting to rewrite m4 is a long story, but in any
case I was encouraged by apparent issues following recent compiler build
failures.

I'm hoping that this is a welcome effort? I expect to have it done
within a month. I then intend to use the Python implementation as the
basis for a Rust one. Both will be GPLv3+ licensed.

Regards,
Nikolaos Chatzikonstantinou
