On Wed, Apr 16, 2025 at 05:45:46PM +0200, Bruno Haible wrote:
> Eric Blake wrote:
> > If you want a quicker table, I can attempt to provide one (note that
> > POSIX BRE do not actually have to support \+ \? or \|; but glibc and
> > gnulib's implementation does):
> > 
> >    feature   0-or-more 1-or-more 1-or-0 grouping/alternation intervals 
> > charclasses
> > syntax
> > 0 (old emacs)     *       +        ?       \( \| \)            n/a      n/a
> > emacs             *       +        ?       \( \| \)           \{ \}    
> > [[:...:]]
> > posix-basic       *      \+       \?       \( \| \)           \{ \}    
> > [[:...:]]
> > posix-extended    *       +        ?        (  |  )            {  }    
> > [[:...:]]
> 
> This is a very nice table. How about adding it to the Gnulib documentation?
> 
> 
> I tried to reduce the width by transposing the table, but it is not
> as pretty as in the way you wrote it:
> 
>     syntax             0 (old emacs)   emacs       posix-basic  posix-extended
> feature
> 0-or-more                 *             *           *            *
> 1-or-more                 +             +           \+           +
> 1-or-0                    ?             ?           \?           ?
> grouping/alternation      \( \| \)      \( \| \)    \( \| \)     ( | )
> intervals                 n/a           \{ \}       \{ \}        { }
> charclasses               n/a           [[:...:]]   [[:...:]]    [[:...:]]

It may also be worth listing other flavors: regex.h also documents
gnu-awk, posix-awk, grep, and egrep flavors (at which point you have
to consider whether there are more rows or columns on whether the
transposition makes sense).  Admittedly, these other flavors tend to
be refinements of BRE and ERE, where the features of this table are
still present, but where other knobs are fine-tuned (for example,
whether . matches newline or NUL, whether ^ forms an anchor inside of
a group, whether a mis-spelled interval silently parses successfully
as a literal or causes a failure to compile, and so forth) - at which
point we could also list more features in the table.

One other problem: recent POSIX has added support for non-greedy
repetition to ERE; it looks like neither glibc nor gnulib's regex has
implemented that yet, but it is an awesome feature (although the POSIX
wording is still in flux:
https://www.austingroupbugs.net/view.php?id=1857).  "For example, the
ERE "([ab]{6}|a)*?b" matches the first five characters of the string
"aaaabbbb" as this is the shortest for the minimal repetition
"*?". Matching with the least repetitions would match the first seven
characters by using one repetition of "[ab]{6}" instead of four
repetitions of "a". This distinction is only possible because the
alternatives in an ERE alternation are chosen according to which gives
the longest (or shortest) match."

On a system with glibc 2.39 and grep 3.11, I can demonstrate that we
don't yet implement the new POSIX requirement on non-greedy operators,
since:

$ echo aaaabbbb | grep --color -E '([ab]{6}|a)*?b'
aaaabbbb

colors all 8 bytes rather than just the first 5, as POSIX wants.  It's
also easy to see that REG_MINIMAL is not yet defined in regex.h, which
is essential to the implementation of ??, *?, +?, and {...}?
non-greedy ERE operators (when REG_MINIMAL is specified to regcomp(),
all repetitions without a ? suffix become minimal and the presence of
the trailing ? swaps to maximal).  So if nothing else, gnulib's
doc/posix-functions/regcomp.texi should probably mention the newer
POSIX features that we don't implement yet.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.
Virtualization:  qemu.org | libguestfs.org


Reply via email to