On Wed, Apr 16, 2025 at 05:45:46PM +0200, Bruno Haible wrote: > Eric Blake wrote: > > If you want a quicker table, I can attempt to provide one (note that > > POSIX BRE do not actually have to support \+ \? or \|; but glibc and > > gnulib's implementation does): > > > > feature 0-or-more 1-or-more 1-or-0 grouping/alternation intervals > > charclasses > > syntax > > 0 (old emacs) * + ? \( \| \) n/a n/a > > emacs * + ? \( \| \) \{ \} > > [[:...:]] > > posix-basic * \+ \? \( \| \) \{ \} > > [[:...:]] > > posix-extended * + ? ( | ) { } > > [[:...:]] > > This is a very nice table. How about adding it to the Gnulib documentation? > > > I tried to reduce the width by transposing the table, but it is not > as pretty as in the way you wrote it: > > syntax 0 (old emacs) emacs posix-basic posix-extended > feature > 0-or-more * * * * > 1-or-more + + \+ + > 1-or-0 ? ? \? ? > grouping/alternation \( \| \) \( \| \) \( \| \) ( | ) > intervals n/a \{ \} \{ \} { } > charclasses n/a [[:...:]] [[:...:]] [[:...:]]
It may also be worth listing other flavors: regex.h also documents gnu-awk, posix-awk, grep, and egrep flavors (at which point you have to consider whether there are more rows or columns on whether the transposition makes sense). Admittedly, these other flavors tend to be refinements of BRE and ERE, where the features of this table are still present, but where other knobs are fine-tuned (for example, whether . matches newline or NUL, whether ^ forms an anchor inside of a group, whether a mis-spelled interval silently parses successfully as a literal or causes a failure to compile, and so forth) - at which point we could also list more features in the table. One other problem: recent POSIX has added support for non-greedy repetition to ERE; it looks like neither glibc nor gnulib's regex has implemented that yet, but it is an awesome feature (although the POSIX wording is still in flux: https://www.austingroupbugs.net/view.php?id=1857). "For example, the ERE "([ab]{6}|a)*?b" matches the first five characters of the string "aaaabbbb" as this is the shortest for the minimal repetition "*?". Matching with the least repetitions would match the first seven characters by using one repetition of "[ab]{6}" instead of four repetitions of "a". This distinction is only possible because the alternatives in an ERE alternation are chosen according to which gives the longest (or shortest) match." On a system with glibc 2.39 and grep 3.11, I can demonstrate that we don't yet implement the new POSIX requirement on non-greedy operators, since: $ echo aaaabbbb | grep --color -E '([ab]{6}|a)*?b' aaaabbbb colors all 8 bytes rather than just the first 5, as POSIX wants. It's also easy to see that REG_MINIMAL is not yet defined in regex.h, which is essential to the implementation of ??, *?, +?, and {...}? non-greedy ERE operators (when REG_MINIMAL is specified to regcomp(), all repetitions without a ? suffix become minimal and the presence of the trailing ? swaps to maximal). So if nothing else, gnulib's doc/posix-functions/regcomp.texi should probably mention the newer POSIX features that we don't implement yet. -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org