glibc already has support for \` and \' as absolute input boundaries,
while ^ and $ are a bit more flexible (depending on
RE_CONTEXT_INDEP_ANCHORS).  Worse, ^ and $ are not portable across all
regex flavors: there are languages where they match at any newline
within the larger input.  Blindly copying a POSIX BRE or ERE regex to
or from one of these languages can therefore match different inputs.
That is a security risk when the regex is used for data validation,
since an attacker can embed a newline in the middle of the data to get
past a regex that is not anchored to a full match.  So POSIX is
seriously considering a proposal to add new escapes, portable across
more languages, that force matches to align to the beginning or end of
absolute input regardless of whether ^ and $ can match at newlines
embedded within the input.
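
To illustrate the difference with the escapes glibc already supports,
here is a minimal sketch.  The helper name pattern_matches is
hypothetical (it is not part of glibc or of this patch), and the
behavior of \` and \' through regcomp is a GNU extension, so other C
libraries may reject or interpret these escapes differently.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Returns 1 if PATTERN matches somewhere in INPUT, 0 if it does not,
   and -1 if PATTERN fails to compile.  */
int
pattern_matches (const char *pattern, const char *input, int cflags)
{
  regex_t re;

  /* REG_NOSUB: only the match/no-match result is needed.  */
  if (regcomp (&re, pattern, cflags | REG_NOSUB) != 0)
    return -1;

  int rc = regexec (&re, input, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

With cflags of REG_EXTENDED | REG_NEWLINE and the input "123\nrm -rf /",
the pattern "^[0-9]+$" should report a match on glibc, because $ is
allowed to match just before the embedded newline.  The pattern
"\\`[0-9]+\\'" (C string syntax) should not match that input, since
\' only matches at the end of the whole buffer; it still matches the
plain input "123".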

However, most other languages spell it \A and \z (or sometimes \Z)
rather than \` and \'.  The easiest way for POSIX to specify something
that is portable across languages is to pick an escape that most
languages support and which also has existing implementation practice
in C.  Therefore, a first step is letting GNU regex parse \A and \z
identically to \` and \'.

See also:
https://www.austingroupbugs.net/view.php?id=1919
https://best.openssf.org/Correctly-Using-Regular-Expressions

---

I'm also open to the idea of adding a new RE_* flag to opt-in to this
spelling on a per-compilation basis for re_compile, or even a new
REG_* for use with regcomp.
---
 posix/regcomp.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/posix/regcomp.c b/posix/regcomp.c
index 69675d81f7..848a2823a3 100644
--- a/posix/regcomp.c
+++ b/posix/regcomp.c
@@ -1885,6 +1885,7 @@ peek_token (re_token_t *token, re_string_t *input, reg_syntax_t syntax)
            token->type = OP_NOTSPACE;
          break;
        case '`':
+        case 'A':
          if (!(syntax & RE_NO_GNU_OPS))
            {
              token->type = ANCHOR;
@@ -1892,6 +1893,7 @@ peek_token (re_token_t *token, re_string_t *input, reg_syntax_t syntax)
            }
          break;
        case '\'':
+        case 'z':
          if (!(syntax & RE_NO_GNU_OPS))
            {
              token->type = ANCHOR;

base-commit: e78caeb4ff812ae19d24d65f4d4d48508154277b
-- 
2.49.0

