Greetings.

Way back in May of 2015, Nelson Beebe submitted the following
bug report for gawk:

> Date: Mon, 25 May 2015 14:21:04 -0600 (MDT)
> From: "Nelson H. F. Beebe" <be...@math.utah.edu>
> To: "Arnold Robbins" <arn...@skeeve.com>
> Cc: be...@math.utah.edu
> Subject: gawk-4.1.3 regexp error
> 
> I just ran an old (1996--date) awk program with gawk-4.1.3 and got an
> error that can be exhibited like this:
> 
>       % gawk '/[^0-9---]/ {print}'
>       gawk: cmd. line:1: error: tent of \{\}: /[^0-9---]/
> 
> As far as I can see, that is a perfectly valid range expression, and
> using three hyphens to represent one hyphen is the traditional way
> to incorporate a hyphen in the expression.

The upshot was that regex didn't support this, and I didn't (at the
time) want to tackle trying to fix it.  (I did fix the error message,
at least.)

I submitted a bug report about it. At the time, Paul Eggert said the following:

> Date: Mon, 25 May 2015 23:53:31 -0700
> From: Paul Eggert <egg...@cs.ucla.edu>
> To: arn...@skeeve.com, 20...@debbugs.gnu.org
> Subject: Re: bug#20657: Traditional range expression not accepted in regex/dfa
> 
> arn...@skeeve.com wrote:
> 
> > The bugaboo here is the "---"; it's
> > a range expression consisting of minus through minus, and apparently long
> > ago was how one got a minus into a bracket expression.
> 
> Actually, long ago expressions like '[^0-9-]' worked just as they do now,
> and it wasn't ever necessary to use trailing "---".  That being said,
> it is true that in 7th Edition Unix '[^0-9---]' meant the same thing as
> '[^0-9-]', so in that sense we have an incompatibility with 7th Edition
> Unix here.
> 
> >     $ ./src/grep '[^0-9---]' /dev/null
> >     ./src/grep: Invalid range end
> >
> > The underlying regex and, I believe, dfa routines don't accept this.
> 
> Yes, that's correct.  It's not a bug, though, as the regexp is ambiguous
> and does not conform to POSIX, which says the following about RE
> bracket expressions: "To use a <hyphen> as the starting range point,
> it shall either come first in the bracket expression or be specified
> as a collating symbol; for example, "[][.-.]-0]", which matches either
> a <right-square-bracket> or any character or collating element that
> collates between <hyphen> and 0, inclusive."
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05>
>  
> In your correspondent's example, the hyphen is a starting range point
> but is neither first in the bracket expression nor is specified as a
> collating symbol, so the regexp doesn't conform to POSIX.
> 
> Even though it's not a bug I suppose it wouldn't hurt to make the GNU
> matchers compatible with 7th Edition Unix here, if someone really wants
> to take that task on; it's not urgent, though.
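For reference, the spellings that POSIX does bless can be checked with a
few lines against the standard <regex.h> interface. This is only a
sketch: the helper name try_match is made up for illustration, and it
assumes the default "C" locale. It reports -1 when regcomp() rejects a
pattern, so the same helper can be pointed at "[^0-9---]" to see what a
given matcher does with it.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, not part of any library.
   Returns  1 if PATTERN compiles and matches somewhere in TEXT,
            0 if it compiles but does not match,
           -1 if regcomp() rejects the pattern.  */
static int
try_match (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}

/* Spellings of "not a digit and not a hyphen" that conform to POSIX:
   the hyphen comes last or first in the bracket expression.  */
static const char *const conforming[] = { "[^0-9-]", "[^-0-9]" };
```

With either conforming pattern, try_match (pattern, "12-34") should give
0, since every character is a digit or a hyphen, and
try_match (pattern, "12a34") should give 1. What
try_match ("[^0-9---]", "12a34") gives depends on the matcher in use.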

I had some time yesterday, and feeling brave and a little stronger in
The Force than usual, I came up with the attached patch. It doesn't
break any of my tests.
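The idea behind the patch is a two-byte lookahead: when the bracket
tokenizer sees a minus, it checks whether the next two bytes are also
minus signs, and if so treats all three as one literal character.
Stripped of the regex internals, that decision can be modelled like
this. It is a sketch only; the names tok_type, tok and peek_minus are
invented for illustration and do not exist in regcomp.c.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, standing in for the regex internals.  */
enum tok_type { TOK_CHARACTER, TOK_RANGE };

struct tok
{
  enum tok_type type;
  char c;       /* the character the token stands for */
  size_t len;   /* how many input bytes the token consumes */
};

/* Classify the '-' at S[POS] inside a bracket expression.
   The caller must ensure that S[POS] is '-'.  */
static struct tok
peek_minus (const char *s, size_t pos)
{
  struct tok t;
  size_t n = strlen (s);

  t.c = '-';
  if (pos + 2 < n && s[pos + 1] == '-' && s[pos + 2] == '-')
    {
      /* "---": one literal minus, consuming all three bytes.  */
      t.type = TOK_CHARACTER;
      t.len = 3;
    }
  else
    {
      /* A lone '-': the usual range operator.  */
      t.type = TOK_RANGE;
      t.len = 1;
    }
  return t;
}
```

For the input "0-9---]", peek_minus at offset 1 yields TOK_RANGE with
length 1, and at offset 3 it yields TOK_CHARACTER with length 3. Unlike
the patch, this model checks the input length explicitly before looking
ahead.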

As far as my testing indicates, dfa.c doesn't need a patch; it already
seems to accept "---" inside brackets as a single minus.

If there are no objections, can we get this into Gnulib?

Thanks,

Arnold

diff --git a/support/regcomp.c b/support/regcomp.c
index b607c853..adfe28e2 100644
--- a/support/regcomp.c
+++ b/support/regcomp.c
@@ -2039,7 +2039,21 @@ peek_token_bracket (re_token_t *token, re_string_t *input, reg_syntax_t syntax)
   switch (c)
     {
     case '-':
-      token->type = OP_CHARSET_RANGE;
+      // Special case. V7 Unix grep and Unix awk and mawk allow
+      // [...---...] (3 minus signs in a bracket expression) to represent
+      // a single minus sign.  Let's try to support that without breaking
+      // anything else.
+      if (re_string_peek_byte (input, 1) == '-' && re_string_peek_byte (input, 2) == '-')
+	{
+	   // advance past the minus signs
+	   (void) re_string_fetch_byte (input);
+	   (void) re_string_fetch_byte (input);
+
+	   token->type = CHARACTER;
+	   token->opr.c = '-';
+	}
+      else
+	token->type = OP_CHARSET_RANGE;
       break;
     case ']':
       token->type = OP_CLOSE_BRACKET;
