Greetings. Way back in May of 2015, Nelson Beebe submitted the following bug report for gawk:

> Date: Mon, 25 May 2015 14:21:04 -0600 (MDT)
> From: "Nelson H. F. Beebe" <be...@math.utah.edu>
> To: "Arnold Robbins" <arn...@skeeve.com>
> Cc: be...@math.utah.edu
> Subject: gawk-4.1.3 regexp error
>
> I just ran an old (1996--date) awk program with gawk-4.1.3 and got an
> error that can be exhibited like this:
>
> % gawk '/[^0-9---]/ {print}'
> gawk: cmd. line:1: error: tent of \{\}: /[^0-9---]/
>
> As far as I can see, that is a perfectly valid range expression, and
> using three hyphens to represent one hyphen is the traditional way
> to incorporate a hyphen in the expression.

The upshot was that regex didn't support this, and I didn't (at the time)
want to tackle trying to fix it. (I did fix the error message, at least.)

I submitted a bug report about it. At the time, Paul Eggert said the
following:

> Date: Mon, 25 May 2015 23:53:31 -0700
> From: Paul Eggert <egg...@cs.ucla.edu>
> To: arn...@skeeve.com, 20...@debbugs.gnu.org
> Subject: Re: bug#20657: Traditional range expression not accepted in regex/dfa
>
> arn...@skeeve.com wrote:
>
> > The bugaboo here is the "---"; it's
> > a range expression consisting of minus through minus, and apparently long
> > ago was how one got a minus into a bracket expression.
>
> Actually, long ago expressions like '[^0-9-]' worked just as they do now,
> and it wasn't ever necessary to use trailing "---". That being said,
> it is true that in 7th Edition Unix '[^0-9---]' meant the same thing as
> '[^0-9-]', so in that sense we have an incompatibility with 7th Edition
> Unix here.
>
> > $ ./src/grep '[^0-9---]' /dev/null
> > ./src/grep: Invalid range end
> >
> > The underlying regex and, I believe, dfa routines don't accept this.
>
> Yes, that's correct. It's not a bug, though, as the regexp is ambiguous
> and does not conform to POSIX, which says the following about RE
> bracket expressions: "To use a <hyphen> as the starting range point,
> it shall either come first in the bracket expression or be specified
> as a collating symbol; for example, "[][.-.]-0]", which matches either
> a <right-square-bracket> or any character or collating element that
> collates between <hyphen> and 0, inclusive."
> <http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05>
>
> In your correspondent's example, the hyphen is a starting range point
> but is neither first in the bracket expression nor is specified as a
> collating symbol, so the regexp doesn't conform to POSIX.
>
> Even though it's not a bug I suppose it wouldn't hurt to make the GNU
> matchers compatible with 7th Edition Unix here, if someone really wants
> to take that task on; it's not urgent, though.

I had some time yesterday, and feeling brave and a little stronger in
The Force than usual, I came up with the attached patch. It doesn't
break any of my tests.

As far as my testing indicates, dfa.c doesn't need a patch; it seems
to accept "---" inside brackets for a single minus.

If there are no objections, can we get this into Gnulib?

Thanks,

Arnold

diff --git a/support/regcomp.c b/support/regcomp.c
index b607c853..adfe28e2 100644
--- a/support/regcomp.c
+++ b/support/regcomp.c
@@ -2039,7 +2039,21 @@ peek_token_bracket (re_token_t *token, re_string_t *input, reg_syntax_t syntax)
   switch (c)
     {
     case '-':
-      token->type = OP_CHARSET_RANGE;
+      // Special case. V7 Unix grep and Unix awk and mawk allow
+      // [...---...] (3 minus signs in a bracket expression) to represent
+      // a single minus sign. Let's try to support that without breaking
+      // anything else.
+      if (re_string_peek_byte (input, 1) == '-' && re_string_peek_byte (input, 2) == '-')
+        {
+          // advance past the minus signs
+          (void) re_string_fetch_byte (input);
+          (void) re_string_fetch_byte (input);
+
+          token->type = CHARACTER;
+          token->opr.c = '-';
+        }
+      else
+        token->type = OP_CHARSET_RANGE;
       break;
     case ']':
       token->type = OP_CLOSE_BRACKET;
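
For anyone who wants to poke at the idea without building regcomp.c, here is a minimal standalone sketch of the same lookahead decision. It is not the gnulib code: the names toy_token, TOY_CHARACTER, TOY_RANGE_OP and toy_peek_bracket_token are made up for illustration, and it assumes the bracket contents are handed over as a plain NUL-terminated C string. It also ignores the cases that the real parser handles elsewhere (a hyphen first or last in the bracket expression, collating symbols, character classes).

```c
#include <stddef.h>

/* Hypothetical token kinds, loosely modeled on CHARACTER and
   OP_CHARSET_RANGE in the patch; these are not the regcomp.c types. */
enum toy_token_type { TOY_CHARACTER, TOY_RANGE_OP };

struct toy_token {
    enum toy_token_type type;
    char c;          /* the literal character when type == TOY_CHARACTER */
    size_t consumed; /* number of input bytes covered by this token */
};

/* Classify the token that starts at s[pos] inside a bracket expression.
   Assumes s is NUL-terminated and pos is less than strlen (s).  */
static struct toy_token
toy_peek_bracket_token (const char *s, size_t pos)
{
    struct toy_token tok;

    /* Default: an ordinary one-byte literal.  */
    tok.type = TOY_CHARACTER;
    tok.c = s[pos];
    tok.consumed = 1;

    if (s[pos] == '-') {
        /* The && short-circuits, so s[pos + 2] is read only when
           s[pos + 1] is '-', which keeps both reads at or before the
           terminating NUL.  */
        if (s[pos + 1] == '-' && s[pos + 2] == '-') {
            /* V7-style "---": one literal minus covering three bytes.  */
            tok.consumed = 3;
        } else {
            /* Any other hyphen is treated as the range operator.  */
            tok.type = TOY_RANGE_OP;
        }
    }
    return tok;
}
```

With the bracket contents "0-9---", calling the function at offset 1 yields the range operator, while calling it at offset 3 yields a single literal '-' that covers all three hyphens. That mirrors what the patch does by fetching the two extra bytes before returning a CHARACTER token.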