Re: Dealing with character ranges in grep

Jim Meyering Thu, 09 Jun 2011 02:34:33 -0700

Paolo Bonzini wrote:
> [making this public, there should be no reason not to]
>
> On 06/08/2011 10:14 PM, Aharon Robbins wrote:
>> Hi.  As we've discussed a little previously, I finally got tired of
>> trying to explain to users why the character range [a-z] was matching
>> most uppercase letters also.  ("I've found a bug in gawk! [a-z] matches
>> 'C'!"  "No - it's a POSIX locale issue".)  This had to be the most F of
>> the FAQs.
>>
>> So, for the upcoming gawk 4.0, I decided (as Karl put it) to cut the
>> Gordian knot and make ranges behave like the C locale, the way it's long
>> been documented, and as most people expect.  Those who want the POSIX
>> behavior can still get it using --posix.
>>
>> So I went back and just made the fix in the dfa and regex code, by
>> introducing a new syntax bit, RE_RANGES_IGNORE_LOCALES, which is turned
>> on in RE_SYNTAX_GNU_AWK and added to RE_SYNTAX_AWK for gawk's --traditional
>> option.  (This turned out to be easier than I'd feared it would be.)
>
> Actually, it should be even easier, for two reasons.
>
> First reason: using wcscoll is quite broken, even more so than
> collation equivalent ordering.  Besides, we should be in control of
> the non-_LIBC cases, so we should submit a patch to glibc (or patch
> gnulib locally) that makes your RE_RANGES_IGNORE_LOCALES the sole
> possibility when _LIBC is not defined.
>
> Second reason: nowadays, dfa.c always punts on parsing of multibyte
> bracketed expressions and defers to regex.  The code that handles
> mbcsets is there in case someone is using dfaexec with a NULL backref
> argument, but we might as well remove it and I wouldn't complain at
> all. So, dfa.c also does not need any special casing of
> RE_RANGES_IGNORE_LOCALES.  Instead, hard_LC_COLLATE should be removed
> as a premature optimization.
>
> The important point is to realize that you cannot fix the whole
> problem: --without-included-regex will forever yield glibc's CEO (you
> cannot help that, and if distros choose to use it you will still get
> bogus bug reports), while the default choice of --with-included-regex
> will give wchar_t ordering.  The above solution takes this into
> account, and within this constraint it provides a much cleaner result:
>
> 1) no need for POSIXLY_CORRECT (which would be an abuse of
> POSIXLY_CORRECT actually... where's POSIX_ME_HARDER when you need it?
> ;)
>
> 2) as a result of (1), no need to say anything in the documentation
> (and, anything you say would likely be incorrect in the
> --without-included-regex case);
>
> 3) no need for extra flags and changes to the regex clients;
>
> 4) no need to care about consistency between dfa.c and regex definitions;
>
> 5) instant applicability of the solution to all GNU packages just by
> upgrading gnulib or importing a new version of regex.


I like the idea.
However, a potential sticking point is equivalence classes (e.g., using
[=e=] to match "e" as well as accented versions like é, è and ê).
That is the one feature that you get with glibc and that you would
sacrifice when building --with-included-regex.
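
To make the two behaviors in this thread concrete, here is a minimal
Python sketch. It is not code from gawk, grep, gnulib or glibc. The
collation sequence "aAbB...zZ" is an assumed, simplified stand-in for a
real locale's table, and the helper names are hypothetical. The last
helper only approximates an equivalence class by comparing Unicode base
characters, which is not how glibc defines them.

```python
import string
import unicodedata

# Assumed toy collation sequence: aAbBcC...zZ (real locale tables differ).
COLLATION_ORDER = "".join(
    lo + up for lo, up in zip(string.ascii_lowercase, string.ascii_uppercase)
)


def in_range_codepoint(ch, lo, hi):
    """C-locale style range test: compare raw code points."""
    return ord(lo) <= ord(ch) <= ord(hi)


def in_range_collation(ch, lo, hi):
    """Locale style range test: compare positions in the collation sequence.

    Only handles ASCII letters, since that is all the toy table contains.
    """
    pos = COLLATION_ORDER.index
    return pos(lo) <= pos(ch) <= pos(hi)


def uppercase_matched(in_range):
    """Return the uppercase ASCII letters that fall inside [a-z]."""
    return "".join(c for c in string.ascii_uppercase if in_range(c, "a", "z"))


def same_base_letter(a, b):
    """Rough stand-in for an equivalence class such as [=e=].

    Decomposes both characters (NFD) and compares the leading base
    character, so accented forms of a letter compare equal to it.
    """
    base = lambda c: unicodedata.normalize("NFD", c)[0]
    return base(a) == base(b)


if __name__ == "__main__":
    print("code point order:", repr(uppercase_matched(in_range_codepoint)))
    print("collation order: ", repr(uppercase_matched(in_range_collation)))
    print("e vs e-acute:    ", same_base_letter("e", "\u00e9"))
```

Under the toy collation order, [a-z] picks up every uppercase letter
except "Z", which sorts after "z"; under code point order it picks up
none of them.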
