[Making this public; there should be no reason not to.]
On 06/08/2011 10:14 PM, Aharon Robbins wrote:
> Hi. As we've discussed a little previously, I finally got tired of
> trying to explain to users why the character range [a-z] was matching
> most uppercase letters also. ("I've found a bug in gawk! [a-z] matches 'C' !"
> "No - it's a POSIX locale issue".) This had to be the most F of the FAQs.
> So, for the upcoming gawk 4.0, I decided (as Karl put it) to cut the
> Gordian knot and make ranges behave like the C locale, the way it's long
> been documented, and as most people expect. Those who want the POSIX
> behavior can still get it using --posix.
> So I went back and just made the fix in the dfa and regex code, by
> introducing a new syntax bit, RE_RANGES_IGNORE_LOCALES, which is turned
> on in RE_SYNTAX_GNU_AWK and added to RE_SYNTAX_AWK for gawk's --traditional
> option. (This turned out to be easier than I'd feared it would be.)
Actually, it should be even easier, for two reasons.
First reason: using wcscoll is quite broken, even more so than collation
equivalent ordering. Besides, we should be in control of the non-_LIBC
cases, so we should submit a patch to glibc (or patch gnulib locally)
that makes the behavior of your RE_RANGES_IGNORE_LOCALES the only one
available when _LIBC is not defined.
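To make the two orderings concrete, here is a minimal sketch (not the actual regex code; both function names are made up for illustration) of a range test done by raw wchar_t value versus one done through wcscoll. In a locale whose collation interleaves upper and lower case, the second variant is the one that can report 'C' as lying inside [a-z]; the exact result depends on the C library's locale data.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper: range test by raw wchar_t value.
   This is the "behave like the C locale" ordering.  */
static bool
range_match_codepoint (wchar_t lo, wchar_t hi, wchar_t c)
{
  return lo <= c && c <= hi;
}

/* Hypothetical helper: range test through the current locale's
   collation order, comparing one-character strings with wcscoll.
   The answer depends on the LC_COLLATE setting in effect.  */
static bool
range_match_collate (wchar_t lo, wchar_t hi, wchar_t c)
{
  wchar_t l[2] = { lo, L'\0' };
  wchar_t h[2] = { hi, L'\0' };
  wchar_t s[2] = { c, L'\0' };

  return wcscoll (l, s) <= 0 && wcscoll (s, h) <= 0;
}
```

With these definitions, range_match_codepoint (L'a', L'z', L'C') is always false, because 'C' has a smaller value than 'a'. What range_match_collate returns for the same arguments is up to whatever locale was selected with setlocale.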
Second reason: nowadays, dfa.c always punts on parsing multibyte
bracket expressions and defers to regex. The code that handles
mbcsets is only there in case someone calls dfaexec with a NULL backref
argument; we might as well remove it, and I wouldn't complain at all.
So dfa.c does not need any special casing of RE_RANGES_IGNORE_LOCALES
either. Instead, hard_LC_COLLATE should be removed as a premature
optimization.
The important point is to realize that you cannot fix the whole problem:
--without-included-regex will forever yield glibc's CEO (you cannot help
that, and if distros choose to use it you will still get bogus bug
reports), while the default choice of --with-included-regex will give
wchar_t ordering. The above solution takes this into account, and
within this constraint it provides a much cleaner result:
1) no need for POSIXLY_CORRECT (which would be an abuse of
POSIXLY_CORRECT actually... where's POSIX_ME_HARDER when you need it? ;)
2) as a result of (1), no need to say anything in the documentation
(and, anything you say would likely be incorrect in the
--without-included-regex case);
3) no need for extra flags and changes to the regex clients;
4) no need to care about consistency between dfa.c and regex definitions;
5) instant applicability of the solution to all GNU packages just by
upgrading gnulib or importing a new version of regex.
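As a rough illustration of why no runtime flag is needed under this plan, the choice of ordering could be made entirely at compile time. This is only a sketch under that assumption: range_contains is a hypothetical name, and the wcscoll call merely stands in for whatever collation lookup glibc really performs.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper, not taken from regex: the ordering used for a
   bracket range is fixed when the file is compiled, so clients pass
   no extra syntax bit.  */
static bool
range_contains (wchar_t lo, wchar_t hi, wchar_t c)
{
#ifdef _LIBC
  /* Built as part of glibc: keep locale-dependent ordering.
     wcscoll is used here only as a stand-in for the real lookup.  */
  wchar_t l[2] = { lo, L'\0' };
  wchar_t h[2] = { hi, L'\0' };
  wchar_t s[2] = { c, L'\0' };

  return wcscoll (l, s) <= 0 && wcscoll (s, h) <= 0;
#else
  /* Built outside glibc (for example from gnulib): plain wchar_t
     ordering is the only behavior.  */
  return lo <= c && c <= hi;
#endif
}
```

Built without _LIBC, range_contains (L'a', L'z', L'C') is false in every locale, which is the behavior the --with-included-regex default would give.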
So, unlike before, you sold me on this, *provided the above plan is
implemented*. The difference is that this approach, I think, does not
cause more headaches than it solves. Hopefully it will not cause any
new headaches either, assuming we can reasonably synchronize the
releases of gawk, grep and sed!
Would it be too much to ask to hold gawk 4.0 until the above plan is
realized? It's strictly about gawk/grep/gnulib; no need to involve
glibc from the beginning. Even better, would anyone help with the work
while I'm on vacation (from Saturday till the 26th of June)?
Paolo