Paolo Bonzini wrote:
> [making this public, there should be no reason not to]
>
> On 06/08/2011 10:14 PM, Aharon Robbins wrote:
>> Hi. As we've discussed a little previously, I finally got tired of
>> trying to explain to users why the character range [a-z] was matching
>> most uppercase letters also. ("I've found a bug in gawk! [a-z] matches 'C'!"
>> "No - it's a POSIX locale issue".) This had to be the most F of the FAQs.
>>
>> So, for the upcoming gawk 4.0, I decided (as Karl put it) to cut the
>> Gordian knot and make ranges behave like the C locale, the way it's long
>> been documented, and as most people expect. Those who want the POSIX
>> behavior can still get it using --posix.
>>
>> So I went back and just made the fix in the dfa and regex code, by
>> introducing a new syntax bit, RE_RANGES_IGNORE_LOCALES, which is turned
>> on in RE_SYNTAX_GNU_AWK and added to RE_SYNTAX_AWK for gawk's --traditional
>> option. (This turned out to be easier than I'd feared it would be.)
>
> Actually, it should be even easier, for two reasons.
>
> First reason: using wcscoll is quite broken, even more so than
> collation equivalent ordering. Besides, we should be in control of
> the non-_LIBC cases, so we should submit a patch to glibc (or patch
> gnulib locally) that makes your RE_RANGES_IGNORE_LOCALES the sole
> possibility when _LIBC is not defined.
>
> Second reason: nowadays, dfa.c always punts on parsing of multibyte
> bracketed expressions and defers to regex. The code that handles
> mbcsets is there in case someone is using dfaexec with a NULL backref
> argument, but we might as well remove it and I wouldn't complain at
> all. So, dfa.c also does not need any special casing of
> RE_RANGES_IGNORE_LOCALES. Instead, hard_LC_COLLATE should be removed
> as a premature optimization.
>
> The important point is to realize that you cannot fix the whole
> problem: --without-included-regex will forever yield glibc's CEO (you
> cannot help that, and if distros choose to use it you will still get
> bogus bug reports), while the default choice of --with-included-regex
> will give wchar_t ordering. The above solution takes this into
> account, and within this constraint it provides a much cleaner result:
>
> 1) no need for POSIXLY_CORRECT (which would be an abuse of
> POSIXLY_CORRECT actually... where's POSIX_ME_HARDER when you need it?
> ;)
>
> 2) as a result of (1), no need to say anything in the documentation
> (and, anything you say would likely be incorrect in the
> --without-included-regex case);
>
> 3) no need for extra flags and changes to the regex clients;
>
> 4) no need to care about consistency between dfa.c and regex definitions;
>
> 5) instant applicability of the solution to all GNU packages just by
> upgrading gnulib or importing a new version of regex.

I like the idea. However, a potential sticking point is equivalence
classes (e.g., using [=e=] to match "e" as well as accented versions
like é, è and ê). That is the one feature that you get with glibc and
would sacrifice when building --with-included-regex.
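
For anyone who wants to check how ranges and equivalence classes behave
with the regex implementation they happen to be linked against, here is
a minimal sketch using only the POSIX regcomp/regexec interface. The
helper name matches() is made up for illustration; it is not part of
gawk, dfa.c or gnulib, and it does not use RE_RANGES_IGNORE_LOCALES.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only.
   Return 1 if TEXT matches the POSIX extended regex PATTERN,
   0 if it does not, and -1 if PATTERN fails to compile.
   The result depends on the locale in effect when it is called.  */
static int
matches (const char *pattern, const char *text)
{
  regex_t re;
  int rc;

  assert (pattern != NULL && text != NULL);

  /* REG_NOSUB: we only care about match/no match, not offsets.  */
  if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
    return -1;

  rc = regexec (&re, text, 0, NULL, 0);
  regfree (&re);
  return rc == 0;
}
```

A program that never calls setlocale() runs in the C locale, so
matches ("^[a-z]$", "C") should return 0 there. After
setlocale (LC_ALL, "") in something like en_US.UTF-8, the result of that
same call, and of matches ("^[[=e=]]$", "é"), depends on which regex
implementation and locale data are in use, which is the trade-off
described above.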