Jim Meyering wrote:
> Jim Meyering wrote:
>> Bruno Haible wrote:
>>> Paolo,
>>>
>>>> > [=e=] to match "e" as well as accented versions like é, è and ê).
>>>> > That is the one feature that you get with glibc, and that you would
>>>> > sacrifice when building --with-included-regex.
>>>>
>>>> I agree. It's up to distros to choose, of course.
>>>
>>> If you are on the point of sacrificing a glibc feature in many programs,
>>> then IMO you should first talk with the glibc people to see what alternative
>>> they can offer.
>>
>> People who build the tools currently have the choice of using
>> --with-included-regex or
>> --without-included-regex
>>
>> Note that putting equivalence classes (and backrefs) aside, the
>> interpretation of ranges is done in dfa.c, which means the vast
>> majority of range uses never even require use of regexp code.
>>
>> However, backreferences force these tools to skip the DFA-based
>> optimization and resort to running the regexp code. In that case,
>> there is a dichotomy. Adding a backreference to a range-including
>> regexp would have the surprising consequence of changing how that range
>> is interpreted when the tool is built to use glibc's regexp code.
>>
>> Thus, if we go this route, we are effectively saying
>> that people who want self-consistent regex-handling
>> in our tools must build with --with-included-regex or end
>> up causing subtle problems.
>>
>> That's a big leap.
>> I'm not saying I won't take upstream grep over the edge,
>> but I'd like to hear what a few distro-maintainers think.
>
> To clarify...
> I like Arnold's proposal to make regex range handling sane
> and locale-independent.
To be precise, this was proposed by Arnold Robbins and Karl Berry.

> It goes like this (at least for gawk, grep and sed):
>
> change how dfa.c interprets ranges like [a-z]
> change how gnulib's reg* code handles ranges
>
> Always use the included regex code (the one from gnulib),
> so that its interpretation is consistent with that of dfa.c.
>
> Grep's current upstream default is to build --without-included-regex,
> which makes grep use glibc's regex code.
>
> To make this proposed change go through, that configure-time option would
> have to be eliminated, so that we always build with the gnulib-provided
> regex code. Of course, if glibc ever changes, we can detect that and
> automatically prefer it when possible.

Considering a wider audience, an example will help illustrate
what we want to (or dare I say "will" ;-) change.
In some locales, the [A-Z] regexp currently matches 25 of
the lower case letters. For example,

    $ echo a| LC_ALL=cs_CZ grep '[A-Z]'
    a
    $ echo y| LC_ALL=cs_CZ grep '[A-Z]'
    y

That is obviously undesirable, and this proposal is to make those
commands always print nothing, regardless of which locale you use.
I.e., they'll act like this:

    $ echo y| LC_ALL=C grep '[A-Z]'
    $

I think few will object.

Run the following command to see the names of locales installed on
your system that make grep exhibit this surprising behavior:

    for i in $(locale -a);do echo b|LC_ALL=$i /bin/grep -q '[A-Z]' && echo $i; done

On Fedora 15, I see 62.
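
For anyone who wants to script the same check, here is a minimal
sketch of that test split into two small shell functions. It assumes
a POSIX shell with `grep` and `locale` on PATH. The function names
are hypothetical, made up for this illustration, and are not part of
grep or gnulib. The counts it reports depend entirely on which
locales are installed and on how your grep was built.

```shell
#!/bin/sh
# Hypothetical helper (not part of grep): succeeds if a lowercase "b"
# is matched by the range [A-Z] when grep runs under locale $1.
range_matches_lowercase() {
    printf 'b\n' | LC_ALL="$1" grep -q '[A-Z]'
}

# Hypothetical helper: print each installed locale that exhibits the
# surprising behavior, one per line. If `locale` is unavailable, the
# loop simply prints nothing.
list_surprising_locales() {
    locale -a 2>/dev/null | while IFS= read -r loc; do
        if range_matches_lowercase "$loc" 2>/dev/null; then
            printf '%s\n' "$loc"
        fi
    done
}

# Usage: list the affected locales, or count them with wc -l.
#   list_surprising_locales
#   list_surprising_locales | wc -l
```

Unlike the one-liner, this reads the locale names line by line, so a
name containing unusual characters is not subject to word splitting.
In the C locale the first function always fails, which is the behavior
the proposal would extend to every locale. In the meantime, the
[[:upper:]] character class is a portable way to ask for upper case
letters only.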