proposal: make [A-Z] range handling locale-independent

Jim Meyering Thu, 16 Jun 2011 02:57:39 -0700

Jim Meyering wrote:
> Jim Meyering wrote:
>> Bruno Haible wrote:
>>> Paolo,
>>>
>>>> > [=e=] to match "e" as well as accented versions like é, è and ê).
>>>> > That is the one feature that you get with glibc, and that you would
>>>> > sacrifice when building --with-included-regex.
>>>>
>>>> I agree.  It's up to distros to choose, of course.
>>>
>>> If you are on the point of sacrificing a glibc feature in many programs,
>>> then IMO you should first talk with the glibc people to see what alternative
>>> they can offer.
>>
>> People who build the tools currently have the choice of using
>> --with-included-regex or
>> --without-included-regex
>>
>> Note that putting equivalence classes (and backrefs) aside, the
>> interpretation of ranges is done in dfa.c, which means the vast
>> majority of range uses never even require use of regexp code.
>>
>> However, backreferences force these tools to skip the DFA-based
>> optimization and resort to running the regexp code.  In that case,
>> there is a dichotomy.  Adding a backreference to a range-including
>> regexp would have the surprising consequence of changing how that range
>> is interpreted when the tool is built to use glibc's regexp code.
>>
>> Thus, if we go this route, we are effectively saying
>> that people who want self-consistent regex-handling
>> in our tools must build with --with-included-regex or end
>> up causing subtle problems.
>>
>> That's a big leap.
>> I'm not saying I won't take upstream grep over the edge,
>> but I'd like to hear what a few distro-maintainers think.
>
> To clarify...
> I like Arnold's proposal to make regex range handling sane
> and locale-independent.


To be precise, this was proposed by Arnold Robbins and Karl Berry.

> It goes like this (at least for gawk, grep and sed):
>
>   change how dfa.c interprets ranges like [a-z]
>   change how gnulib's reg* code handles ranges
>
> Always use the included regex code (the one from gnulib),
> so that its interpretation is consistent with that of dfa.c.
>
> Grep's current upstream default is to build --without-included-regex,
> which makes grep use glibc's regex code.
>
> To make this proposed change go through, that configure-time option would
> have to be eliminated, so that we always build with the gnulib-provided
> regex code.  Of course, if glibc ever changes, we can detect that and
> automatically prefer it when possible.
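
The dichotomy described above (a backreference forcing a tool off the DFA
path and onto the regex code) can be probed with a sketch like the one below.
The helper names are hypothetical and not part of grep. It assumes that
'\([A-Z]\)\1*' accepts the same lines as '[A-Z]', since "\1*" may match zero
times, so any difference in the result would come from how the range itself
is interpreted.

```shell
# Hypothetical helpers, for illustration only.
# usage: matches LOCALE PATTERN CHAR
matches() {
    printf '%s\n' "$3" | LC_ALL="$1" grep -q "$2"
}

# usage: range_paths_agree LOCALE CHAR
# Succeeds when the plain range and the range-plus-backreference pattern
# give the same answer for CHAR in LOCALE.
range_paths_agree() {
    if matches "$1" '[A-Z]' "$2"; then plain=yes; else plain=no; fi
    if matches "$1" '\([A-Z]\)\1*' "$2"; then backref=yes; else backref=no; fi
    [ "$plain" = "$backref" ]
}

if range_paths_agree C b; then echo consistent; else echo inconsistent; fi
```

In the C locale this prints "consistent". Whether some other locale yields
"inconsistent" depends on how the grep being tested was built.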

Considering a wider audience, an example will help illustrate
what we want to (or dare I say "will" ;-) change.

In some locales, the [A-Z] regexp currently matches 25 of the
lower case letters.  For example,

    $ echo a| LC_ALL=cs_CZ grep '[A-Z]'
    a
    $ echo y| LC_ALL=cs_CZ grep '[A-Z]'
    y

That is obviously undesirable, and this proposal is to make those
commands always print nothing, regardless of which locale you use.
I.e., they'll act like this:

    $ echo y| LC_ALL=C grep '[A-Z]'
    $

I think few will object.
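
Until something like this lands, a script that needs the C-locale behavior
can force it for a single grep invocation. A minimal sketch, using a
hypothetical helper name that is not part of grep:

```shell
# Hypothetical helper: succeed only if the argument contains an ASCII
# upper-case letter, regardless of the caller's locale settings.
has_ascii_upper() {
    printf '%s\n' "$1" | LC_ALL=C grep -q '[A-Z]'
}

has_ascii_upper y && echo "y matched" || echo "y did not match"
has_ascii_upper Y && echo "Y matched" || echo "Y did not match"
```

This reports that "y" did not match and that "Y" matched, because LC_ALL=C
is set only for that one grep process.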

Run the following command to see the names of locales installed on
your system that make grep exhibit this surprising behavior:

    for i in $(locale -a); do echo b | LC_ALL=$i /bin/grep -q '[A-Z]' && echo $i; done

On Fedora 15, I see 62.
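
To get just the count rather than the list of names, the same loop can be
wrapped as below. This is only a sketch: the function name and the GREP
override variable are my own inventions, and the number it prints will vary
with the installed locales and the grep being tested.

```shell
# Count the installed locales in which '[A-Z]' matches a lower-case "b".
# GREP is a hypothetical override; it defaults to the first grep in PATH.
count_affected_locales() {
    n=0
    for loc in $(locale -a 2>/dev/null); do
        if echo b | LC_ALL="$loc" "${GREP:-grep}" -q '[A-Z]' 2>/dev/null; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

count_affected_locales
```

Setting GREP=/bin/grep reproduces the one-liner above; a result of 0 means
no installed locale shows the surprising behavior with that grep.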
