Re: Dealing with character ranges in grep

Paolo Bonzini Thu, 09 Jun 2011 04:03:25 -0700

On 06/09/2011 11:58 AM, Bruno Haible wrote:

Paolo,

[=e=] to match "e" as well as accented versions like é, è and ê).
That is the one feature that you get with glibc, and that you would
sacrifice when building --with-included-regex.


I agree.  It's up to distros to choose, of course.


If you are on the point of sacrificing a glibc feature in many programs,
then IMO you should first talk with the glibc people to see what alternative
they can offer.

No, I'm not! It's not any different from now. Right now, some distros/people use --with-included-regex and get broken semantics + no equivalence classes; others use --without-included-regex and get another kind of broken semantics.

With my proposal, distros/people that use --with-included-regex would get understandable semantics + no equivalence classes; others will see no change.


I don't plan to change the default between the two.

It is probably futile to ask Ulrich Drepper to change how [a-z] is interpreted
by default.

I think it would be possible to discuss it civilly with Uli (not on Bugzilla though). Unfortunately, more glibc development now seems to be done by someone I shall not name who sports twice the arrogance and half the knowledge/talent.

But what would gnulib need so as to implement our "desired"
behaviour? As far as I understand, you want to keep the interpretation of
[=e=] in the POSIX + glibc way, but change the interpretation of [a-z]?

That's a different story. If we could implement [=e=] in gnulib code using glibc extensions, I would be all for that. But even right now, using gnulib's regex means sacrificing [=e=]. So that's a separate topic.

The only possibility is that with this change more distros may be using --with-included-regex. That's their choice, not ours.

Then, what do we need from glibc?
   - Do we need a RE_RANGES_IGNORE_LOCALES flag, like Arnold proposed?

No, that would be really really bad to have, for the reasons I mentioned in my original email.

   - Do we need an API that allows us to access the collation elements?
     (Or is strcoll and wcscoll sufficient?)

No, they're not, and I thought about designing such an API last year, but in the end decided that the locale behavior of regex is irremediably broken. For example, when you have a collation element, you can match it using ranges (e.g. [d-i] matches "ch" in Czech; "ch" collates after "h"), and even apply negation (e.g. [^c-h] matches "ch" too). However, there is no way to anchor your match to the beginning of the collation element. So "chci" matches both /[c-h]+ci/ and /[^c-h]+ci/: the first reads "c" and "h" as separate characters, the second reads "ch" as a single element. It is beyond repair, and [=e=] is the only part that can be salvaged.
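
To make the "chci" ambiguity concrete, here is a rough sketch in Python. It is only an illustration, not glibc's or gnulib's implementation: the ORDER list is a hypothetical, truncated Czech-like collation order, and in_range is a made-up helper. Python's own re module compares code points and knows nothing about collating elements, so it stands in for the "understandable" semantics.

```python
import re

# Hypothetical, truncated Czech-like collation order: "ch" is one
# collating element that sorts after "h" and before "i".
ORDER = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i"]
RANK = {elem: pos for pos, elem in enumerate(ORDER)}

def in_range(elem, lo, hi):
    """True if elem collates between lo and hi (inclusive) in ORDER."""
    return RANK[lo] <= RANK[elem] <= RANK[hi]

# Reading 1: "c" and "h" taken as separate characters; both are in [c-h],
# so /[c-h]+ci/ can consume "ch" and then match "ci".
separate_ok = in_range("c", "c", "h") and in_range("h", "c", "h")

# Reading 2: "ch" taken as one collating element; it lies in [d-i] but
# outside [c-h], so /[^c-h]+ci/ can consume "ch" and then match "ci".
element_in_d_i = in_range("ch", "d", "i")
element_in_c_h = in_range("ch", "c", "h")

# Code-point semantics (Python's re): only one reading exists.
codepoint_pos = re.fullmatch(r"[c-h]+ci", "chci") is not None
codepoint_neg = re.fullmatch(r"[^c-h]+ci", "chci") is not None

print(separate_ok, element_in_d_i, element_in_c_h)
print(codepoint_pos, codepoint_neg)
```

With code-point ranges the bracket expression and its negation cannot both match the same text, which is the property lost once collating elements are matched without an anchor.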


Paolo
