On 06/09/2011 11:58 AM, Bruno Haible wrote:
Paolo,

[=e=] to match "e" as well as accented versions like é, è and ê).
That is the one feature that you get with glibc, and that you would
sacrifice when building --with-included-regex.

I agree.  It's up to distros to choose, of course.

If you are on the point of sacrificing a glibc feature in many programs,
then IMO you should first talk with the glibc people to see what alternative
they can offer.

No, I'm not! It's not any different from now. Right now, some distros/people use --with-included-regex and get broken semantics + no equivalence classes; others use --without-included-regex and get another kind of broken semantics.

With my proposal, distros/people that use --with-included-regex would get understandable semantics + no equivalence classes; others will see no change.

I don't plan to change the default between the two.

It is probably futile to ask Ulrich Drepper to change how [a-z] is interpreted
by default.

I think it would be possible to discuss it civilly with Uli (not on Bugzilla though). Unfortunately, more glibc development now seems to be done by someone I shall not name who sports twice the arrogance and half the knowledge/talent.

But what would gnulib need so as to implement our "desired"
behaviour? As far as I understand, you want to keep the interpretation of
[=e=] in the POSIX + glibc way, but change the interpretation of [a-z]?

That's a different story. If we could implement [=e=] in gnulib code using glibc extensions, I would be all for that. But even right now, using gnulib's regex means sacrificing [=e=]. So that's a separate topic.

The only possibility is that with this change more distros may be using --with-included-regex. That's their choice, not ours.

Then, what do we need from glibc?
   - Do we need a RE_RANGES_IGNORE_LOCALES flag, like Arnold proposed?

No, that would be really really bad to have, for the reasons I mentioned in my original email.

   - Do we need an API that allows us to access the collation elements?
     (Or are strcoll and wcscoll sufficient?)

No, they're not. I thought about designing such an API last year, but in the end decided that the locale behavior of regex is irremediably broken. For example, when you have a collation element, you can match it using ranges (e.g. [d-i] matches "ch" in Czech; "ch" collates after "h"), and even apply negation (e.g. [^c-h] matches "ch" too). However, there is no way to anchor your match to the beginning of the collation element. So "chci" matches both /[c-h]+ci/ and /[^c-h]+ci/. It is beyond repair, and [=e=] is the only part that can be salvaged.

Paolo
