On 06/09/2011 11:58 AM, Bruno Haible wrote:
> Paolo,
> [=e=] to match "e" as well as accented versions like é, è and ê).
> That is the one feature that you get with glibc, and that you would
> sacrifice when building --with-included-regex.
I agree. It's up to distros to choose, of course.
> If you are on the point of sacrificing a glibc feature in many programs,
> then IMO you should first talk with the glibc people to see what alternative
> they can offer.
No, I'm not! It's not any different from now. Right now, some
distros/people use --with-included-regex and get broken semantics + no
equivalence classes; others use --without-included-regex and get another
kind of broken semantics.
With my proposal, distros/people that use --with-included-regex would
get understandable semantics + no equivalence classes; others would see
no change.
I don't plan to change the default between the two.
> It is probably futile to ask Ulrich Drepper to change how [a-z] is interpreted
> by default.
I think it would be possible to discuss it civilly with Uli (not on
Bugzilla though). Unfortunately, more glibc development now seems to be
done by someone I shall not name who sports twice the arrogance and half
the knowledge/talent.
> But what would gnulib need so as to implement our "desired"
> behaviour? As far as I understand, you want to keep the interpretation of
> [=e=] in the POSIX + glibc way, but change the interpretation of [a-z]?
That's a different story. If we could implement [=e=] in gnulib code
using glibc extensions, I would be all for that. But even right now,
using gnulib's regex means sacrificing [=e=]. So that's a separate topic.
The only possibility is that with this change more distros may be using
--with-included-regex. That's their choice, not ours.
> Then, what do we need from glibc?
> - Do we need a RE_RANGES_IGNORE_LOCALES flag, like Arnold proposed?
No, that would be really really bad to have, for the reasons I mentioned
in my original email.
> - Do we need an API that allows us to access the collation elements?
> (Or is strcoll and wcscoll sufficient?)
No, they're not. I thought about designing such an API last year,
but in the end decided that the locale behavior of regex is irremediably
broken. For example, when you have a collation element, you can match
it using ranges (e.g. [d-i] matches "ch" in Czech; "ch" collates after
"h"), and even apply negation (e.g. [^c-h] matches "ch" too). However,
there is no way to anchor your match to the beginning of the collation
element. So "chci" matches both /[c-h]+ci/ and /[^c-h]+ci/. It is
beyond repair, and [=e=] is the only part that can be salvaged.
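
To make the "chci" example concrete, here is a toy model in Python. It is
not glibc's or gnulib's matcher; the collation table, the helper names and
the matching strategy are all assumptions made up for illustration. It only
shows that, once "ch" is a collating element sorting after "h", nothing
forces the matcher to read the leading "ch" as one element rather than as
"c" followed by "h", so both patterns succeed.

```python
# Hypothetical Czech-like collation order: the digraph "ch" is one collating
# element that sorts right after "h". Only the letters needed here are listed.
ORDER = ["a", "b", "c", "d", "e", "f", "g", "h", "ch", "i", "j"]
RANK = {elem: i for i, elem in enumerate(ORDER)}


def in_range(elem, lo, hi, negate=False):
    """True if collating element `elem` is inside [lo-hi] (outside, if negated)."""
    inside = RANK[lo] <= RANK[elem] <= RANK[hi]
    return inside != negate


def elements_at(text, pos):
    """Collating elements that may start at `pos`: one character, or "ch"."""
    candidates = [text[pos]]
    if text.startswith("ch", pos):
        candidates.append("ch")
    return candidates


def matches(text, lo, hi, negate, suffix):
    """Does all of `text` match /[lo-hi]+suffix/ (or /[^lo-hi]+suffix/)?"""
    def go(pos, count):
        # At least one bracket match consumed and the rest is the literal suffix.
        if count >= 1 and text[pos:] == suffix:
            return True
        if pos >= len(text):
            return False
        # Try every way of reading the next collating element.
        return any(
            in_range(e, lo, hi, negate) and go(pos + len(e), count + 1)
            for e in elements_at(text, pos)
        )
    return go(0, 0)


print(matches("chci", "c", "h", False, "ci"))  # /[c-h]+ci/  reads "c", "h", then "ci"
print(matches("chci", "c", "h", True, "ci"))   # /[^c-h]+ci/ reads "ch", then "ci"
```

In this model both calls return True: the first consumes "c" and "h" as
separate elements, the second consumes "ch" as a single element that lies
outside [c-h].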
Paolo