On 06/09/2011 01:53 PM, Bruno Haible wrote:
Paolo,
My proposal wouldn't change defaults, which is why I believe that this
is a separate topic.
But at the same time you are pushing for the use of --with-included-regex.
We found out that by doing this, the equivalence classes feature gets lost,
and the divergence between glibc and gnulib becomes greater.
I'm not pushing for that, though probably Karl and Aharon would be. :)
It can also make --with-included-regex the default on its own.
That would be bad IMO. It doesn't matter whether we split the process
in 1, 5 or 10 steps; what matters is that at each point all maintainers
agree on *not* changing the default for --with-included-regex. That's
why I want a good solution done fast, then we can perfect that solution
later.
We'd need glibc to export two functions in both multi-byte and
wide-character versions:
1) streqcoll(S1, S2) and wcseqcoll(S1, S2) would be the same as strcoll
and wcscoll, but they would compare only according to primary weights.
A slightly more formal definition is that streqcoll(S1, S2) == 0 iff S1
matches the \`[=C1=][=C2=][=C3=]...[=Cn=]\' regular expression, where Ci
are the characters of S2 (I'd need to double check this against POSIX
though). When non-zero, the result of streqcoll(S1, S2) would be the
same as strcoll(S1, S2). Likewise, glibc could provide streqxfrm and
wcseqxfrm, with the definition that strcmp(streqxfrm(S1), streqxfrm(S2))
== streqcoll(S1, S2).
2) On top of this, [.ss.] could be implemented using an additional
function mbelemlen(S) giving the length of the first collation element
in S. [.S1.] would be rejected unless mbelemlen(S1) == strlen(S1), and
[.S1.] would match S2 if strcoll(S1, strndup(S2, mbelemlen(S2))) == 0.
wcelemlen could be provided likewise.
These are the minimal extensions that would be required to support full
regular expression features portably and in a manner that is compatible
with glibc, except for ranges.
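To make the intended semantics of point 1 concrete, here is a toy model of a primary-weight comparison. It is only a sketch: toy_primary_weight and toy_streqcoll are hypothetical names, and the weight table (ASCII letters differing only in case share a weight) is an assumption chosen for illustration. A real implementation would take the weights from the locale's LC_COLLATE data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical primary weight: ASCII letters that differ only in case
   share a weight.  A real implementation would read the weights from
   the locale's LC_COLLATE data instead.  */
int
toy_primary_weight (unsigned char c)
{
  return (c >= 'A' && c <= 'Z') ? c - 'A' + 'a' : c;
}

/* Toy stand-in for the proposed streqcoll(): compare two strings by
   primary weight only.  Returns <0, 0 or >0, like strcmp().  */
int
toy_streqcoll (const char *s1, const char *s2)
{
  assert (s1 != NULL && s2 != NULL);
  for (;; s1++, s2++)
    {
      int w1 = toy_primary_weight ((unsigned char) *s1);
      int w2 = toy_primary_weight ((unsigned char) *s2);
      if (w1 != w2)
        return w1 < w2 ? -1 : 1;
      if (w1 == 0)          /* both strings ended at the same position */
        return 0;
    }
}
```

In this toy, "Resume" and "resume" compare equal in either argument order, which is the symmetry one expects from an equivalence relation.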
Great! These look like a good basis for discussing with the glibc people.
Ad 1): Is streqcoll symmetric? That is, is streqcoll(S1, S2) the same as
streqcoll(S2, S1)?
Yes. It's really an equivalence relation, in the mathematical sense.
Here are definitions as they might appear in POSIX (I'm omitting
wcseqcoll and wcseqxfrm).
===========================================================================
NAME
streqcoll - string comparison using equivalence class collation
SYNOPSIS
#define _GNU_SOURCE
#include <string.h>
int streqcoll(const char *s1, const char *s2);
DESCRIPTION
The streqcoll() function shall compare the string pointed to by s1 to
the string pointed to by s2, both interpreted as appropriate to the
LC_COLLATE category of the current locale. Each character is only
compared according to its primary equivalence class, as described under
"Collation Order" in the POSIX standard.
The streqcoll() function shall not change the setting of errno if
successful.
Since no return value is reserved to indicate an error, an application
wishing to check for error situations should set errno to 0, then call
streqcoll(), then check errno.
RETURN VALUE
Upon successful completion, streqcoll() shall return an integer greater
than, equal to, or less than 0, according to whether the string pointed
to by s1 is greater than, equal to, or less than the string pointed to
by s2 when both are interpreted as appropriate to the current locale and
according to the above description. On error, streqcoll() may set errno,
but no return value is reserved to indicate an error.
ERRORS
The streqcoll() function may fail if:
[EINVAL]
The s1 or s2 arguments contain characters outside the domain of
the collating sequence.
NOTES
If streqcoll(s1, s2) < 0, then strcoll(s1, s2) < 0. Likewise,
if streqcoll(s1, s2) > 0, then strcoll(s1, s2) > 0.
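The errno convention described above can be sketched as follows. Since streqcoll() is only a proposal, the existing strcoll() stands in for it here; checked_coll is a hypothetical helper name, and the calling pattern is assumed to carry over unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Compare s1 and s2 with strcoll(), reporting errors through *err.
   strcoll() stands in for the proposed streqcoll(), which does not
   exist yet; the calling pattern would be the same.  */
int
checked_coll (const char *s1, const char *s2, int *err)
{
  assert (s1 != NULL && s2 != NULL && err != NULL);
  int saved = errno;
  errno = 0;                  /* clear errno so that any error is visible */
  int result = strcoll (s1, s2);
  *err = errno;               /* nonzero only if the call reported an error */
  errno = saved;              /* do not disturb the caller's errno */
  return result;
}
```

The return value is meaningful only when *err is 0 afterwards.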
===========================================================================
NAME
streqxfrm - string transformation using equivalence class collation
SYNOPSIS
#define _GNU_SOURCE
#include <string.h>
size_t streqxfrm(char *restrict s1, const char *restrict s2, size_t n);
DESCRIPTION
The streqxfrm() function shall transform the string pointed to by s2 and
place the resulting string into the array pointed to by s1. The
transformation is such that if strcmp() is applied to two transformed
strings, it shall return a value greater than, equal to, or less than 0,
corresponding to the result of streqcoll() applied to the same two
original strings. No more than n bytes are placed into the resulting
array pointed to by s1, including the terminating null byte. If n is 0,
s1 is permitted to be a null pointer. If copying takes place between
objects that overlap, the behavior is undefined.
The streqxfrm() function shall not change the setting of errno if
successful.
Since no return value is reserved to indicate an error, an application
wishing to check for error situations should set errno to 0, then call
streqxfrm(), then check errno.
RETURN VALUE
Upon successful completion, streqxfrm() shall return the length of the
transformed string (not including the terminating null byte). If the
value returned is n or more, the contents of the array pointed to by s1
are unspecified.
ERRORS
The streqxfrm() function may fail if:
[EINVAL]
The string pointed to by the s2 argument contains characters
outside the domain of the collating sequence.
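A caller would presumably use this the same way as strxfrm(): ask for the length with n equal to 0, allocate, then transform. The sketch below uses the existing strxfrm() as a stand-in for the proposed streqxfrm(); make_sort_key is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated sort key for s, or NULL if allocation fails.
   strxfrm() stands in for the proposed streqxfrm(); the caller frees
   the result.  */
char *
make_sort_key (const char *s)
{
  assert (s != NULL);
  size_t len = strxfrm (NULL, s, 0);    /* n == 0: only query the length */
  char *key = malloc (len + 1);
  if (key != NULL)
    strxfrm (key, s, len + 1);          /* room for the terminating null */
  return key;
}
```

Two keys built this way can be compared with strcmp(), and the sign of the result matches the collation function applied to the original strings.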
===========================================================================
Ad 2): Do you need 2 functions, one for char * strings, and one for wide
strings here as well?
Yes, for example mbelemlen("ch") == 2 in Czech locales (or
mbelemlen("aa") == 2 in Danish locales), likewise for wcelemlen(L"ch")
or wcelemlen(L"aa").
However, this part is less important until
http://sourceware.org/bugzilla/show_bug.cgi?id=11561 is fixed upstream.
Again, here is a better description:
NAME
wcelemlen - get number of wide characters in a collation element
SYNOPSIS
#include <wchar.h>
size_t wcelemlen(const wchar_t *restrict ws, size_t n);
DESCRIPTION
wcelemlen() shall inspect at most n wide characters pointed to by ws to
determine the number of wide characters constituting the first collation
element in the string pointed to by ws, as described under "Collation
Order" in the POSIX standard.
The behavior of this function is affected by the LC_CTYPE and LC_COLLATE
categories of the current locale.
RETURN VALUE
The wcelemlen() function shall return the first of the following that
applies:
0
If n is zero or ws points to the null wide character.
between 1 and n
If the next n or fewer wide characters complete a valid collation
element; the value returned shall be the number of wide characters
that complete the collation element. If multiple collation elements
share a prefix in the current locale definition, the number of
wide characters required to complete the longest collation element
shall be returned. For example, if the string "ch" is a collating
element defined using the line:
collating-element <ch-digraph> from "<c><h>"
in the locale definition, wcelemlen(L"ch", 2) shall
return 2 even though "c" also forms a valid collating element.
It is *not* an error if a longer collation element exists that has
the next n wide characters as a prefix. For example, if the string
"ch" is a collating element defined as above, wcelemlen(L"c", 1)
shall return 1.
(size_t)-1
If the ws argument points to a wide-character code outside the
domain of the collating sequence, or includes such a code and the
characters before it do not form a complete collation element.
In this case, [EINVAL] shall be stored in errno.
ERRORS
[EINVAL]
The ws argument contains wide-character codes outside the domain of
the collating sequence.
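The longest-match rule in the RETURN VALUE section can be illustrated with a toy model. This is a sketch under stated assumptions, not a proposed implementation: toy_elements and toy_wcelemlen are hypothetical names, the table holds only "ch", every other wide character is treated as a one-character element, and the (size_t)-1 case is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical table of multi-character collating elements, standing in
   for what the locale definition would provide (here only "ch").  */
static const wchar_t *const toy_elements[] = { L"ch" };

/* Toy model of the proposed wcelemlen(): length in wide characters of
   the first collation element among the first n wide characters of ws.  */
size_t
toy_wcelemlen (const wchar_t *ws, size_t n)
{
  assert (ws != NULL || n == 0);
  if (n == 0 || ws[0] == L'\0')
    return 0;
  size_t best = 1;    /* every single character is an element in this toy */
  for (size_t i = 0; i < sizeof toy_elements / sizeof toy_elements[0]; i++)
    {
      size_t len = wcslen (toy_elements[i]);
      /* Only elements that fit within n are considered, so a longer
         element that merely has the input as a prefix is not an error.  */
      if (len <= n && len > best && wcsncmp (ws, toy_elements[i], len) == 0)
        best = len;   /* prefer the longest matching element */
    }
  return best;
}
```

With this table, toy_wcelemlen(L"ch", 2) yields 2 and toy_wcelemlen(L"c", 1) yields 1, matching the two examples in the description.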
===================================================================
NAME
mbelemlen - get number of bytes in a collation element
SYNOPSIS
#include <wchar.h>
size_t mbelemlen(const char *restrict s, size_t n,
mbstate_t *restrict ps);
DESCRIPTION
If s is a null pointer, the mbelemlen() function shall be equivalent to
the call:
mbrtowc(NULL, "", 1, ps)
In this case, the value of the argument n is ignored.
If s is not a null pointer, mbelemlen() shall inspect at most n bytes
pointed to by s to determine the number of bytes constituting the first
collation element in the string pointed to by s, as described under
"Collation Order" in the POSIX standard.
If ps is a null pointer, the mbelemlen() function shall use its own
internal mbstate_t object, which is initialized at program start-up to
the initial conversion state. Otherwise, the mbstate_t object pointed to
by ps shall be used to completely describe the current conversion state
of the associated character sequence. The implementation shall behave
as if no function defined in POSIX calls mbelemlen().
The behavior of this function is affected by the LC_CTYPE and LC_COLLATE
categories of the current locale.
RETURN VALUE
The mbelemlen() function shall return the first of the following that
applies:
0
If the next n or fewer bytes complete the character that
corresponds to the null wide character.
between 1 and n
If the next n or fewer bytes complete a valid collation
element; the value returned shall be the number of bytes that
complete the collation element. If multiple collation elements
share a prefix in the current locale definition, the number of
bytes required to complete the longest collation element shall
be returned. For example, if the string "ch" is a collating
element defined using the line:
collating-element <ch-digraph> from "<c><h>"
in the locale definition, mbelemlen("ch", 2, NULL) shall
return 2 even though "c" also forms a valid collating element.
(size_t)-2
If the next n bytes contribute to an incomplete but potentially
valid character, and all n bytes have been processed. When n
has at least the value of the {MB_CUR_MAX} macro, this case can
only occur if s points at a sequence of redundant shift
sequences (for implementations with state-dependent encodings).
An implementation shall *not* return (size_t)-2 if the next
n bytes contribute to a complete character, and a longer
collation element exists that has the next n bytes as a prefix.
For example, if the string "ch" is a collating element defined
as above, mbelemlen("c", 1, NULL) shall return 1 rather than
(size_t)-2.
(size_t)-1
If an encoding error occurs, in which case the next n or fewer
bytes do not contribute to a complete and valid character. In
this case, [EILSEQ] shall be stored in errno and the conversion
state is undefined.
(size_t)-1 shall also be returned if the first character decoded
from s falls outside the domain of the collating sequence, or
if such a character is decoded from s before a complete collation
element is formed. In this case, [EINVAL] shall be stored in errno.
ERRORS
The mbelemlen() function may fail if:
[EINVAL]
ps points to an object that contains an invalid conversion
state, or s contains characters outside the domain of the
collating sequence.
[EILSEQ]
An invalid character sequence is detected.
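As a rough illustration of how such a function would be used, the sketch below walks a string one collation element at a time and applies the [.S1.] validity check from point 2 (mbelemlen(S1) == strlen(S1)). Everything here is a toy under stated assumptions: toy_mbelemlen, toy_count_elements and toy_is_collating_symbol are hypothetical names, the encoding is assumed to be single-byte (so there is no mbstate_t argument), and "ch" is the only multi-character element.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the proposed mbelemlen(), restricted to single-byte
   encodings (no mbstate_t handling).  The only multi-character
   collating element known to this toy is the hypothetical "ch".  */
size_t
toy_mbelemlen (const char *s, size_t n)
{
  assert (s != NULL);
  if (n == 0)
    return (size_t) -2;     /* nothing to inspect: incomplete */
  if (s[0] == '\0')
    return 0;               /* null character */
  if (n >= 2 && strncmp (s, "ch", 2) == 0)
    return 2;               /* longest match wins over plain "c" */
  return 1;
}

/* Count the collation elements in s by repeatedly skipping the first one.  */
size_t
toy_count_elements (const char *s)
{
  size_t count = 0;
  size_t left = strlen (s);
  while (left > 0)
    {
      size_t len = toy_mbelemlen (s, left);   /* always 1 or 2 here */
      s += len;
      left -= len;
      count++;
    }
  return count;
}

/* A [.S.] bracket symbol is accepted only if S is exactly one element.  */
int
toy_is_collating_symbol (const char *s)
{
  size_t len = strlen (s);
  return len > 0 && toy_mbelemlen (s, len) == len;
}
```

Under these assumptions "chata" splits into four elements (ch, a, t, a), and "ch" is accepted as a collating symbol while "cha" is rejected.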
===========================================================================
Paolo