Re: implementing extended bracket expressions in gnulib [was Re: Dealing with character ranges in grep]

Paolo Bonzini Thu, 09 Jun 2011 06:10:24 -0700

On 06/09/2011 01:53 PM, Bruno Haible wrote:

Paolo,

My proposal wouldn't change defaults, which is why I believe that this
is a separate topic.


But at the same time you are pushing for the use of --with-included-regex.
We found out that by doing this, the equivalence classes feature gets lost,
and the divergence between glibc and gnulib becomes greater.


I'm not pushing for that, though probably Karl and Aharon would be. :)

It can also make --with-included-regex the default on its own.

That would be bad IMO. It doesn't matter whether we split the process
in 1, 5 or 10 steps; what matters is that at each point all maintainers
agree on *not* changing the default for --with-included-regex. That's
why I want a good solution done fast, then we can perfect that solution
later.

We'd need glibc to export two functions in both multi-byte and
wide-character versions:

1) streqcoll(S1, S2) and wcseqcoll(S1, S2) would be the same as strcoll
and wcscoll, but they would compare only according to primary weights.
A slightly more formal definition is that streqcoll(S1, S2) == 0 iff S1
matches the \`[=C1=][=C2=][=C3=]...[=Cn=]\' regular expression, where Ci
are the characters of S2 (I'd need to double check this against POSIX
though).  When non-zero, the result of streqcoll(S1, S2) would be the
same as strcoll(S1, S2).  Likewise, glibc could provide streqxfrm and
wcseqxfrm, with the definition that strcmp(streqxfrm(S1), streqxfrm(S2))
== streqcoll(S1, S2).

2) On top of this, [.ss.] could be implemented using an additional
function mbelemlen(S) giving the length of the first collation element
in S.  [.S1.] would be rejected unless mbelemlen(S1) == strlen(S1), and
[.S1.] would match S2 if strcoll(S1, strndup(S2, mbelemlen(S2))) == 0.
wcelemlen could be provided likewise.

These are the minimal extensions that would be required to support full
regular expression features portably and in a manner that is compatible
with glibc, except for ranges


Great! These look like a good basis for discussing with the glibc people.

Ad 1): Is streqcoll symmetric? That is, is streqcoll(S1, S2) the same as
streqcoll(S2, S1)?

Yes. It's really an equivalence relation, in the mathematical sense.

Here are definitions like they might appear in POSIX (I'm omitting
wcseqcoll and wcseqxfrm).


===========================================================================

NAME

streqcoll - string comparison using equivalence class collation

SYNOPSIS

#define _GNU_SOURCE
#include <string.h>

int streqcoll(const char *s1, const char *s2);

DESCRIPTION

The streqcoll() function shall compare the string pointed to by s1 to
the string pointed to by s2, both interpreted as appropriate to the
LC_COLLATE category of the current locale. Each character is only
compared according to its primary equivalence class, as described under
"Collation Order" in the POSIX standard.

The streqcoll() function shall not change the setting of errno if
successful.

Since no return value is reserved to indicate an error, an application
wishing to check for error situations should set errno to 0, then call
streqcoll(), then check errno.

RETURN VALUE

Upon successful completion, streqcoll() shall return an integer greater
than, equal to, or less than 0, according to whether the string pointed
to by s1 is greater than, equal to, or less than the string pointed to
by s2 when both are interpreted as appropriate to the current locale and
according to the above description. On error, streqcoll() may set errno,
but no return value is reserved to indicate an error.

ERRORS

The streqcoll() function may fail if:

[EINVAL]
    The s1 or s2 arguments contain characters outside the domain of
    the collating sequence.

NOTES

If streqcoll(s1, s2) < 0, then strcoll(s1, s2) < 0.  Likewise,
if streqcoll(s1, s2) > 0, then strcoll(s1, s2) > 0.
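
As an illustration only, here is a minimal sketch of the semantics
described above. It is not a glibc implementation: toy_primary_weight(),
toy_streqcoll() and same_equivalence_classes() are hypothetical names,
and the "locale" is an assumption made up for the example (ASCII letters
that differ only in case share a primary weight, bytes above 127 are
outside the collating sequence). It also shows the errno protocol from
the DESCRIPTION.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical primary weight for a toy locale: ASCII letters fold to
   lower case, every other ASCII byte is its own class.  Returns -1 for
   bytes outside the toy collating sequence (anything above 127).  */
int toy_primary_weight(unsigned char c)
{
    if (c > 127)
        return -1;
    return tolower(c);
}

/* Toy stand-in for the proposed streqcoll(): compares primary weights
   only, and sets errno to EINVAL on bytes outside the domain.  */
int toy_streqcoll(const char *s1, const char *s2)
{
    assert(s1 != NULL && s2 != NULL);
    for (;; s1++, s2++) {
        int w1 = toy_primary_weight((unsigned char) *s1);
        int w2 = toy_primary_weight((unsigned char) *s2);
        if (w1 < 0 || w2 < 0) {
            errno = EINVAL;
            return 0;           /* no return value is reserved for errors */
        }
        if (w1 != w2)
            return w1 < w2 ? -1 : 1;
        if (w1 == 0)
            return 0;           /* both strings ended together */
    }
}

/* Caller-side protocol: clear errno, call, then check errno.
   *ok tells the caller whether the comparison was meaningful.  */
bool same_equivalence_classes(const char *s1, const char *s2, bool *ok)
{
    errno = 0;
    int r = toy_streqcoll(s1, s2);
    *ok = (errno == 0);
    return *ok && r == 0;
}
```

With this toy table, toy_streqcoll("Abc", "aBC") is 0, while a string
containing the byte 0xff makes same_equivalence_classes() report failure
through *ok instead of a misleading "equal".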

===========================================================================

NAME

streqxfrm - string transformation using equivalence class collation

SYNOPSIS

#define _GNU_SOURCE
#include <string.h>

size_t streqxfrm(char *restrict s1, const char *restrict s2, size_t n);

DESCRIPTION

The streqxfrm() function shall transform the string pointed to by s2 and
place the resulting string into the array pointed to by s1. The
transformation is such that if strcmp() is applied to two transformed
strings, it shall return a value greater than, equal to, or less than 0,
corresponding to the result of streqcoll() applied to the same two
original strings. No more than n bytes are placed into the resulting
array pointed to by s1, including the terminating null byte. If n is 0,
s1 is permitted to be a null pointer. If copying takes place between
objects that overlap, the behavior is undefined.


The streqxfrm() function shall not change the setting of errno if
successful.

Since no return value is reserved to indicate an error, an application
wishing to check for error situations should set errno to 0, then call
streqxfrm(), then check errno.

RETURN VALUE

Upon successful completion, streqxfrm() shall return the length of the
transformed string (not including the terminating null byte). If the
value returned is n or more, the contents of the array pointed to by s1
are unspecified.


ERRORS

The streqxfrm() function may fail if:

[EINVAL]
    The string pointed to by the s2 argument contains characters
    outside the domain of the collating sequence.
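
The size contract above is the same as strxfrm()'s, so the usual
two-call idiom would apply. A minimal sketch, assuming a toy
transformation (folding ASCII letters to lower case) in place of real
primary weights; toy_streqxfrm() and toy_streqxfrm_dup() are
hypothetical names, not existing glibc functions:

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the proposed streqxfrm().  Returns the length of the
   transformed string; writes it (with the terminating null byte) only
   when it fits in n bytes.  Otherwise dst is left untouched, which is
   allowed because its contents are unspecified in that case.  */
size_t toy_streqxfrm(char *dst, const char *src, size_t n)
{
    size_t len = strlen(src);
    assert(dst != NULL || n == 0);   /* dst may be NULL only if n is 0 */
    if (len < n) {
        for (size_t i = 0; i < len; i++)
            dst[i] = (char) tolower((unsigned char) src[i]);
        dst[len] = '\0';
    }
    return len;
}

/* Two-call idiom: query the size with n == 0, then transform into a
   freshly allocated buffer.  Returns NULL if allocation fails.  */
char *toy_streqxfrm_dup(const char *src)
{
    size_t need = toy_streqxfrm(NULL, src, 0) + 1;
    char *key = malloc(need);
    if (key != NULL)
        toy_streqxfrm(key, src, need);
    return key;
}
```

Keys built this way can be compared with plain strcmp(), which is the
point of the xfrm variant when the same string is compared many times.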

===========================================================================

Ad 2): Do you need 2 functions, one for char * strings, and one for wide
strings here as well?

Yes, for example mbelemlen("ch") == 2 in Czech locales (or
mbelemlen("aa") == 2 in Danish locales), likewise for wcelemlen(L"ch")
or wcelemlen(L"aa").

However, this part is less important until
http://sourceware.org/bugzilla/show_bug.cgi?id=11561 is fixed upstream.


Again, here is a better description:

NAME

    wcelemlen - get number of wide characters in a collation element

SYNOPSIS

#include <wchar.h>

size_t wcelemlen(const wchar_t *restrict ws, size_t n);

DESCRIPTION

wcelemlen() shall inspect at most n wide characters pointed to by ws to
determine the number of wide characters constituting the first collation
element in the string pointed to by ws, as described under "Collation
Order" in the POSIX standard.

The behavior of this function is affected by the LC_CTYPE and LC_COLLATE
categories of the current locale.


RETURN VALUE

The wcelemlen() function shall return the first of the following that
applies:


0
    If n is zero or ws points to the null wide character.

between 1 and n
    If the next n or fewer wide characters complete a valid collation
    element; the value returned shall be the number of wide characters
    that complete the collation element.  If multiple collation elements
    share a prefix in the current locale definition, the number of
    wide characters required to complete the longest collation element
    shall be returned.  For example, if the string "ch" is a collating
    element defined using the line:

       collating-element <ch-digraph> from "<c><h>"

    in the locale definition, wcelemlen(L"ch", 2) shall
    return 2 even though "c" also forms a valid collating element.
    It is *not* an error if a longer collation element exists that has
    the next n wide characters as a prefix.  For example, if the string
    "ch" is a collating element defined as above, wcelemlen(L"c", 1)
    shall return 1.

(size_t)-1
    If the ws argument points to a wide-character code outside the
    domain of the collating sequence, or includes such a code and the
    characters before it do not form a complete collation element.
    In this case, [EINVAL] shall be stored in errno.

ERRORS

[EINVAL]
    The ws argument contains wide-character codes outside the domain of
    the collating sequence.
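
To make the longest-match rule concrete, here is a minimal sketch for a
made-up locale whose only multi-character collating element is "ch".
toy_wcelemlen() and toy_count_elements() are hypothetical names; a real
implementation would consult the locale's collation tables, and this toy
never returns (size_t)-1 because every wide character is treated as
being in the collating sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Toy stand-in for the proposed wcelemlen(): number of wide characters
   in the first collation element of ws, looking at no more than n.  */
size_t toy_wcelemlen(const wchar_t *ws, size_t n)
{
    assert(ws != NULL || n == 0);
    if (n == 0 || ws[0] == L'\0')
        return 0;
    /* Longest match first: prefer "ch" over "c" when both fit in n. */
    if (n >= 2 && ws[0] == L'c' && ws[1] == L'h')
        return 2;
    /* Otherwise a single wide character is a complete element, even
       if it is the prefix of a longer one that does not fit in n.  */
    return 1;
}

/* Walk a null-terminated wide string one collation element at a time
   and return how many elements it contains.  */
size_t toy_count_elements(const wchar_t *ws)
{
    size_t count = 0, len;
    while ((len = toy_wcelemlen(ws, wcslen(ws))) != 0) {
        ws += len;
        count++;
    }
    return count;
}
```

Under these assumptions L"chech" splits into three elements ("ch", "e",
"ch"), and toy_wcelemlen(L"ch", 1) is 1 because only one wide character
may be inspected.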

===================================================================

NAME

    mbelemlen - get number of bytes in a collation element

SYNOPSIS

#include <wchar.h>

size_t mbelemlen(const char *restrict s, size_t n,
       mbstate_t *restrict ps);

DESCRIPTION

If s is a null pointer, the mbelemlen() function shall be equivalent to
the call:


    mbrtowc(NULL, "", 1, ps)

In this case, the value of the argument n is ignored.

If s is not a null pointer, mbelemlen() shall inspect at most n bytes
pointed to by s to determine the number of bytes constituting the first
collation element in the string pointed to by s, as described under
"Collation Order" in the POSIX standard.

If ps is a null pointer, the mbelemlen() function shall use its own
internal mbstate_t object, which is initialized at program start-up to
the initial conversion state. Otherwise, the mbstate_t object pointed to
by ps shall be used to completely describe the current conversion state
of the associated character sequence. The implementation shall behave
as if no function defined in POSIX calls mbelemlen().

The behavior of this function is affected by the LC_CTYPE and LC_COLLATE
categories of the current locale.


RETURN VALUE

The mbelemlen() function shall return the first of the following that
applies:


0
    If the next n or fewer bytes complete the character that
    corresponds to the null wide character.

between 1 and n
    If the next n or fewer bytes complete a valid collation
    element; the value returned shall be the number of bytes that
    complete the collation element.  If multiple collation elements
    share a prefix in the current locale definition, the number of
    bytes required to complete the longest collation element shall
    be returned.  For example, if the string "ch" is a collating
    element defined using the line:

       collating-element <ch-digraph> from "<c><h>"

    in the locale definition, mbelemlen("ch", 2, NULL) shall
    return 2 even though "c" also forms a valid collating element.

(size_t)-2
    If the next n bytes contribute to an incomplete but potentially
    valid character, and all n bytes have been processed. When n
    has at least the value of the {MB_CUR_MAX} macro, this case can
    only occur if s points at a sequence of redundant shift
    sequences (for implementations with state-dependent encodings).

    An implementation shall *not* return (size_t)-2 if the next
    n bytes contribute to a complete character, and a longer
    collation element exists that has the next n bytes as a prefix.
    For example, if the string "ch" is a collating element defined
    as above, mbelemlen("c", 1, NULL) shall return 1 rather than
    (size_t)-2.

(size_t)-1
    If an encoding error occurs, in which case the next n or fewer
    bytes do not contribute to a complete and valid character. In
    this case, [EILSEQ] shall be stored in errno and the conversion
    state is undefined.

    (size_t)-1 shall also be returned if the first character decoded
    from s falls outside the domain of the collating sequence, or
    if such a character is decoded from s before a complete collation
    element is formed.  In this case, [EINVAL] shall be stored in errno.

ERRORS

The mbelemlen() function may fail if:

[EINVAL]
    ps points to an object that contains an invalid conversion
    state, or s contains characters outside the domain of the
    collating sequence.

[EILSEQ]
    An invalid character sequence is detected.
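
Going back to point 2) above, the [.S1.] check could be sketched on top
of such a function. This is only an illustration under loud assumptions:
a stateless single-byte encoding (so the ps argument is dropped and
(size_t)-1 never occurs), a made-up locale whose only multi-byte
collating element is "ch", and the default "C" locale for strcoll().
toy_mbelemlen() and toy_collsym_match() are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Toy stand-in for the proposed mbelemlen(), without the mbstate_t
   argument: bytes in the first collation element of s, inspecting at
   most n bytes.  */
size_t toy_mbelemlen(const char *s, size_t n)
{
    assert(s != NULL);
    if (n == 0)
        return (size_t) -2;     /* nothing could be inspected */
    if (s[0] == '\0')
        return 0;
    if (n >= 2 && s[0] == 'c' && s[1] == 'h')
        return 2;               /* longest element wins */
    return 1;
}

/* Does the collating symbol [.sym.] match at the start of text?
   Returns -1 if sym is not exactly one collation element (the bracket
   expression would be rejected), 1 on a match, 0 otherwise.  */
int toy_collsym_match(const char *sym, const char *text)
{
    size_t symlen = strlen(sym);
    if (symlen == 0 || toy_mbelemlen(sym, symlen) != symlen)
        return -1;

    /* Extract the first collation element of text.  Passing
       strlen(text) + 1 lets the terminating null byte be inspected,
       so an empty text yields length 0.  */
    char elem[3] = { 0 };       /* elements are at most 2 bytes here */
    size_t len = toy_mbelemlen(text, strlen(text) + 1);
    memcpy(elem, text, len);

    return strcoll(sym, elem) == 0;
}
```

Note that [.c.] does not match at the start of "chata" here, because the
first collation element of that string is "ch" rather than "c".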

===========================================================================

Paolo
