Regarding my proposed 'dfa' module patch:

Paul Eggert wrote on 2023-07-04:
> > -  wchar_t wch;
> > -  size_t nbytes = mbrtowc (&wch, s, n, &d->mbs);
> > +  char32_t wch;
> > +  size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs);
> >    if (0 < nbytes && nbytes < (size_t) -2)
> >      {
> >        *pwc = wch;
> > +      if (nbytes == (size_t) -3)
> > +        nbytes = 0;
> >        return nbytes;
>
> That last change doesn't match the comment for the mbs_to_wchar
> function, which says that the function always returns a positive int.
> Callers depend on this.

Indeed, the function fetch_wc and its callers expect that the 'wctok'
field contains an integer that represents the multibyte sequence as a
whole.

Fundamentally the problem is that in a character range in a regex
  [ MB1 MB2 ... ]
the multibyte sequence boundaries are also the parse boundaries. If MB1
gets transformed to two Unicode characters U1 U2, the character range
  [ U1 U2 MB2 ... ]
is something very different.

So, in locales where the locale encoding is BIG5-HKSCS we would have a
problem.

We would need to distinguish
  - application uses where it is OK to split a character into several
    Unicode characters (such as for computing the total width, as in
    mbswidth.c), and
  - application uses where the multibyte character must be kept
    together, with a Unicode-side representation of several Unicode
    characters.

Bruno
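
PS: To illustrate the "kept together" kind of use, here is a minimal sketch, not part of the proposed patch. The helper name count_mb_chars is hypothetical. It assumes only the ISO C11 mbrtoc32 contract, under which a return value of (size_t) -3 means "another char32_t from the multibyte sequence already consumed". The loop counts each multibyte sequence once, however many char32_t values it expands to. Treating an invalid or incomplete sequence as a single byte is likewise an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper, not part of dfa.c: count the multibyte characters
   in the N bytes at S, in the current locale.  A multibyte sequence that
   mbrtoc32 expands to several char32_t values counts as ONE character.  */
size_t
count_mb_chars (const char *s, size_t n)
{
  assert (s != NULL);

  mbstate_t mbs;
  memset (&mbs, 0, sizeof mbs);   /* start in the initial shift state */
  size_t count = 0;

  while (n > 0)
    {
      char32_t wch;
      size_t nbytes = mbrtoc32 (&wch, s, n, &mbs);

      if (nbytes == (size_t) -3)
        /* A further char32_t of the previous multibyte sequence.
           No input was consumed, and it is not a new character.  */
        continue;

      if (nbytes == (size_t) -1 || nbytes == (size_t) -2)
        {
          /* Invalid or incomplete sequence (assumption of this sketch):
             count one byte as a unit and resynchronize.  */
          memset (&mbs, 0, sizeof mbs);
          nbytes = 1;
        }
      else if (nbytes == 0)
        nbytes = 1;               /* a null character occupies one byte */

      count++;
      s += nbytes;
      n -= nbytes;
    }

  return count;
}
```

A caller that needs the width instead could simply process every char32_t it receives, including those delivered with a (size_t) -3 return. A regex parser cannot do that, for the reason given above.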