Re: From wchar_t to char32_t

Bruno Haible Mon, 10 Jul 2023 08:11:33 -0700

Regarding my proposed 'dfa' module patch:
Paul Eggert wrote on 2023-07-04:
> > -      wchar_t wch;
> > -      size_t nbytes = mbrtowc (&wch, s, n, &d->mbs);
> > +      char32_t wch;
> > +      size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs);
> >        if (0 < nbytes && nbytes < (size_t) -2)
> >          {
> >            *pwc = wch;
> > +          if (nbytes == (size_t) -3)
> > +            nbytes = 0;
> >            return nbytes;
> 
> That last change doesn't match the comment for the mbs_to_wchar 
> function, which says that the function always returns a positive int. 
> Callers depend on this.


Indeed, the function fetch_wc and its callers expect that the 'wctok'
field contains a single integer that represents the multibyte sequence
as a whole.

Fundamentally the problem is that in a character range in a regex
  [ MB1 MB2 ... ]
the multibyte sequence boundaries are also the parse boundaries. If MB1
gets transformed to two Unicode characters U1 U2, the character range
  [ U1 U2 MB2 ... ]
is something very different.

So, in locales where the locale encoding is BIG5-HKSCS, in which some
multibyte characters map to two Unicode characters, we would have a
problem. We would need to distinguish application uses where it is OK
to split a character into several Unicode characters (such as
computing the total width, as in mbswidth.c) from application uses where
the multibyte character must be kept together, with a Unicode-side
representation of several Unicode characters.
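
To illustrate the second kind of use, here is a minimal sketch of a helper
that keeps one multibyte character together while collecting all the
Unicode characters it maps to. The name decode_one_mb and its signature are
hypothetical; this is not code from dfa.c. It assumes that *mbs has no
pending output on entry and that, once a multibyte character is complete, a
further mbrtoc32 call with n = 0 returns (size_t) -3 for each remaining
Unicode character and consumes no input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper, for illustration only.
   Decode ONE multibyte character from S[0..N-1].  Store its Unicode
   characters (at most MAX of them) in OUT and their number in *COUNT.
   Return the number of bytes consumed, which is always positive.
   On an encoding error or an incomplete sequence, set *COUNT to 0,
   reset *MBS, and consume a single byte.  */
size_t
decode_one_mb (char32_t *out, size_t max, size_t *count,
               const char *s, size_t n, mbstate_t *mbs)
{
  assert (max >= 1 && n >= 1);
  *count = 0;

  char32_t wc;
  size_t nbytes = mbrtoc32 (&wc, s, n, mbs);

  /* (size_t) -1: invalid, (size_t) -2: incomplete, (size_t) -3: pending
     output left over from an earlier call, which this sketch assumes
     does not happen.  Treat all three alike.  */
  if (nbytes >= (size_t) -3)
    {
      memset (mbs, 0, sizeof *mbs);
      return 1;
    }

  out[(*count)++] = wc;
  if (nbytes == 0)
    return 1;                   /* The null character occupies one byte.  */

  /* Drain further Unicode characters that belong to the same multibyte
     character.  Passing n = 0 ensures that no new input is consumed.  */
  while (*count < max)
    {
      size_t r = mbrtoc32 (&wc, s + nbytes, 0, mbs);
      if (r != (size_t) -3)
        break;
      out[(*count)++] = wc;
    }

  return nbytes;
}
```

With such a helper the return value stays positive, as callers of
mbs_to_wchar expect, but the result is a short sequence of Unicode
characters rather than a single integer, so a field like 'wctok' could not
hold it unchanged.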

Bruno
