Re: From wchar_t to char32_t

2023-07-04 Thread Paul Eggert
On 2023-07-01 07:35, Bruno Haible wrote:

-  wchar_t wch;
-  size_t nbytes = mbrtowc (&wch, s, n, &d->mbs);
+  char32_t wch;
+  size_t nbytes = mbrtoc32 (&wch, s, n, &d->mbs);
   if (0 < nbytes && nbytes < (size_t) -2)
     {
       *pwc = wch;
+      if (nbytes =

Re: From wchar_t to char32_t

2023-07-04 Thread Jim Meyering
On Sat, Jul 1, 2023 at 7:35 AM Bruno Haible wrote:
>
> Here is a proposed patch to overcome the wchar_t limitation in the 'dfa'
> module.
>
> Jim: The background is explained in
>
> The plan was exposed in
>

Re: From wchar_t to char32_t

2023-07-04 Thread Paul Eggert
On 2023-07-04 12:31, Bruno Haible wrote: Yes. As far as I can see, this proposed patch should cope with (size_t) -3 returns correctly. I still see a couple of problems with it. First, it mishandles the case where mbrtoc32 returns 0, which ISO C allows. Second and more interestingly, its "fw

Re: From wchar_t to char32_t

2023-07-04 Thread Bruno Haible
[CCing diffutils-devel.]

Paul Eggert wrote in :
> > Level 3: Behave correctly. Don't split a 2-Unicode-character sequence.
> >          This is what code that uses mbrtoc32() does, when it has the
> >          lines
> >

Re: From wchar_t to char32_t

2023-07-04 Thread Bruno Haible
I wrote:
> Level 2: Behave correctly, except that a 2-Unicode-character sequence
>          may be split although it shouldn't.
>          This is what code that uses mbrtoc32() does, when it has the
>          lines
>            if (bytes == (size_t) -3)
>              bytes = 0;
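
For context, here is a minimal sketch of a decoding loop that contains the two lines quoted above. The helper name decode_c32 and its buffer interface are hypothetical and are not taken from the patch. The sketch stops at the first invalid or incomplete sequence, assumes that the null character is encoded as a single zero byte, and does not flush a character that is still pending when the input ends.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/* Hypothetical helper (not from the patch): decode the multibyte string
   S[0..N) into OUT[0..MAX) and return the number of char32_t stored.
   Decoding stops at the first invalid or incomplete sequence.  */
size_t
decode_c32 (const char *s, size_t n, char32_t *out, size_t max)
{
  assert (s != NULL && out != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* start in the initial state */

  size_t count = 0;
  while (n > 0 && count < max)
    {
      char32_t wc;
      size_t bytes = mbrtoc32 (&wc, s, n, &state);

      if (bytes == (size_t) -1 || bytes == (size_t) -2)
        break;                        /* invalid or incomplete input */

      if (bytes == (size_t) -3)
        bytes = 0;                    /* WC came from earlier input;
                                         no new bytes were consumed */
      else if (bytes == 0)
        bytes = 1;                    /* null character; assumed to be
                                         a single zero byte */

      out[count++] = wc;
      s += bytes;
      n -= bytes;
    }
  return count;
}
```

With this shape, a (size_t) -3 return stores one more character without advancing the input pointer, which is exactly what setting bytes to 0 achieves.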

mbiter, mbuiter: Improve state handling after invalid input

2023-07-04 Thread Bruno Haible
After mbrtowc or mbrtoc32 has failed with return value (size_t) -1, we don't
know which state the mbstate_t is in. Therefore it's best to clear it before
potentially calling mbrtowc or mbrtoc32 again.

2023-07-04  Bruno Haible

	mbiter, mbuiter, mbfile: Improve state handling after invalid in
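
The idea can be illustrated with a minimal sketch. The helper count_chars below is hypothetical and is not the mbiter, mbuiter, or mbfile code; it uses mbrtowc, counts each invalid byte as one character, and assumes that the null character is encoded as a single zero byte. The relevant part is the memset after a (size_t) -1 return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper (not the Gnulib code): count the characters in
   S[0..N), treating each invalid byte as one character.  */
size_t
count_chars (const char *s, size_t n)
{
  assert (s != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* start in the initial state */

  size_t count = 0;
  while (n > 0)
    {
      wchar_t wc;
      size_t nbytes = mbrtowc (&wc, s, n, &state);

      if (nbytes == (size_t) -1)
        {
          /* Invalid input: the conversion state is unspecified now,
             so clear it before calling mbrtowc again.  */
          memset (&state, 0, sizeof state);
          nbytes = 1;                 /* skip one byte and resynchronize */
        }
      else if (nbytes == (size_t) -2)
        {
          /* Incomplete character at the end of the input: count the
             remaining bytes as one unit and clear the state.  */
          memset (&state, 0, sizeof state);
          nbytes = n;
        }
      else if (nbytes == 0)
        nbytes = 1;                   /* null character; assumed to be
                                         a single zero byte */

      s += nbytes;
      n -= nbytes;
      count++;
    }
  return count;
}
```

Without the memset in the (size_t) -1 branch, the next mbrtowc call would start from whatever state the failed call left behind.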

Re: proposed performance tweaks to Gnulib mbchar module

2023-07-04 Thread Bruno Haible
Hi Paul,

Paul Eggert wrote:
> Attached are two proposed performance tweaks I found by inspection. No
> big deal of course.

Thanks. I committed the first one in your name, with a reference to the precise ISO C section. Then I found that POSIX's "Portable character set" goes beyond that, and thus