Paul Eggert wrote in <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00145.html>:
> Here are the results: > > noop mbiterf mbuiterf mbcel mbucel > a 1.546 2.038 1.931 1.639 1.968 > b 1.545 2.016 1.936 1.705 1.969 > c 2.153 8.160 8.255 6.160 5.550 > d 2,153 6,692 6,686 6,322 5,520 > e 2.234 6.866 6.817 6.214 5.781 > f 1.843 95.276 104.270 58.667 60.902 > g 1,844 76,126 83,899 65,541 66,983 > h 3.297 75.727 83.735 64.261 65.686 > i 1.941 31.064 34.887 26.262 27.072 > j 1.310 29.620 33.698 24.751 25.550 To me, the columns timings of mbiterf and mbuiterf are good enough. The rows c and f will improve when glibc's C locale bug is fixed. For the rest, I see the timings as close enough to the optimum. > But in another sense these benchmarks are biased against mbcel, as this > benchmark compiles in a special environment that defines NDEBUG, which > most Gnulib-using code doesn't do Yes, it's so tempting to take out all assert()s from production code :) But I won't do it nevertheless, because I fear platform bugs in the mb* code. > For applications that need MEE, it'd be easy for mbcel to supply it. > Something like the following patch would do it, though it should be a > different function (e.g., mbcel_scanmee) instead of being a change to > mbcel_scan itself. However, diffutils doesn't need to bother with such a > change as it can use SEE like most other programs do. > > --- a/lib/mbcel.h > +++ b/lib/mbcel.h > @@ -191,3 +191,3 @@ mbcel_scan (char const *p, char const *lim) > if (_GL_UNLIKELY ((size_t) -1 / 2 < len)) > - return (mbcel_t) { .err = *p, .len = 1 }; > + return (mbcel_t) { .err = *p, .len = len == (size_t) -2 ? lim - p : 1 > }; With a patch like this, i.e. with MEE, I would welcome mbcel and mbucel in Gnulib. (Although maybe you may want to align the module name to be similar to mbiterf and mbuiterf : maybe mbitervf and mbuitervf for "very fast"?) > SEE is simpler and is common practice elsewhere, notably Emacs. Emacs is a complex beast. I can understand if the Emacs developers want an implementationally-simple behaviour, rather than a simple-from-the- user-perspective behaviour. But other GNU programs can and should do it better, since modules 'mb*iter*f' already exist that hide the implementational complexity. Paul Eggert wrote in <https://lists.gnu.org/archive/html/bug-gnulib/2023-07/msg00149.html>: > It would be better to change mbu?iterf? to use > single-byte-per-encoding-error ("SEE") behavior No. No way. This would be a regression. > This is because mbu?iterf? doesn't implement MEE either. > > For MEE, mbiterf would need something like the attached untested patch, > and mbiter, mbcel, etc. would all need similar patches. Good point. Actually, with iconv() based converters one needs such a loop, because iconv's calling conventions don't allow to return detailed information. (See iconv_carefully and iconv_carefully_1 in lib/striconveh.c.) mbrtowc / mbrtoc32 is a bit different, but since upon return it does not give detailed information about the structure of the invalid byte sequence, it may indeed need such a loop. For comparison, <unistr.h> u8_mbtouc returns the number of bytes, but OTOH does not allow to distinguish U+FFFD vs. incomplete/invalid byte sequence. I'll think more about this one... Bruno