Paul Eggert wrote on 2023-07-06:
> in reviewing it found a minor
> glitch or two and some opportunities for simplification. I installed the
> attached further patch which I hope fixes glitches without breaking
> anything else.

Comments:

  - Typo: s/mbrtoc23/mbrtoc32/

  - The rationale for defining and initializing the mbstate_t at the
    function scope was that on BSD and macOS systems, an mbstate_t is
    128 bytes large, thus the time to zero-initialize it is not
    negligible. The code with minimal-scope mbstate_t is clearer, but
    slower on BSD systems (assuming a string with many switches between
    ASCII and non-ASCII characters). OTOH, on a purely ASCII string,
    it's obviously faster to not initialize an mbstate_t than to
    initialize it.

> One other thing I discovered in my review. POSIX says that 'diff' need
> not support locking-shift sequences[1], and this business of mbrtoc23
> returning (size_t) -3 is in a murky area as it would appear to fall into
> the locking-shift sequence category (at any rate, it doesn't appear to
> be a single-shift encoding which is POSIX's only other option for
> state-dependent encodings). Or maybe the next version of POSIX will have
> to change in this area?
>
> [1]:
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_02

I think this wording regarding "single-shift" sequences and
"locking-shift" sequences is more than 20 years old:

  - "single-shift" encodings are encodings such as EUC-JP. Before 2000,
    some people viewed them as encodings with "shift". This paragraph is
    merely a clarification that it's better to view these encodings as
    normal multibyte encodings without shift.

  - "locking-shift" encodings are things like ISO-2022-JP-2. Around 1999,
    some people were experimenting with a hacked Linux libc that used
    this encoding as a locale encoding. Of course, the resulting system
    was full of bugs, because even simple operations such as
    concatenating two directory names sometimes produced wrong results.
    And I'm not even talking about the missing normalization of file
    names...

So, since 2000, there is an overall agreement that "locking-shift"
encodings are not usable as locale encodings. They are merely usable
with the 'iconv' facility.

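To make the mbstate_t trade-off from the comments above concrete, here
is a minimal sketch of the two placements. It is not the diffutils
code: the names step, count_function_scope and count_minimal_scope are
made up for illustration. It assumes a C11 libc that provides <uchar.h>
and a locale encoding in which bytes below 0x80 are ASCII. Both
functions count char32_t values, including an extra one announced by a
(size_t) -3 return.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper: decode one char32_t at *p and advance *p.
   An invalid or incomplete sequence is treated as a single byte. */
static void step(const char **p, const char *end, mbstate_t *st)
{
    char32_t c;
    size_t r = mbrtoc32(&c, *p, (size_t)(end - *p), st);
    if (r == (size_t)-3)
        return;                     /* extra char32_t, no input consumed */
    if (r == (size_t)-1 || r == (size_t)-2 || r == 0) {
        memset(st, 0, sizeof *st);  /* back to the initial state */
        r = 1;
    }
    *p += r;
}

/* Variant A: one mbstate_t at function scope, zeroed exactly once,
   even when the string turns out to be purely ASCII. */
size_t count_function_scope(const char *s)
{
    assert(s != NULL);
    const char *end = s + strlen(s);
    size_t n = 0;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    while (s < end || !mbsinit(&st)) {
        if (mbsinit(&st) && (unsigned char)*s < 0x80)
            s++;                    /* ASCII fast path */
        else
            step(&s, end, &st);
        n++;
    }
    return n;
}

/* Variant B: minimal-scope mbstate_t, zeroed at every non-ASCII
   character, never touched for a purely ASCII string. */
size_t count_minimal_scope(const char *s)
{
    assert(s != NULL);
    const char *end = s + strlen(s);
    size_t n = 0;
    while (s < end) {
        if ((unsigned char)*s < 0x80) {
            s++;
            n++;
            continue;
        }
        mbstate_t st;
        memset(&st, 0, sizeof st);  /* 128 bytes on BSD and macOS */
        do {
            step(&s, end, &st);
            n++;
        } while (!mbsinit(&st));    /* drain a pending char32_t, if any */
    }
    return n;
}
```

Both variants return the same count; they differ only in how often the
memset runs. Variant A pays for it once per call, variant B once per
non-ASCII character and not at all on a purely ASCII string.
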
POSIX does not have a term for the type of encoding that BIG5-HKSCS is,
where an (indivisible) multibyte-sequence maps to a sequence of
2 Unicode characters.

Bruno