On 2025-01-03 23:39, Paul Eggert wrote:
Don't we have problems with mbs_startswith, though? If the prefix ends in an incomplete multibyte character (an encoding error), the current code can match that to part of a multibyte character in the string. This doesn't match what you'd get if you ran mbiter on both prefix and string and matched each component you found.
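To make the mismatch concrete, here is a minimal sketch that does not use Gnulib's real code. It assumes UTF-8 and uses three hypothetical helpers (next_unit, bytewise_startswith, unitwise_startswith) that stand in for a byte-oriented prefix test and an mbiter-style, unit-by-unit test. The decoder is deliberately simplified: it treats any malformed lead byte as a one-byte encoding error and does not reject overlong forms.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a Gnulib function: return the byte length of
   the next unit of the NUL-terminated UTF-8 string S.  *VALID is set to
   false when the unit is an encoding error; such a unit is one byte.  */
static size_t
next_unit (const char *s, bool *valid)
{
  unsigned char c = (unsigned char) s[0];
  size_t need = (c < 0x80 ? 1
                 : (c & 0xE0) == 0xC0 ? 2
                 : (c & 0xF0) == 0xE0 ? 3
                 : (c & 0xF8) == 0xF0 ? 4
                 : 0);
  if (need == 0)
    {
      *valid = false;
      return 1;
    }
  /* A NUL byte is not a continuation byte, so this never reads past
     the terminator.  */
  for (size_t i = 1; i < need; i++)
    if (((unsigned char) s[i] & 0xC0) != 0x80)
      {
        *valid = false;
        return 1;
      }
  *valid = true;
  return need;
}

/* Byte-oriented test: compares raw bytes only.  */
static bool
bytewise_startswith (const char *string, const char *prefix)
{
  return strncmp (string, prefix, strlen (prefix)) == 0;
}

/* Unit-oriented test: split both arguments into units and require each
   prefix unit to equal the corresponding string unit.  */
static bool
unitwise_startswith (const char *string, const char *prefix)
{
  while (*prefix != '\0')
    {
      bool sv, pv;
      size_t sn, pn;
      if (*string == '\0')
        return false;
      sn = next_unit (string, &sv);
      pn = next_unit (prefix, &pv);
      if (sv != pv || sn != pn || memcmp (string, prefix, sn) != 0)
        return false;
      string += sn;
      prefix += pn;
    }
  return true;
}
```

With the string "\xC3\xA9t\xC3\xA9" and the prefix "\xC3" (the first byte of a two-byte character), the byte-oriented test reports a match while the unit-oriented test does not, because the prefix's only unit is an encoding error and the string's first unit is a valid two-byte character.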
Come to think of it, isn't there a different problem with mbs_startswith? As I recall, mbiter supports GB18030, which has the unfortunate property that an indivisible sequence of encoding bytes stands for two Unicode characters. This means that mbiter needs to parse the sequence and remember the second character while delivering the first; the next time you call mbiter, it makes no progress in the input bytes and delivers the second character.
If such an encoding sequence stands for the characters A B, then mbs_startswith ("AB", "A") will fail if "AB" is represented as an indivisible byte sequence.
(This problem can't happen with mbcel, which does not support such encodings.)
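The second failure mode can be sketched with a toy encoding. Everything here is hypothetical: the encoding (byte 0x80 stands for the two characters A B, every other byte stands for itself), the helper names, and the fixed 64-character buffer are assumptions made for illustration, and none of it models GB18030 or mbiter itself.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy decoder (hypothetical encoding): byte 0x80 yields the two
   characters 'A' 'B'; any other byte yields itself.  Stores at most
   MAX characters in OUT and returns the number stored.  */
static size_t
toy_decode (const char *s, unsigned int *out, size_t max)
{
  size_t n = 0;
  for (; *s != '\0'; s++)
    {
      if ((unsigned char) *s == 0x80)
        {
          if (n + 2 > max)
            break;
          out[n++] = 'A';
          out[n++] = 'B';
        }
      else
        {
          if (n + 1 > max)
            break;
          out[n++] = (unsigned char) *s;
        }
    }
  return n;
}

/* Byte-oriented prefix test: compares raw bytes only.  */
static bool
toy_bytewise_startswith (const char *string, const char *prefix)
{
  return strncmp (string, prefix, strlen (prefix)) == 0;
}

/* Character-oriented prefix test: decode both sides, then compare the
   decoded characters.  Inputs longer than 64 characters are truncated,
   which is acceptable only for a demonstration.  */
static bool
toy_charwise_startswith (const char *string, const char *prefix)
{
  unsigned int sc[64], pc[64];
  size_t sn = toy_decode (string, sc, 64);
  size_t pn = toy_decode (prefix, pc, 64);
  return pn <= sn && memcmp (sc, pc, pn * sizeof pc[0]) == 0;
}
```

Here the string "\x80" decodes to A B, so the character-oriented test accepts the prefix "A", whereas the byte-oriented test rejects it because the byte 0x80 differs from the byte for A.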
Does the latest version of GB18030 still have this unfortunate property? If not, this problem goes away (and perhaps mbiter's support for such encodings can also go away).