After some general performance improvements, I prepared a UTF-16 version [1] of
the tokenizer, using Ms2ger's DOMString with some Rust updates and performance
fixes [2], and benchmarked it against the UTF-8 version on master. Even
excluding the cost of UTF-8 to UTF-16 conversion, the UTF-16 tok[...]

You could use a rope where individual chunks can be either UTF-8 or UCS-2.
UTF-8 strings would also record whether they happen to be 7-bit ASCII, and
UCS-2 strings would record whether they contain any surrogates (and also maybe
whether all surrogates are correctly paired according to UTF-16).
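
As a rough illustration of that idea (not code from any existing library), the chunk type might look something like the sketch below. The names Chunk, Rope, is_ascii, has_surrogates and well_paired are all made up for this example, and the rope is simplified to a flat Vec of chunks rather than a balanced tree.

```rust
/// One chunk of a hypothetical mixed-encoding rope.
/// All names here are illustrative assumptions, not an existing API.
#[derive(Debug, Clone, PartialEq)]
enum Chunk {
    /// UTF-8 text, with a cached flag for the 7-bit ASCII case.
    Utf8 { bytes: String, is_ascii: bool },
    /// 16-bit code units, with cached facts about surrogates.
    Ucs2 {
        units: Vec<u16>,
        has_surrogates: bool,
        /// True when every surrogate is part of a valid UTF-16 pair.
        well_paired: bool,
    },
}

impl Chunk {
    fn from_utf8(s: &str) -> Chunk {
        Chunk::Utf8 { bytes: s.to_owned(), is_ascii: s.is_ascii() }
    }

    fn from_ucs2(units: Vec<u16>) -> Chunk {
        let has_surrogates = units.iter().any(|u| (0xD800..=0xDFFF).contains(u));
        // decode_utf16 yields an Err item for each unpaired surrogate.
        let well_paired = std::char::decode_utf16(units.iter().copied()).all(|r| r.is_ok());
        Chunk::Ucs2 { units, has_surrogates, well_paired }
    }

    /// Length in UTF-16 code units. The ASCII flag lets us skip
    /// re-encoding: one ASCII byte is exactly one UTF-16 code unit.
    fn utf16_len(&self) -> usize {
        match self {
            Chunk::Utf8 { bytes, is_ascii: true } => bytes.len(),
            Chunk::Utf8 { bytes, .. } => bytes.encode_utf16().count(),
            Chunk::Ucs2 { units, .. } => units.len(),
        }
    }
}

/// Simplified rope: a flat list of chunks instead of a tree.
struct Rope {
    chunks: Vec<Chunk>,
}

impl Rope {
    fn utf16_len(&self) -> usize {
        self.chunks.iter().map(Chunk::utf16_len).sum()
    }
}
```

The point of caching the flags is that they are computed once, when a chunk is created, so later operations (length in code units, conversion between encodings, validity checks) can take a fast path without rescanning the data.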