Re: [dev-servo] character encoding in the HTML parser

2014-04-22 Thread Keegan McAllister
After some general performance improvements, I prepared a UTF-16 version [1] of the tokenizer, using Ms2ger's DOMString with some Rust updates and performance fixes [2], and benchmarked it against the UTF-8 version on master. Even excluding the cost of UTF-8 to UTF-16 conversion, the UTF-16 tok

Re: [dev-servo] character encoding in the HTML parser

2014-04-22 Thread Keegan McAllister
You could use a rope where individual chunks can be either UTF-8 or UCS-2. UTF-8 strings would also record whether they happen to be 7-bit ASCII, and UCS-2 strings would record whether they contain any surrogates (and also maybe whether all surrogates are correctly paired according to UTF-16).
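
To make the idea concrete, here is a minimal sketch of what such a rope's chunks might look like in Rust. All names (`Chunk`, `Rope`, `from_utf8`, `from_code_units`) are hypothetical and not taken from Servo or from the tokenizer branches mentioned in this thread. It only shows how the per-chunk flags could be computed once at construction time. It is not a full rope with splitting, concatenation, or balancing.

```rust
use std::char::decode_utf16;

/// One chunk of a hypothetical rope. Illustrative only, not Servo's API.
#[derive(Debug)]
pub enum Chunk {
    /// UTF-8 storage, with a cached flag for the 7-bit ASCII case.
    Utf8 { text: String, is_ascii: bool },
    /// 16-bit code unit storage, with cached flags about surrogates.
    Ucs2 {
        units: Vec<u16>,
        /// True if any code unit lies in the surrogate range D800..DFFF.
        has_surrogates: bool,
        /// True if every surrogate is part of a valid UTF-16 pair.
        well_paired: bool,
    },
}

impl Chunk {
    /// Build a UTF-8 chunk and record whether it is pure ASCII.
    pub fn from_utf8(text: String) -> Chunk {
        let is_ascii = text.is_ascii();
        Chunk::Utf8 { text, is_ascii }
    }

    /// Build a 16-bit chunk and record its surrogate properties.
    pub fn from_code_units(units: Vec<u16>) -> Chunk {
        let has_surrogates = units.iter().any(|u| (0xD800..=0xDFFF).contains(u));
        // decode_utf16 yields an Err item for each unpaired surrogate.
        let well_paired = decode_utf16(units.iter().copied()).all(|r| r.is_ok());
        Chunk::Ucs2 { units, has_surrogates, well_paired }
    }
}

/// The rope itself, reduced here to a flat list of chunks (an assumption;
/// a real rope would more likely be a balanced tree).
#[derive(Debug, Default)]
pub struct Rope {
    pub chunks: Vec<Chunk>,
}
```

With flags like these, a consumer could presumably take a fast path for ASCII-only UTF-8 chunks (where byte index equals character index) or for surrogate-free 16-bit chunks (where code unit index equals code point index), and fall back to a slower path otherwise.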