After some general performance improvements, I prepared a UTF-16 version [1] of the tokenizer, using Ms2ger's DOMString with some Rust updates and performance fixes [2], and benchmarked it against the UTF-8 version on master. Even excluding the cost of UTF-8 to UTF-16 conversion, the UTF-16 tokenizer is about 10% - 20% slower. The exception is documents consisting of (for example) Chinese text with no markup, for which UTF-16 is significantly faster. But real documents contain a lot of ASCII markup, whatever the text language. I measured a 15% slowdown for http://sina.com.cn.
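To make the methodology concrete, here is a minimal, hypothetical harness sketch in Rust showing what "excluding the cost of UTF-8 to UTF-16 conversion" means: the conversion is done before the clock starts, so only tokenization is timed. `bench_tokenizer` and its `tokenize_utf16` parameter are illustrative stand-ins, not the actual benchmark code in the repository.

```rust
use std::time::Instant;

// Hypothetical harness, not the real bench code: the UTF-8 -> UTF-16
// conversion happens before the clock starts, so its cost is excluded and
// only the tokenizer itself is measured.
fn bench_tokenizer<F: FnOnce(&[u16])>(input_utf8: &str, tokenize_utf16: F) {
    let input_utf16: Vec<u16> = input_utf8.encode_utf16().collect();

    let start = Instant::now();
    tokenize_utf16(&input_utf16); // stand-in for the UTF-16 tokenizer in [1]
    let elapsed = start.elapsed();

    println!(
        "tokenized {} UTF-16 code units in {:?}",
        input_utf16.len(),
        elapsed
    );
}
```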
What of jgraham's point that parsing initiated by script is more performance-sensitive? Here the proper comparison is a native UTF-16 parser vs. a conversion [3] from UTF-16 going into the UTF-8 parser. The latter is about 25% - 55% slower on realistic HTML; I tested complete documents as well as small fragments you might get from innerHTML. Tokenizing plain text is much slower with the conversion, but still fast in absolute terms (15 ms for 1 MB of text). So this is a significant penalty to script-initiated parsing today.

But it sounds like there is interest in extending SpiderMonkey to use UTF-8, perhaps alongside UTF-16. Servo seems like a good place to try this out. And we might want a UTF-8 DOM anyway [4]. So I'd like to stick with UTF-8 in the parser. It's also more friendly to non-Servo users; ideally this library will be the standard HTML parser for Rust programs.

I also benchmarked against Hubbub's tokenizer. It's hard to get an exact comparison because Hubbub doesn't copy the strings making up tokens; it just passes pointers into the original input. (We could do this too, but the memory safety story is much more complicated. My hope is that it's ultimately a wash because the strings will necessarily get copied at some point before they go into the DOM.) Even so, we win on some benchmarks: we're 2.8x as fast as Hubbub on webapps.html.

I also experimented with using SSE 4.2 string instructions to accelerate tokenization. In a standalone program [5] that looks for spaces, it gives more than a 10x speedup. But my first attempt [6] to integrate this with the HTML tokenizer produced only a 5% - 7% overall improvement. I think we can push this quite a bit higher by using more of the information from the fast search and by using it in more places. (If you want to understand this code, look at [6] because it's the version with comments. :) These instructions can handle UCS-2 as well, but I didn't try that yet.

keegan

[1] https://github.com/kmcallister/html5/tree/utf16-vec
[2] https://github.com/kmcallister/html5/blob/utf16-vec/src/util/domstring.rs
[3] https://github.com/kmcallister/html5/blob/utf16to8/bench/tokenizer.rs#L63
[4] https://github.com/mozilla/servo/issues/1880
[5] https://gist.github.com/kmcallister/11200458
[6] https://github.com/kmcallister/html5/blob/sse/src/tokenizer/buffer_queue.rs#L182-L252

----- Original Message -----
From: "Keegan McAllister" <[email protected]>
To: "Luke Wagner" <[email protected]>
Cc: "Henri Sivonen" <[email protected]>, "Boris Zbarsky" <[email protected]>, [email protected], "Robert O'Callahan" <[email protected]>, "Eddy Bruel" <[email protected]>
Sent: Tuesday, April 22, 2014 4:15:56 PM
Subject: Re: [dev-servo] character encoding in the HTML parser

You could use a rope where individual chunks can be either UTF-8 or UCS-2. UTF-8 strings would also record whether they happen to be 7-bit ASCII, and UCS-2 strings would record whether they contain any surrogates (and also maybe whether all surrogates are correctly paired according to UTF-16). Together with the lengths stored throughout the rope's tree structure, this would enable efficient indexing by either UCS-2 or Unicode codepoint index. That's a fair bit of complexity and so probably only worth it for large strings built in complicated ways. I don't know whether these are common in the wild.

If you want to get really fancy, I think you could use succinct rank/select dictionaries or wavelet trees to make UTF-8 codepoint-indexable with minimal space overhead. (But I don't actually know much about these data structures!)

keegan
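For concreteness, here is a minimal Rust sketch of the mixed-chunk rope described in the quoted message above. All names (`Chunk`, `Node`, the flag fields) are illustrative assumptions for this sketch; nothing here is taken from Servo or SpiderMonkey code.

```rust
// Illustrative sketch only; not from any existing Servo/SpiderMonkey code.

// One leaf of the rope: either encoding, plus the cheap per-chunk flags.
enum Chunk {
    Utf8 {
        bytes: Vec<u8>,   // well-formed UTF-8
        ascii_only: bool, // if true: byte index == UCS-2 index == codepoint index
    },
    Ucs2 {
        units: Vec<u16>,         // arbitrary 16-bit code units
        has_surrogates: bool,    // if false: unit index == codepoint index
        surrogates_paired: bool, // if true: the units are also valid UTF-16
    },
}

// Internal nodes cache subtree lengths in both index spaces, so indexing by
// UCS-2 position or by codepoint walks the tree in O(log n) and only scans
// inside the one chunk whose flags say a scan is actually required.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
    leaf: Option<Chunk>,
    ucs2_len: usize,      // total UCS-2 length of this subtree
    codepoint_len: usize, // total codepoint count of this subtree
}
```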
----- Original Message -----
From: "Luke Wagner" <[email protected]>
To: "Henri Sivonen" <[email protected]>
Cc: "Boris Zbarsky" <[email protected]>, [email protected], "Robert O'Callahan" <[email protected]>, "Eddy Bruel" <[email protected]>
Sent: Thursday, April 3, 2014 9:02:38 AM
Subject: Re: [dev-servo] character encoding in the HTML parser

Another option we've just been discussing is to lazily compute a flag on the string indicating "contents are 7-bit ascii" that would allow us to use array indexing. I'd expect this to often be true. There are also many cases where we'd eagerly have this flag (atoms produced during parsing, strings converted from numbers, concatenations of 7-bit ascii strings, substrings of 7-bit ascii strings, as a parameter from the embedding, etc.) so that we would be able to avoid much of the overhead of this check. (One could even imagine a background thread that computed 7-bit-ness ;)

----- Original Message -----
> On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan <[email protected]>
> wrote:
> > If we could get the JS engine to use evil-UTF8 with some hack to handle
> > charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
> > string when necessary)
>
> Have we instrumented Gecko to find out what the access patterns are
> like? If the main performance-sensitive access pattern is sequential
> iteration over the string, then instead of storing a UCS-2 copy, we could
> store the next expected UCS-2 index and the next UTF-8 index. charAt
> would then start by checking whether its argument equals the next expected
> UCS-2 index, in which case the read would start at the next UTF-8 index.
>
> --
> Henri Sivonen
> [email protected]
> https://hsivonen.fi/
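For illustration, here is a minimal Rust sketch of the two charAt strategies discussed in the quoted exchange above: a lazily computed 7-bit-ASCII flag that restores O(1) indexing into a UTF-8 buffer, and, when the flag is false, a cache of the next expected UCS-2 index and its UTF-8 byte offset so sequential access stays cheap. `JsStringUtf8` and its fields are invented for the sketch, non-BMP handling is omitted, and this is not SpiderMonkey's (or Servo's) actual string representation.

```rust
use std::cell::Cell;

// Invented for illustration; not SpiderMonkey's actual string type.
struct JsStringUtf8 {
    bytes: Vec<u8>,                 // well-formed UTF-8
    ascii_only: Cell<Option<bool>>, // lazily computed "contents are 7-bit ASCII"
    // Cache for sequential access: the next expected UCS-2 index and the
    // UTF-8 byte offset where that character starts.
    next_ucs2_index: Cell<usize>,
    next_utf8_offset: Cell<usize>,
}

impl JsStringUtf8 {
    fn is_ascii(&self) -> bool {
        // Computed lazily on first use; producers that already know the
        // answer (atoms, number-to-string conversions, concatenations of
        // ASCII strings, ...) could set the flag eagerly instead.
        if let Some(flag) = self.ascii_only.get() {
            return flag;
        }
        let flag = self.bytes.iter().all(|&b| b < 0x80);
        self.ascii_only.set(Some(flag));
        flag
    }

    // charAt-style access by UCS-2 index (non-BMP handling omitted).
    fn char_at(&self, i: usize) -> Option<char> {
        if self.is_ascii() {
            // ASCII fast path: byte index == UCS-2 index.
            return self.bytes.get(i).map(|&b| b as char);
        }
        // Sequential fast path: resume from the cached position if the caller
        // is iterating in order; otherwise rescan from the beginning.
        let (skip, offset) = if i == self.next_ucs2_index.get() {
            (0, self.next_utf8_offset.get())
        } else {
            (i, 0)
        };
        let mut chars = std::str::from_utf8(&self.bytes[offset..]).ok()?.chars();
        let c = chars.nth(skip)?;
        // Remember where the next character starts for the next call.
        self.next_ucs2_index.set(i + 1);
        self.next_utf8_offset.set(self.bytes.len() - chars.as_str().len());
        Some(c)
    }
}
```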

