After some general performance improvements, I prepared a UTF-16 version [1] of the tokenizer, using Ms2ger's DOMString with some Rust updates and performance fixes [2], and benchmarked it against the UTF-8 version on master. Even excluding the cost of UTF-8 to UTF-16 conversion, the UTF-16 tokenizer is about 10% - 20% slower. The exception is documents consisting of (for example) Chinese text with no markup, for which UTF-16 is significantly faster. But real documents contain a lot of ASCII markup, whatever the text language. I measured a 15% slowdown for http://sina.com.cn.
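To make the methodology concrete, here is a minimal, hypothetical harness sketch in Rust showing what "excluding the cost of UTF-8 to UTF-16 conversion" means: the conversion is done before the clock starts, so only tokenization is timed. `bench_tokenizer` and its `tokenize_utf16` parameter are illustrative stand-ins, not the actual benchmark code in the repository.

```rust
use std::time::Instant;

// Hypothetical harness, not the real bench code: the UTF-8 -> UTF-16
// conversion happens before the clock starts, so its cost is excluded and
// only the tokenizer itself is measured.
fn bench_tokenizer<F: FnOnce(&[u16])>(input_utf8: &str, tokenize_utf16: F) {
    let input_utf16: Vec<u16> = input_utf8.encode_utf16().collect();

    let start = Instant::now();
    tokenize_utf16(&input_utf16); // stand-in for the UTF-16 tokenizer in [1]
    let elapsed = start.elapsed();

    println!(
        "tokenized {} UTF-16 code units in {:?}",
        input_utf16.len(),
        elapsed
    );
}
```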
What of jgraham's point that parsing initiated by script is more performance-sensitive? Here the proper comparison is a native UTF-16 parser vs. a conversion [3] from UTF-16 going into the UTF-8 parser. The latter is about 25% - 55% slower on realistic HTML; I tested complete documents as well as small fragments you might get from innerHTML. Tokenizing plain text is much slower with the conversion, but still fast in absolute terms (15 ms for 1 MB of text). So this is a significant penalty to script-initiated parsing today.

But it sounds like there is interest in extending SpiderMonkey to use UTF-8, perhaps alongside UTF-16. Servo seems like a good place to try this out. And we might want a UTF-8 DOM anyway [4]. So I'd like to stick with UTF-8 in the parser. It's also more friendly to non-Servo users; ideally this library will be the standard HTML parser for Rust programs.

I also benchmarked against Hubbub's tokenizer. It's hard to get an exact comparison because Hubbub doesn't copy the strings making up tokens; it just passes pointers into the original input. (We could do this too, but the memory safety story is much more complicated. My hope is that it's ultimately a wash because the strings will necessarily get copied at some point before they go into the DOM.) Even so, we win on some benchmarks: we're 2.8x as fast as Hubbub on webapps.html.

I also experimented with using SSE 4.2 string instructions to accelerate tokenization. In a standalone program [5] that looks for spaces, it gives more than a 10x speedup. But my first attempt [6] to integrate this with the HTML tokenizer produced only a 5% - 7% overall improvement. I think we can push this quite a bit higher by using more of the information from the fast search and by using it in more places. (If you want to understand this code, look at [6] because it's the version with comments. :) These instructions can handle UCS-2 as well, but I didn't try that yet.

keegan

[1] https://github.com/kmcallister/html5/tree/utf16-vec
[2] https://github.com/kmcallister/html5/blob/utf16-vec/src/util/domstring.rs
[3] https://github.com/kmcallister/html5/blob/utf16to8/bench/tokenizer.rs#L63
[4] https://github.com/mozilla/servo/issues/1880
[5] https://gist.github.com/kmcallister/11200458
[6] https://github.com/kmcallister/html5/blob/sse/src/tokenizer/buffer_queue.rs#L182-L252

----- Original Message -----
From: "Keegan McAllister" <[email protected]>
To: "Luke Wagner" <[email protected]>
Cc: "Henri Sivonen" <[email protected]>, "Boris Zbarsky" <[email protected]>, [email protected], "Robert O'Callahan" <[email protected]>, "Eddy Bruel" <[email protected]>
Sent: Tuesday, April 22, 2014 4:15:56 PM
Subject: Re: [dev-servo] character encoding in the HTML parser

You could use a rope where individual chunks can be either UTF-8 or UCS-2. UTF-8 strings would also record whether they happen to be 7-bit ASCII, and UCS-2 strings would record whether they contain any surrogates (and also maybe whether all surrogates are correctly paired according to UTF-16). Together with the lengths stored throughout the rope's tree structure, this would enable efficient indexing by either UCS-2 or Unicode codepoint index. That's a fair bit of complexity and so probably only worth it for large strings built in complicated ways. I don't know whether these are common in the wild.

If you want to get really fancy, I think you could use succinct rank/select dictionaries or wavelet trees to make UTF-8 codepoint-indexable with minimal space overhead. (But I don't actually know much about these data structures!)

keegan
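For concreteness, here is a minimal Rust sketch of the mixed-chunk rope described in the quoted message above. All names (`Chunk`, `Node`, the flag fields) are illustrative assumptions for this sketch; nothing here is taken from Servo or SpiderMonkey code.

```rust
// Illustrative sketch only; not from any existing Servo/SpiderMonkey code.

// One leaf of the rope: either encoding, plus the cheap per-chunk flags.
enum Chunk {
    Utf8 {
        bytes: Vec<u8>,   // well-formed UTF-8
        ascii_only: bool, // if true: byte index == UCS-2 index == codepoint index
    },
    Ucs2 {
        units: Vec<u16>,         // arbitrary 16-bit code units
        has_surrogates: bool,    // if false: unit index == codepoint index
        surrogates_paired: bool, // if true: the units are also valid UTF-16
    },
}

// Internal nodes cache subtree lengths in both index spaces, so indexing by
// UCS-2 position or by codepoint walks the tree in O(log n) and only scans
// inside the one chunk whose flags say a scan is actually required.
struct Node {
    left: Option<Box<Node>>,
    right: Option<Box<Node>>,
    leaf: Option<Chunk>,
    ucs2_len: usize,      // total UCS-2 length of this subtree
    codepoint_len: usize, // total codepoint count of this subtree
}
```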
----- Original Message -----
From: "Luke Wagner" <[email protected]>
To: "Henri Sivonen" <[email protected]>
Cc: "Boris Zbarsky" <[email protected]>, [email protected], "Robert O'Callahan" <[email protected]>, "Eddy Bruel" <[email protected]>
Sent: Thursday, April 3, 2014 9:02:38 AM
Subject: Re: [dev-servo] character encoding in the HTML parser

Another option we've just been discussing is to lazily compute a flag on the string indicating "contents are 7-bit ascii" that would allow us to use array indexing. I'd expect this to often be true. There are also many cases where we'd eagerly have this flag (atoms produced during parsing, strings converted from numbers, concatenations of 7-bit ascii strings, substrings of 7-bit ascii strings, as a parameter from the embedding, etc.) so that we would be able to avoid much of the overhead of this check. (One could even imagine a background thread that computed 7-bit-ness ;)

----- Original Message -----
> On Wed, Apr 2, 2014 at 4:25 PM, Robert O'Callahan <[email protected]>
> wrote:
> > If we could get the JS engine to use evil-UTF8 with some hack to handle
> > charAt and friends efficiently (e.g. tacking on a UCS-2 version of the
> > string when necessary)
>
> Have we instrumented Gecko to find out what the access patterns are
> like? If the main performance-sensitive access pattern is sequential
> iteration over the string, then instead of storing a UCS-2 copy, we could
> store the next expected UCS-2 index and the next UTF-8 index. charAt
> would then start by checking whether its argument equals the next expected
> UCS-2 index, in which case the read would start at the next UTF-8 index.
>
> --
> Henri Sivonen
> [email protected]
> https://hsivonen.fi/
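For illustration, here is a minimal Rust sketch of the two charAt strategies discussed in the quoted exchange above: a lazily computed 7-bit-ASCII flag that restores O(1) indexing into a UTF-8 buffer, and, when the flag is false, a cache of the next expected UCS-2 index and its UTF-8 byte offset so sequential access stays cheap. `JsStringUtf8` and its fields are invented for the sketch, non-BMP handling is omitted, and this is not SpiderMonkey's (or Servo's) actual string representation.

```rust
use std::cell::Cell;

// Invented for illustration; not SpiderMonkey's actual string type.
struct JsStringUtf8 {
    bytes: Vec<u8>,                 // well-formed UTF-8
    ascii_only: Cell<Option<bool>>, // lazily computed "contents are 7-bit ASCII"
    // Cache for sequential access: the next expected UCS-2 index and the
    // UTF-8 byte offset where that character starts.
    next_ucs2_index: Cell<usize>,
    next_utf8_offset: Cell<usize>,
}

impl JsStringUtf8 {
    fn is_ascii(&self) -> bool {
        // Computed lazily on first use; producers that already know the
        // answer (atoms, number-to-string conversions, concatenations of
        // ASCII strings, ...) could set the flag eagerly instead.
        if let Some(flag) = self.ascii_only.get() {
            return flag;
        }
        let flag = self.bytes.iter().all(|&b| b < 0x80);
        self.ascii_only.set(Some(flag));
        flag
    }

    // charAt-style access by UCS-2 index (non-BMP handling omitted).
    fn char_at(&self, i: usize) -> Option<char> {
        if self.is_ascii() {
            // ASCII fast path: byte index == UCS-2 index.
            return self.bytes.get(i).map(|&b| b as char);
        }
        // Sequential fast path: resume from the cached position if the caller
        // is iterating in order; otherwise rescan from the beginning.
        let (skip, offset) = if i == self.next_ucs2_index.get() {
            (0, self.next_utf8_offset.get())
        } else {
            (i, 0)
        };
        let mut chars = std::str::from_utf8(&self.bytes[offset..]).ok()?.chars();
        let c = chars.nth(skip)?;
        // Remember where the next character starts for the next call.
        self.next_ucs2_index.set(i + 1);
        self.next_utf8_offset.set(self.bytes.len() - chars.as_str().len());
        Some(c)
    }
}
```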

