> Unfortunately I’d be less surprised if someone relies on having the two 
> halves of a surrogate pair in separate document.write() calls, as this 
> seems more interoperable:
>
> data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>

The tokenizer's input is a queue of buffers, and I'm imagining that 
document.write will insert a new buffer into that queue (at the script's 
insertion point) without modifying existing buffers.  In that case we can track 
an optional trailing high surrogate and/or leading low surrogate for each 
buffer, with the rest of the content as UTF-8.  This is a hack but seems pretty 
straightforward.
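
To make the hack concrete, here's a minimal sketch of what I have in mind; the type and function names are illustrative, not anything that exists in the tree yet. Each queued buffer keeps a well-formed UTF-8 interior plus an optional unpaired surrogate half at either end, and when two buffers become adjacent we try to pair the halves into one supplementary-plane character:

```rust
// Hypothetical sketch: a tokenizer input buffer whose interior is
// UTF-8, with unpaired surrogate halves tracked out-of-band.
struct InputBuffer {
    leading_low: Option<u16>,   // unpaired low surrogate at the start
    content: String,            // well-formed UTF-8 interior
    trailing_high: Option<u16>, // unpaired high surrogate at the end
}

// When `next` is queued directly after `prev`, a trailing high
// surrogate and a leading low surrogate combine into one code point
// in U+10000..=U+10FFFF via the usual UTF-16 decoding formula.
fn join(prev: &mut InputBuffer, next: &mut InputBuffer) {
    if let (Some(hi), Some(lo)) = (prev.trailing_high, next.leading_low) {
        let cp = 0x10000 + (((hi as u32 - 0xD800) << 10) | (lo as u32 - 0xDC00));
        prev.content.push(char::from_u32(cp).unwrap());
        prev.trailing_high = None;
        next.leading_low = None;
    }
}
```

With the document.write example above, the first buffer ends with a pending 0xD83D and the second starts with a pending 0xDCA9; joining them yields U+1F4A9 in the first buffer's UTF-8 content.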

At that point, are there any remaining compat issues?  Replacing truly lone 
surrogates with U+FFFD does seem like an acceptable deviation from the spec, 
but maybe we want to avoid any deviation at all.


> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8 
> without the artificial restriction of not encoding surrogates.

We'd lose the ability to use Rust's native ~str, so I think it's not worth it 
unless we can demonstrate a significant performance win from a smaller, 
ASCII-compatible encoding.
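
For reference, "evil UTF-8" is just the ordinary three-byte UTF-8 pattern applied to a surrogate code point, which strict UTF-8 forbids. A sketch (the function name is mine, not a real API):

```rust
// Encode a surrogate code point (U+D800..=U+DFFF) with the standard
// three-byte UTF-8 pattern: 1110xxxx 10xxxxxx 10xxxxxx.  Strict UTF-8
// rejects these sequences; "evil UTF-8" permits them.
fn encode_surrogate(cp: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&cp));
    [
        0xE0 | (cp >> 12) as u8,
        0x80 | ((cp >> 6) & 0x3F) as u8,
        0x80 | (cp & 0x3F) as u8,
    ]
}
```

So a lone U+D83D would round-trip as the byte sequence ED A0 BD, which Rust's ~str would reject as invalid UTF-8.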


> I believe that in load-type situations the parser is almost never
> the performance limiting factor...
> On the other hand I think that there exist pages which do things 
> like run innerHTML in performance-critical loops. Therefore I expect it 
> makes more sense for the parser to operate on the same string type as 
> the DOM than the same type as the network data.

Good point.  Even if the DOM is a mix of UTF-8 lazily converted to UCS-2, the 
argument to document.write or the innerHTML setter is a JS string.

I was thinking about network data as the bulk of the parser input by volume, 
but you're right that performance requirements differ between net and script.  
Also I've worked on webapps that produced the majority of their HTML from 
client-side templates in JS; this may be fairly common by now.

Since the tokenizer is roughly complete and passing tests, perhaps I'll 
prepare a branch which uses UTF-16 strings so I can compare performance.  I 
haven't profiled or optimized the tokenizer at all yet, so I'll need to do 
some of that first to get meaningful results.

It might be that a UTF-8 parser is no faster, but still worth it for the 
optimization of keeping DOM strings (especially long text nodes) as UTF-8 until 
touched by script.  I don't know how we would easily test that at this stage.


Much as I'd like to use UTF-8 on principle, it does seem doubtful that 
practical benefits will justify shoehorning it into a platform which is 
basically incompatible :/

keegan
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
