> Unfortunately I’d be less surprised if someone relies on having the two
> halves of a surrogate pair in separate document.write() call, as this
> seems more interoperable:
>
> data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>
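(Aside: the two escapes in that URL are the halves of one astral
character, U+1F4A9, which only exists once both writes have happened. A
rough sketch of the recombination, in case it helps; this isn't anything
in the tree yet:)

    // Rough sketch only: recombine a surrogate pair that arrived as two
    // halves, e.g. from two separate document.write() calls.
    fn combine_surrogates(high: u16, low: u16) -> Option<char> {
        if (0xD800..=0xDBFF).contains(&high) && (0xDC00..=0xDFFF).contains(&low) {
            char::from_u32(0x10000 + ((high as u32 - 0xD800) << 10)
                                   + (low as u32 - 0xDC00))
        } else {
            None
        }
    }

    fn main() {
        // The two escapes in the data: URL above together form U+1F4A9.
        assert_eq!(combine_surrogates(0xD83D, 0xDCA9), Some('\u{1F4A9}'));
    }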
The tokenizer's input is a queue of buffers, and I'm imagining that
document.write will insert a new buffer into that queue (at the script's
insertion point) without modifying existing buffers. In that case we can
track an optional trailing high surrogate and/or leading low surrogate for
each buffer, with the rest of the content as UTF-8. This is a hack but
seems pretty straightforward (rough sketch in the P.S. below). At that
point, are there any remaining compat issues? It does seem like replacing
truly lone surrogates with U+FFFD would be an acceptable deviation from
the spec, but maybe we want to avoid any deviation at all.

> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
> without the artificial restriction of not encoding surrogates.

We'd lose the ability to use Rust's native ~str, so I think it's not worth
it unless we can demonstrate a significant performance win from a smaller,
ASCII-compatible encoding.

> I believe that in load-type situations the parser is almost never
> the performance limiting factor...
>
> On the other hand I think that there exist pages which do things
> like run innerHTML in performance-critical loops. Therefore I expect it
> makes more sense for the parser to operate on the same string type as
> the DOM than the same type as the network data.

Good point. Even if the DOM is a mix of UTF-8 lazily converted to UCS-2,
the argument to document.write or the innerHTML setter is a JS string. I
was thinking of network data as the bulk of the parser input by volume,
but you're right that the performance requirements differ between network
and script input. Also, I've worked on webapps that produced the majority
of their HTML from client-side templates in JS; this may be fairly common
by now.

Since the tokenizer is ≈complete and passing tests, perhaps I'll prepare a
branch which uses UTF-16 strings so I can compare performance. I haven't
profiled or optimized the tokenizer at all yet, so I'll need to do some of
that first to get relevant results.

It might be that a UTF-8 parser is no faster, but still worth it for the
optimization of keeping DOM strings (especially long text nodes) as UTF-8
until they're touched by script. I don't know how we would easily test
that at this stage.

Much as I'd like to use UTF-8 on principle, it does seem doubtful that the
practical benefits will justify shoehorning it into a platform which is
basically incompatible :/

keegan
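P.S. Here's roughly what I mean by tracking dangling surrogate halves per
buffer. This is only a sketch with made-up names, not the tokenizer's
actual types:

    // Hypothetical bookkeeping for one buffer in the input queue. The
    // text itself stays UTF-8; an unpaired surrogate half at either end
    // is remembered separately so it can pair up across a boundary.
    struct InputBuffer {
        text: String,               // complete scalar values only
        leading_low: Option<u16>,   // dangling low surrogate at the start
        trailing_high: Option<u16>, // dangling high surrogate at the end
    }

    // At a buffer boundary, a trailing high surrogate pairs with the next
    // buffer's leading low surrogate; a truly lone half becomes U+FFFD.
    fn join_boundary(prev: &InputBuffer, next: &InputBuffer) -> Vec<char> {
        match (prev.trailing_high, next.leading_low) {
            (Some(h), Some(l)) => {
                let cp = 0x10000 + ((h as u32 - 0xD800) << 10) + (l as u32 - 0xDC00);
                vec![char::from_u32(cp).unwrap_or('\u{FFFD}')]
            }
            (Some(_), None) | (None, Some(_)) => vec!['\u{FFFD}'],
            (None, None) => vec![],
        }
    }

    fn main() {
        // The split pair from the document.write example recombines here.
        let a = InputBuffer { text: String::new(), leading_low: None,
                              trailing_high: Some(0xD83D) };
        let b = InputBuffer { text: String::new(), leading_low: Some(0xDCA9),
                              trailing_high: None };
        assert_eq!(join_boundary(&a, &b), vec!['\u{1F4A9}']);
    }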