On 2012-10-10 9:03 PM, Johnny Stenback wrote:
Hey Henri,
On 10/10/2012 5:51 AM, Henri Sivonen wrote:
I am researching/prototyping a translation of the same HTML parser we
use in Gecko into Rust for use in Servo. Should the HTML parser in
Servo operate on UTF-8, UTF-16 or CESU-8? What will the DOM in Servo
use internally?
Excellent question. This is something that has been discussed before
(there was a discussion at the previous DOM/WebAPI work week about
this), and it's my personal belief that having the DOM use anything
other than what JS requires, i.e. UTF-16 (or UCS-2, really), is a
non-starter from a performance point of view. As you point out below,
having to copy string data on every string access from JS would kill us
on performance. Ironically, we often do end up doing that conversion in
the current DOM implementation, but AFAIK that's fairly isolated to
accessing the text value of a DOM text node. I'd argue that that's a
relatively rare operation compared to accessing property/attribute
values during tree iteration etc., which I think is where it matters.
I'm not seeing why the JS engine has to use any particular
representation internally just because JS's exposed semantics are
defined in terms of UCS-2. In fact, storing strings internally in UTF-8
and mapping them to UCS-2 only when it actually makes a difference to
the program might well be *faster* than storing everything internally in
UTF-16. And it would pave the way to providing alternative JS
functionality that doesn't expose the nastier parts of UCS-2 (e.g.
surrogate pairs).
(This presumes that part of Servo involves reimplementing the JS engine
in Rust anyway. Without that, I can imagine this being far too much
work to bother with :)
zw
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo