Hey Henri,

On 10/10/2012 5:51 AM, Henri Sivonen wrote:
> I am researching/prototyping a translation of the same HTML parser we
> use in Gecko into Rust for use in Servo. Should the HTML parser in
> Servo operate on UTF-8, UTF-16 or CESU-8? What will the DOM in Servo
> use internally?

Excellent question. This is something that has been discussed before
(there was a discussion at the previous DOM/WebAPI work week about
this), and it's my personal belief that having the DOM use anything
other than what JS requires, i.e. UTF-16 (or UCS-2, really), is a
non-starter from a performance point of view. As you point out below,
having to copy string data on every string access from JS would kill us
on performance. Ironically, we often do end up doing that conversion in
the current DOM implementation, but AFAIK that's fairly isolated to
accessing the text value of DOM text nodes. I'd argue that that's a
relatively rare operation compared to accessing property/attribute
values during tree iteration etc. That, I think, is where it matters.
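
To illustrate the cost I'm worried about, here's a rough sketch (types
and names made up) of what a UTF-8-backed DOM would have to do on every
string access from JS:

  /// Hypothetical sketch of the JS boundary cost: if the DOM stored
  /// UTF-8 but JS wants UTF-16, every property access pays a full
  /// re-encode plus a fresh allocation.
  fn to_js_string(dom_value: &str) -> Vec<u16> {
      // O(n) scan and a new buffer on *every* access from JS.
      dom_value.encode_utf16().collect()
  }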

> It appears that currently Servo uses UTF-8 contained in Rust strings
> internally, but is that going to stay? After all, the Web-exposed
> encoding of the DOM and JS is UTF-16. Are we willing to pay the
> conversion cost at the Rust/JS boundary in order to have the saner
> internal representation? (In general, the current DOM looks rather
> tentative. It doesn't seem to support e.g. Namespaces.)

Very true, not much in terms of an actual DOM implementation in Servo
yet. What we have right now is mostly a proof of concept that's fleshed
out only enough to prove the RCU implementation of the DOM and the
interactions between the DOM task and the layout task(s). Lots of work
remains there.

> Is there a plan to use a special interned string type for identifiers
> the way we use nsIAtom* in Gecko?

I would imagine that we will use something like nsINodeInfo, and
something like nsIAtom, as they not only help with memory use but also
with performance when doing selector matching etc.
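
For the curious, something along these lines is what I mean by an
nsIAtom-style interned string. This is just a rough single-threaded
sketch (the real thing would be refcounted and shared across tasks):

  use std::collections::HashMap;

  /// An atom is an index into the intern table, so equality checks
  /// (e.g. during selector matching) are a single integer compare.
  #[derive(Clone, Copy, PartialEq, Eq, Hash)]
  struct Atom(u32);

  #[derive(Default)]
  struct AtomTable {
      map: HashMap<String, Atom>,
      strings: Vec<String>,
  }

  impl AtomTable {
      /// Return the existing atom for s, or intern a new one.
      fn intern(&mut self, s: &str) -> Atom {
          if let Some(&a) = self.map.get(s) {
              return a;
          }
          let a = Atom(self.strings.len() as u32);
          self.strings.push(s.to_string());
          self.map.insert(s.to_string(), a);
          a
      }

      fn as_str(&self, a: Atom) -> &str {
          &self.strings[a.0 as usize]
      }
  }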

> Should I expect data from the character encoding converters to be in a
> specific kind of array as opposed to being in Rust strings? If the
> parser should operate on UTF-8, it should probably still operate on
> bytes instead of characters to avoid the cost of iterating over a Rust
> string by character rather than by byte considering that the HTML
> parsing algorithm is designed in such a way that all decisions in the
> tokenizer can be made by looking at values in the ASCII range only.

Good questions. I think it'd be worthwhile doing some experimentation
and measurement here to see if there are clear benefits to parsing
UTF-8 vs. UTF-16 data, especially on relatively slow mobile hardware. At
*some* point, some string data will likely need to be converted to
UTF-16, but it's not clear to me whether that means we need our charset
converters to produce UTF-16 and the parser to parse UTF-16. I'd be
very interested in seeing performance numbers for parsing UTF-8 vs.
UTF-16, again especially on slow mobile hardware. The CPUs there have
much smaller caches etc., so having half as much memory to parse over
(obviously depending on the data in question) seems like it could help
speed things up.
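
To make Henri's point above concrete: in UTF-8, bytes >= 0x80 only ever
occur inside multi-byte sequences, so they can never be mistaken for the
ASCII characters ('<', '>', '&', '"', ...) the tokenizer dispatches on.
A sketch of what byte-wise scanning could look like:

  /// Find the next tag-open character, looking at raw bytes only.
  fn next_tag_open(input: &[u8], mut pos: usize) -> Option<usize> {
      while pos < input.len() {
          // All tokenizer decisions use ASCII-range values; non-ASCII
          // bytes (0x80..=0xFF) can never equal '<' (0x3C).
          if input[pos] == b'<' {
              return Some(pos);
          }
          pos += 1;
      }
      None
  }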

In Gecko we do crazy things today where we convert incoming bytes to
UTF-16 in our charset converters, parse the UTF-16, and for text data
that ends up in DOM text nodes we inspect the data once more to check
whether or not it's ASCII-only, and if so convert it back to ASCII and
store that. I would like to have a saner system in Servo if at all
possible. (The primary reason for this, back in the day, was that ASCII
text rendering was much faster using the single-byte APIs of various
OSes etc.; I have no idea whether that's still the case.)
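
That second inspection looks roughly like this (a sketch, not the
actual Gecko code):

  /// Scan UTF-16 text once more and, if every code unit is ASCII,
  /// produce a compact 8-bit copy to store in the text node.
  fn compact_if_ascii(utf16: &[u16]) -> Option<Vec<u8>> {
      if utf16.iter().all(|&u| u < 0x80) {
          Some(utf16.iter().map(|&u| u as u8).collect())
      } else {
          None
      }
  }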

One possibility could be to have our charset converters produce UTF-8,
have the parser parse that, and convert attribute values etc. to UTF-16
as it creates attributes. As for text nodes, we should investigate what
makes the most sense there with modern OSes and their text rendering.
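
A sketch of that flow (names are hypothetical):

  /// The converter hands the parser UTF-8; only values that must live
  /// in a UTF-16 DOM string get re-encoded, at attribute-creation time.
  fn make_attribute(name: &str, value: &str) -> (String, Vec<u16>) {
      let value_utf16: Vec<u16> = value.encode_utf16().collect();
      (name.to_string(), value_utf16)
  }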

> (I assume that we should keep the same off the main thread parsing
> design as in Gecko to avoid regressing parallelism on a point where
> there is parallelism in Gecko.)

Yes, that's definitely the starting point that I had envisioned here,
i.e. run the parser in its own Rust task, letting it create more tasks
for speculative parsing etc. as needed. And keep in mind that there will
likely be more than one DOM task too, i.e. the notion of the "main
thread" will likely not apply in Servo.

Once we get to that point, we could look into more exotic parser
parallelization schemes etc, but it's questionable whether that's even
worth it. It's certainly not high on my list of priorities here.

-- 
jst