[dev-servo] HTML parser-related datatypes

Henri Sivonen Wed, 10 Oct 2012 05:51:34 -0700

I am researching/prototyping a translation of the same HTML parser we
use in Gecko into Rust for use in Servo. Should the HTML parser in
Servo operate on UTF-8, UTF-16 or CESU-8? What will the DOM in Servo
use internally?


It appears that currently Servo uses UTF-8 contained in Rust strings
internally, but is that going to stay? After all, the Web-exposed
encoding of the DOM and JS is UTF-16. Are we willing to pay the
conversion cost at the Rust/JS boundary in order to have the saner
internal representation? (In general, the current DOM looks rather
tentative. For example, it doesn't seem to support namespaces.)
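
To make the conversion cost at the boundary concrete, here is a minimal
sketch in present-day Rust. The function names are hypothetical and not
part of Servo; only the standard library calls `str::encode_utf16` and
`String::from_utf16` are real. Each crossing allocates and walks the
whole string, and UTF-16 coming from JS may contain unpaired surrogates
that have no UTF-8 representation.

```rust
/// Hypothetical boundary helper: Rust (UTF-8) to JS (UTF-16).
/// Allocates a new buffer and re-encodes every character.
fn to_utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// Hypothetical boundary helper: JS (UTF-16) to Rust (UTF-8).
/// Returns None when the input contains an unpaired surrogate,
/// which valid UTF-8 cannot represent.
fn from_utf16(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}
```

A character outside the Basic Multilingual Plane takes four bytes in
UTF-8 and two code units in UTF-16, so lengths and indices do not map
one-to-one across the boundary either.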

Is there a plan to use a special interned string type for identifiers
the way we use nsIAtom* in Gecko?
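
As an illustration of what I mean by an interned string type, here is a
minimal sketch in present-day Rust. `Atom` and `Interner` are
hypothetical names, not existing Servo or Gecko types, and the sketch
ignores thread safety and reclaiming unused entries, both of which a
real design would have to address.

```rust
use std::collections::HashMap;

/// Hypothetical interned identifier: comparing two atoms is a
/// single integer comparison instead of a string comparison.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct Atom(u32);

/// Hypothetical single-threaded intern table.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, Atom>,
    names: Vec<String>,
}

impl Interner {
    /// Return the existing atom for `s`, or allocate a new one.
    fn intern(&mut self, s: &str) -> Atom {
        if let Some(&atom) = self.ids.get(s) {
            return atom;
        }
        let atom = Atom(self.names.len() as u32);
        self.names.push(s.to_owned());
        self.ids.insert(s.to_owned(), atom);
        atom
    }

    /// Look up the string an atom was created from.
    fn resolve(&self, atom: Atom) -> &str {
        &self.names[atom.0 as usize]
    }
}
```

Interning "div" twice yields the same `Atom`, so the tree builder could
compare element names without touching the string data.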

Should I expect data from the character encoding converters to be in a
specific kind of array as opposed to being in Rust strings? If the
parser should operate on UTF-8, it should probably still operate on
bytes instead of characters. Iterating over a Rust string by character
costs more than iterating by byte, and the HTML parsing algorithm is
designed in such a way that all decisions in the tokenizer can be made
by looking at values in the ASCII range only.
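
A minimal sketch of byte-wise scanning over UTF-8, in present-day Rust.
The function names and the particular set of bytes are assumptions for
illustration, not the actual tokenizer. The sketch relies on one
property of UTF-8: every byte of a multi-byte sequence is 0x80 or
above, so an ASCII byte found in the buffer is always a whole character
and always sits on a character boundary.

```rust
/// Hypothetical helper: find the next byte at or after `start`
/// that a data-state-like scan would stop on. No decoding is done.
fn next_significant(input: &[u8], start: usize) -> Option<(usize, u8)> {
    let rest = input.get(start..)?;
    rest.iter()
        .position(|&b| matches!(b, b'<' | b'&' | b'\0'))
        .map(|offset| (start + offset, rest[offset]))
}

/// Hypothetical helper: return the run of text before the first
/// significant byte. Slicing at the found index cannot split a
/// character, because the index points at an ASCII byte.
fn text_run(input: &str) -> &str {
    match next_significant(input.as_bytes(), 0) {
        Some((index, _)) => &input[..index],
        None => input,
    }
}
```

Non-ASCII text passes through the scan untouched, so the tokenizer
would only need to copy or reference byte ranges rather than decode
code points.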

(I assume that we should keep the same off-the-main-thread parsing
design as in Gecko, so that we don't lose parallelism that Gecko
already has.)

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/