I am researching/prototyping a translation of the same HTML parser we use in Gecko into Rust for use in Servo. Should the HTML parser in Servo operate on UTF-8, UTF-16 or CESU-8? What will the DOM in Servo use internally?
It appears that Servo currently uses UTF-8 contained in Rust strings internally, but is that going to stay? After all, the Web-exposed encoding of the DOM and JS is UTF-16. Are we willing to pay the conversion cost at the Rust/JS boundary in order to have the saner internal representation? (In general, the current DOM looks rather tentative. It doesn't seem to support namespaces, for example.)

Is there a plan to use a special interned string type for identifiers the way we use nsIAtom* in Gecko?

Should I expect data from the character encoding converters to be in a specific kind of array as opposed to being in Rust strings?

If the parser should operate on UTF-8, it should probably still operate on bytes instead of characters, to avoid the cost of iterating over a Rust string by character rather than by byte. The HTML parsing algorithm is designed in such a way that all decisions in the tokenizer can be made by looking at values in the ASCII range only.

(I assume that we should keep the same off-the-main-thread parsing design as in Gecko to avoid regressing parallelism on a point where there is parallelism in Gecko.)

-- 
Henri Sivonen
hsivo...@iki.fi
http://hsivonen.iki.fi/
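
P.S. To make the bytes-versus-characters point above concrete, here is a minimal sketch of the idea, not actual parser code. The function names `next_interesting` and `text_runs` are hypothetical, and the set of bytes treated as interesting ('<', '&' and NUL) is a simplified assumption about what a tokenizer's data state reacts to. The sketch relies on one property of UTF-8: bytes in the ASCII range never occur inside a multi-byte sequence, so a scan over raw bytes always stops on a character boundary.

```rust
/// Hypothetical helper: return the index of the next byte at or after
/// `start` that the tokenizer has to react to. Assumed here to be
/// '<', '&' or NUL; every other byte is plain text.
/// Non-ASCII bytes (>= 0x80) can never match, so no UTF-8 decoding is needed.
fn next_interesting(input: &[u8], start: usize) -> Option<usize> {
    input[start..]
        .iter()
        .position(|&b| matches!(b, b'<' | b'&' | 0))
        .map(|i| start + i)
}

/// Hypothetical helper: split `input` into the runs of text that lie
/// between interesting bytes, scanning by byte rather than by character.
fn text_runs(input: &str) -> Vec<&str> {
    let bytes = input.as_bytes();
    let mut runs = Vec::new();
    let mut pos = 0;
    while pos < bytes.len() {
        match next_interesting(bytes, pos) {
            Some(i) => {
                if i > pos {
                    // `pos` is 0 or one past an ASCII byte, and `i` is the
                    // index of an ASCII byte, so both are character
                    // boundaries and slicing the &str cannot panic.
                    runs.push(&input[pos..i]);
                }
                // Skip the single interesting ASCII byte.
                pos = i + 1;
            }
            None => {
                // No more interesting bytes: the rest is one text run.
                runs.push(&input[pos..]);
                break;
            }
        }
    }
    runs
}
```

Calling `text_runs("äiti<b>日本語&amp;")` splits the input at '<' and '&' without ever decoding the non-ASCII characters. A real tokenizer would of course switch states at those bytes instead of merely splitting on them.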