[dev-servo] character encoding in the HTML parser

Keegan McAllister Mon, 10 Mar 2014 16:56:31 -0700

Should we implement character encoding detection [1] at the same time as the 
rest of the HTML parser?  It seems to be separable; the only design 
interactions I see are:

- The character decoder and script APIs can write into the same input stream
- The encoding can change during parsing [2], which can be handled as a
mostly-normal navigation

Also, should we follow Gecko in representing the input stream as a queue of
UTF-16 buffers? With UTF-8 we would have about half as much data to stream
through the parser, and (on 64-bit) we could do case-insensitive ASCII
operations 8 characters at a time.
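To make the "8 characters at a time" idea concrete, here is a minimal sketch of ASCII lowercasing on a UTF-8 buffer, one 64-bit word per step. The names `lower_word` and `make_ascii_lowercase_swar` are made up for illustration and are not Servo or Gecko APIs. It assumes that any 8-byte chunk containing a non-ASCII byte can simply fall back to the byte-at-a-time path.

```rust
const ONES: u64 = 0x0101_0101_0101_0101;
const HIGH: u64 = 0x8080_8080_8080_8080;

/// Lowercase eight ASCII bytes packed into one word.
/// Precondition: every byte in `w` is below 0x80.
fn lower_word(w: u64) -> u64 {
    debug_assert!((w & HIGH) == 0);
    // Adding (0x80 - 'A') sets a byte's high bit exactly when that byte >= 'A'.
    // No byte can carry into its neighbour because 0x7F + 0x3F < 0x100.
    let ge_a = w + ONES * (0x80 - b'A' as u64);
    // Likewise, the high bit is set exactly when the byte > 'Z'.
    let gt_z = w + ONES * (0x80 - (b'Z' as u64 + 1));
    // High bit marks the bytes that lie in 'A'..='Z'.
    let is_upper = ge_a & !gt_z & HIGH;
    // Move the 0x80 marker down to 0x20, the ASCII case bit, and set it.
    w | (is_upper >> 2)
}

/// Lowercase ASCII letters in place, leaving all other bytes untouched.
fn make_ascii_lowercase_swar(buf: &mut [u8]) {
    let mut chunks = buf.chunks_exact_mut(8);
    for chunk in &mut chunks {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(chunk);
        let w = u64::from_le_bytes(bytes);
        if (w & HIGH) == 0 {
            // Fast path: all eight bytes are ASCII.
            chunk.copy_from_slice(&lower_word(w).to_le_bytes());
        } else {
            // Slow path: the chunk contains part of a multi-byte sequence.
            chunk.make_ascii_lowercase();
        }
    }
    // Fewer than eight bytes left over.
    chunks.into_remainder().make_ascii_lowercase();
}
```

With UTF-16 buffers the same word would hold only four code units, which is where the factor of two comes from.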

Most content [3] is UTF-8, and the trend is in that direction. But for
non-UTF-8 content we would pay the price of two conversions, if the ultimate
product is UCS-2 DOM strings. However, I don't think we've finalized our
representation of DOM strings, and we might do the UCS-2 conversion lazily [4].
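As a rough illustration of what a lazy conversion could look like, the sketch below stores a string as UTF-8 and only builds the UTF-16 form the first time something asks for it. `DomString` and `as_utf16` are hypothetical names; this is not the design discussed in [4], just one possible shape, and it assumes a Rust toolchain recent enough to have `std::cell::OnceCell`.

```rust
use std::cell::OnceCell;

/// Hypothetical DOM string: UTF-8 is the primary representation,
/// and the UTF-16 code units are computed on demand and cached.
struct DomString {
    utf8: String,
    utf16: OnceCell<Vec<u16>>,
}

impl DomString {
    fn new(s: &str) -> Self {
        DomString { utf8: s.to_owned(), utf16: OnceCell::new() }
    }

    /// Cheap: no conversion needed for UTF-8 consumers.
    fn as_str(&self) -> &str {
        &self.utf8
    }

    /// Converts on first use only; later calls return the cached buffer.
    fn as_utf16(&self) -> &[u16] {
        self.utf16.get_or_init(|| self.utf8.encode_utf16().collect())
    }
}
```

Strings that script never inspects would then never pay for the second conversion.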

(I'm ignoring the case of content served as UTF-16 because it's so rare.)

For Windows-1252 (aka "ISO 8859-1" on the Web) I'm sure that we can find or
write a very fast translator to UTF-8, with a fast path for long runs of ASCII.
Or we can have the HTML parser assume ASCII-superset but not UTF-8
specifically. One obstacle is that script could write characters that aren't
representable in the source charset.
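A minimal sketch of such a translator, assuming the Windows-1252 mapping from the WHATWG Encoding Standard (bytes 0x80 to 0x9F use a 32-entry table, and bytes 0xA0 to 0xFF map to the code point with the same value). `decode_windows_1252` and `HIGH_TABLE` are illustrative names, and the ASCII scan here is the simplest possible one; a tuned version would presumably test a word at a time and skip the redundant UTF-8 validation.

```rust
/// Code points for Windows-1252 bytes 0x80..=0x9F.
const HIGH_TABLE: [char; 32] = [
    '\u{20AC}', '\u{0081}', '\u{201A}', '\u{0192}', '\u{201E}', '\u{2026}', '\u{2020}', '\u{2021}',
    '\u{02C6}', '\u{2030}', '\u{0160}', '\u{2039}', '\u{0152}', '\u{008D}', '\u{017D}', '\u{008F}',
    '\u{0090}', '\u{2018}', '\u{2019}', '\u{201C}', '\u{201D}', '\u{2022}', '\u{2013}', '\u{2014}',
    '\u{02DC}', '\u{2122}', '\u{0161}', '\u{203A}', '\u{0153}', '\u{009D}', '\u{017E}', '\u{0178}',
];

fn decode_windows_1252(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while !rest.is_empty() {
        // Fast path: find the length of the leading ASCII run and copy it in bulk.
        let run = rest.iter().position(|b| !b.is_ascii()).unwrap_or(rest.len());
        out.push_str(std::str::from_utf8(&rest[..run]).expect("ASCII is valid UTF-8"));
        rest = &rest[run..];
        // Slow path: translate a single non-ASCII byte, then resume scanning.
        if let Some((&b, tail)) = rest.split_first() {
            let c = if b < 0xA0 {
                HIGH_TABLE[(b - 0x80) as usize]
            } else {
                char::from(b) // 0xA0..=0xFF map to U+00A0..=U+00FF
            };
            out.push(c);
            rest = tail;
        }
    }
    out
}
```

Every input byte has a mapping, so this direction cannot fail. The obstacle mentioned above only arises in the other direction, if the parser kept its input in the source charset.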

Speaking of which, the spec says [5]:

> Any character that is a not a Unicode character, i.e. any isolated surrogate,
> is a parse error. (These can only find their way into the input stream via
> script APIs such as document.write().)

I don't see a designated error-recovery behavior, unlike most parse errors in
the spec. Is there an implicit behavior that applies to input stream
preprocessing? Anyway, I hope that this means we don't need to represent
isolated surrogates in the input to a UTF-8 parser.
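For reference, here is a small sketch of what happens at the boundary if script hands us UTF-16 code units and the parser wants UTF-8. An isolated surrogate has no UTF-8 encoding, so something must be done with it. Substituting U+FFFD below is purely an assumption for illustration, not a recovery behavior taken from the spec, and `utf16_to_utf8_lossy` is a made-up name.

```rust
/// Convert UTF-16 code units (e.g. from document.write()) to UTF-8.
/// Well-formed surrogate pairs are combined; each isolated surrogate
/// is replaced with U+FFFD (an assumed policy, see above).
fn utf16_to_utf8_lossy(units: &[u16]) -> String {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Rust's `char` and `String` types cannot hold a surrogate at all (`char::from_u32(0xD800)` is `None`, and `String::from_utf16` returns an error on such input), so a UTF-8 parser written on top of them has to pick some policy like this at the script API boundary.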

keegan

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#the-input-byte-stream
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#changing-the-encoding-while-parsing
[3] http://w3techs.com/technologies/overview/character_encoding/all/
[4] https://github.com/mozilla/servo/issues/1880
[5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream
_______________________________________________
dev-servo mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-servo
