Should we implement character encoding detection [1] at the same time as the rest of the HTML parser? It seems to be separable; the only design interactions I see are:
- The character decoder and script APIs can write into the same input stream
- The encoding can change during parsing [2], which can be handled as a mostly-normal navigation

Also, should we follow Gecko in representing the input stream as a queue of UTF-16 buffers? With UTF-8 we would have about half as much data to stream through the parser, and (on 64-bit) we could do case-insensitive ASCII operations 8 characters at a time.

Most content [3] is UTF-8, and the trend is in that direction. But for non-UTF-8 content we would pay the price of two conversions, if the ultimate product is UCS-2 DOM strings. However, I don't think we've finalized our representation of DOM strings, and we might do the UCS-2 conversion lazily [4]. (I'm ignoring the case of content served as UTF-16 because it's so rare.)

For Windows-1252 (aka "ISO 8859-1" on the Web) I'm sure that we can find or write a very fast translator to UTF-8, with a fast path for long runs of ASCII. Or we can have the HTML parser assume ASCII-superset but not UTF-8 specifically. One obstacle is that script could write characters that aren't representable in the source charset.

Speaking of which, [5]

> Any character that is a not a Unicode character, i.e. any isolated surrogate,
> is a parse error. (These can only find their way into the input stream via
> script APIs such as document.write().)

I don't see a designated error-recovery behavior, unlike most parse errors in the spec. Is there an implicit behavior that applies to input stream preprocessing? Anyway, I hope that this means we don't need to represent isolated surrogates in the input to a UTF-8 parser.
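
To make the two UTF-8 ideas above concrete, here is a rough sketch, not a proposal for actual Servo code. The function names (windows_1252_to_utf8, ascii_lowercase_in_place, lowercase_word) are hypothetical, and nothing here has been benchmarked. The 0x80..0x9F table follows the windows-1252 index in the WHATWG Encoding Standard, which maps the five bytes left undefined by Microsoft to the matching C1 controls.

```rust
/// Code points for Windows-1252 bytes 0x80..=0x9F (WHATWG Encoding Standard
/// index; 0x81, 0x8D, 0x8F, 0x90 and 0x9D map to the matching C1 controls).
const HIGH_C1: [u16; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

/// Hypothetical helper: decode Windows-1252 bytes into a UTF-8 `String`.
pub fn windows_1252_to_utf8(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while !rest.is_empty() {
        // Fast path: copy the longest ASCII prefix in one go.
        let run = rest.iter().position(|b| !b.is_ascii()).unwrap_or(rest.len());
        // ASCII bytes are always valid UTF-8, so this cannot fail.
        out.push_str(std::str::from_utf8(&rest[..run]).expect("ASCII is valid UTF-8"));
        rest = &rest[run..];
        // Slow path: translate a single non-ASCII byte, then retry the fast path.
        if let Some((&b, tail)) = rest.split_first() {
            let cp = match b {
                0x80..=0x9F => HIGH_C1[(b - 0x80) as usize] as u32,
                _ => b as u32, // 0xA0..=0xFF coincide with U+00A0..=U+00FF
            };
            out.push(char::from_u32(cp).expect("table contains no surrogates"));
            rest = tail;
        }
    }
    out
}

const ONES: u64 = 0x0101_0101_0101_0101;
const HIGH_BITS: u64 = 0x80 * ONES;

/// Lowercase eight ASCII bytes packed into a `u64`.
/// Precondition: every byte is < 0x80, so the additions cannot carry
/// from one byte into the next.
fn lowercase_word(w: u64) -> u64 {
    // High bit of each byte is set iff that byte is >= b'A'.
    let ge_a = w + (0x80 - b'A' as u64) * ONES;
    // High bit of each byte is set iff that byte is > b'Z'.
    let gt_z = w + (0x80 - (b'Z' as u64 + 1)) * ONES;
    let is_upper = ge_a & !gt_z & HIGH_BITS;
    // Shift 0x80 down to 0x20, the ASCII case bit, and set it.
    w | (is_upper >> 2)
}

/// Hypothetical helper: ASCII-lowercase a buffer 8 bytes at a time,
/// leaving non-ASCII bytes (e.g. UTF-8 multi-byte sequences) untouched.
pub fn ascii_lowercase_in_place(buf: &mut [u8]) {
    let mut chunks = buf.chunks_exact_mut(8);
    for chunk in &mut chunks {
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(chunk);
        let w = u64::from_le_bytes(bytes);
        if w & HIGH_BITS == 0 {
            // All eight bytes are ASCII: handle them as one word.
            chunk.copy_from_slice(&lowercase_word(w).to_le_bytes());
        } else {
            // Mixed chunk: fall back to the byte-at-a-time routine.
            chunk.make_ascii_lowercase();
        }
    }
    chunks.into_remainder().make_ascii_lowercase();
}
```

For example, windows_1252_to_utf8(b"caf\xE9 \x80 5") should give "café € 5". Note that a Rust String is always valid UTF-8 and so cannot hold an isolated surrogate at all, which is the situation I'm hoping the spec lets us get away with.
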

keegan

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#the-input-byte-stream
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#changing-the-encoding-while-parsing
[3] http://w3techs.com/technologies/overview/character_encoding/all/
[4] https://github.com/mozilla/servo/issues/1880
[5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo