Should we implement character encoding detection [1] at the same time as the 
rest of the HTML parser?  It seems to be separable; the only design 
interactions I see are:

- The character decoder and script APIs can write into the same input stream
- The encoding can change during parsing [2], which can be handled as a
  mostly-normal navigation
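
To make the first interaction concrete, here is a minimal sketch of a shared
input stream. Everything in it is hypothetical (the InputStream type and the
push_decoded / push_script / next_char names are mine, not an existing Servo
API), and it assumes that script-written text is consumed before any unread
network input, which is only an approximation of the spec's insertion point.

```rust
use std::collections::VecDeque;

/// Hypothetical input stream: a queue of already-decoded UTF-8 chunks.
/// Invariant: no chunk in the queue is empty.
struct InputStream {
    chunks: VecDeque<String>,
}

impl InputStream {
    fn new() -> InputStream {
        InputStream { chunks: VecDeque::new() }
    }

    /// Called by the character decoder as network data arrives.
    fn push_decoded(&mut self, s: &str) {
        if !s.is_empty() {
            self.chunks.push_back(s.to_owned());
        }
    }

    /// Called for script writes such as document.write().
    /// Assumption: the text goes ahead of all unread input.
    fn push_script(&mut self, s: &str) {
        if !s.is_empty() {
            self.chunks.push_front(s.to_owned());
        }
    }

    /// Pop the next character, dropping a chunk once it is exhausted.
    /// String::remove is O(n); a real parser would track an offset instead.
    fn next_char(&mut self) -> Option<char> {
        let front = self.chunks.front_mut()?;
        let c = front.remove(0);
        if front.is_empty() {
            self.chunks.pop_front();
        }
        Some(c)
    }
}
```

With this shape the tokenizer only ever sees next_char(), so the decoder can
be developed separately from the rest of the parser.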

Also, should we follow Gecko in representing the input stream as a queue of
UTF-16 buffers?  With UTF-8 we would have about half as much data to stream
through the parser for mostly-ASCII markup, and (on 64-bit) we could do
case-insensitive ASCII operations 8 characters at a time.
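
As a rough illustration of the "8 characters at a time" idea, here is a
sketch that lowercases ASCII letters in a 64-bit word using plain integer
arithmetic, then uses it for a case-insensitive comparison. The function
names are hypothetical, and I haven't measured whether this beats a
byte-at-a-time loop in practice.

```rust
use std::convert::TryInto;

const ONES: u64 = 0x0101_0101_0101_0101;

/// Lowercase every ASCII 'A'..='Z' byte in an 8-byte word.
/// Bytes with the high bit set (UTF-8 non-ASCII) are left untouched.
fn lower_ascii_word(w: u64) -> u64 {
    let hi = w & (0x80 * ONES);
    let low7 = w & (0x7f * ONES);
    // Bit 7 of each byte is set iff that byte's low 7 bits are >= 'A'.
    let ge_a = low7 + (0x80 - 0x41) * ONES;
    // Bit 7 of each byte is set iff that byte's low 7 bits are > 'Z'.
    let gt_z = low7 + (0x80 - 0x5b) * ONES;
    let is_upper = ge_a & !gt_z & !hi & (0x80 * ONES);
    // 0x80 >> 2 == 0x20, the ASCII case bit.
    w | (is_upper >> 2)
}

/// ASCII-case-insensitive equality, 8 bytes per step.
fn eq_ignore_ascii_case_swar(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let mut ca = a.chunks_exact(8);
    let mut cb = b.chunks_exact(8);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        let wx = u64::from_le_bytes(x.try_into().unwrap());
        let wy = u64::from_le_bytes(y.try_into().unwrap());
        if lower_ascii_word(wx) != lower_ascii_word(wy) {
            return false;
        }
    }
    // Fewer than 8 bytes left: fall back to the standard library.
    ca.remainder().eq_ignore_ascii_case(cb.remainder())
}
```

The same trick with UTF-16 buffers would only cover 4 characters per word.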

Most content [3] is UTF-8, and the trend is in that direction.  But for
non-UTF-8 content we would pay the price of two conversions (source encoding
to UTF-8, then UTF-8 to UCS-2) if the ultimate product is UCS-2 DOM strings.
However, I don't think we've finalized our representation of DOM strings, and
we might do the UCS-2 conversion lazily [4].

(I'm ignoring the case of content served as UTF-16 because it's so rare.)

For Windows-1252 (aka "ISO 8859-1" on the Web) I'm sure that we can find or 
write a very fast translator to UTF-8, with a fast path for long runs of ASCII.
Or we can have the HTML parser assume ASCII-superset but not UTF-8
specifically.  One obstacle is that script could write characters that aren't 
representable in the source charset.
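
For the translator option, a minimal sketch might look like the following.
The function name is hypothetical; the table is the 0x80..=0x9F range of the
windows-1252 index as I understand it from the WHATWG Encoding Standard, and
the "fast path" here is just a scan for the next non-ASCII byte, not a tuned
implementation.

```rust
/// Code points for bytes 0x80..=0x9F in windows-1252.
/// Bytes 0xA0..=0xFF map to the code point with the same value.
const HIGH: [u16; 32] = [
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178,
];

fn windows_1252_to_utf8(input: &[u8]) -> String {
    let mut out = String::with_capacity(input.len());
    let mut rest = input;
    while !rest.is_empty() {
        // Fast path: copy the longest ASCII prefix in one go.
        let n = rest.iter().position(|&b| b >= 0x80).unwrap_or(rest.len());
        // Pure ASCII is always valid UTF-8, so this unwrap cannot fail.
        out.push_str(std::str::from_utf8(&rest[..n]).unwrap());
        rest = &rest[n..];
        // Slow path: translate a single non-ASCII byte.
        if let Some((&b, tail)) = rest.split_first() {
            let cp = if b < 0xA0 {
                HIGH[(b - 0x80) as usize] as u32
            } else {
                b as u32
            };
            // No table entry is a surrogate, so every value is a valid char.
            out.push(char::from_u32(cp).unwrap());
            rest = tail;
        }
    }
    out
}
```

Every windows-1252 byte decodes to some character, so this direction never
fails; the obstacle above only bites if we keep the source charset as the
parser's internal representation.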

Speaking of which, [5] says:

> Any character that is a not a Unicode character, i.e. any isolated surrogate,
> is a parse error. (These can only find their way into the input stream via
> script APIs such as document.write().)

I don't see a designated error-recovery behavior here, unlike for most parse
errors in the spec.  Is there an implicit behavior that applies to input stream
preprocessing?  Anyway, I hope that this means we don't need to represent
isolated surrogates in the input to a UTF-8 parser.
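
One possible recovery, sketched below, is to convert script-supplied UTF-16
at the boundary and replace each isolated surrogate with U+FFFD while counting
a parse error. To be clear, the replacement behavior is my assumption, not
something I found in the spec, and the function name is hypothetical.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Convert UTF-16 code units written by script into UTF-8.
/// Returns the text and the number of isolated surrogates encountered.
/// Assumption: each isolated surrogate becomes U+FFFD.
fn script_text_to_utf8(units: &[u16]) -> (String, usize) {
    let mut errors = 0;
    let text: String = decode_utf16(units.iter().cloned())
        .map(|r| {
            r.unwrap_or_else(|_| {
                // One parse error per isolated surrogate.
                errors += 1;
                REPLACEMENT_CHARACTER
            })
        })
        .collect();
    (text, errors)
}
```

With something like this in front of the parser, the input stream itself can
stay a plain UTF-8 string type.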

keegan


[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#the-input-byte-stream
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#changing-the-encoding-while-parsing
[3] http://w3techs.com/technologies/overview/character_encoding/all/
[4] https://github.com/mozilla/servo/issues/1880
[5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo