On 10/03/2014 23:54, Keegan McAllister wrote:
[...]
> Also, should we follow Gecko in representing the input stream as a
> queue of UTF-16 buffers? With UTF-8 we would have about half as much
> data to stream through the parser, and (on 64-bit) we could do
> case-insensitive ASCII operations 8 characters at a time.
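For reference, the "8 characters at a time" idea can be sketched as follows. This is only an illustration of the general technique on UTF-8 bytes, not code from Gecko or Servo; the names lowercase_word and lowercase_ascii_in_place are made up for this example.

```rust
/// Lowercase every ASCII uppercase letter in a 64-bit word holding 8 bytes.
/// Bytes with the high bit set (UTF-8 lead/continuation bytes) are left alone.
fn lowercase_word(w: u64) -> u64 {
    const HIGH: u64 = 0x8080_8080_8080_8080;
    // Clear bit 7 of each byte so the per-byte additions cannot carry over.
    let low7 = w & !HIGH;
    // Bit 7 becomes set in each byte whose low 7 bits are >= b'A' (0x41).
    let ge_a = low7 + 0x3f3f_3f3f_3f3f_3f3f;
    // Bit 7 becomes set in each byte whose low 7 bits are > b'Z' (0x5a).
    let gt_z = low7 + 0x2525_2525_2525_2525;
    // Uppercase ASCII: in range A..=Z and the original high bit was clear.
    let is_upper = ge_a & !gt_z & !w & HIGH;
    // Move the 0x80 flag down to 0x20, the ASCII case bit, and set it.
    w | (is_upper >> 2)
}

/// Apply lowercase_word to a byte buffer, 8 bytes per step, with a
/// byte-by-byte loop for the tail.
fn lowercase_ascii_in_place(bytes: &mut [u8]) {
    let mut chunks = bytes.chunks_exact_mut(8);
    for chunk in &mut chunks {
        let mut arr = [0u8; 8];
        arr.copy_from_slice(chunk);
        let word = u64::from_ne_bytes(arr);
        chunk.copy_from_slice(&lowercase_word(word).to_ne_bytes());
    }
    for b in chunks.into_remainder() {
        b.make_ascii_lowercase();
    }
}
```

With UTF-16 code units the same trick would cover only 4 characters per 64-bit word, which is the point being made above.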
UTF-8 would be nice, but I think we’re stuck with UTF-16 for the reasons
below. (Actually, sequences of 16-bit integers containing potentially
invalid UTF-16. Is that called "UCS-2"?)
Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
without the artificial restriction of not encoding surrogates. But I
feel bad just suggesting it. (This restriction, as well as the upper
limit of U+10FFFF, exists to align with the value space of UTF-16.
UTF-8’s underlying algorithm is perfectly fine without either.)
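To make "evil UTF-8" concrete, here is a minimal sketch of the plain UTF-8 bit layout applied to any value up to 0x10FFFF, surrogates included. The function name encode_generalized_utf8 is hypothetical, and this is not something any standard or existing library is claimed to define in this form.

```rust
/// Encode `cp` with the ordinary UTF-8 bit layout, but WITHOUT rejecting
/// the surrogate range 0xD800..=0xDFFF (the "evil" part; an assumption of
/// this sketch, not valid UTF-8).
fn encode_generalized_utf8(cp: u32, out: &mut Vec<u8>) {
    assert!(cp <= 0x10FFFF);
    match cp {
        0..=0x7F => out.push(cp as u8),
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        0x800..=0xFFFF => {
            // Surrogates fall in this three-byte range.
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
}
```

Under this layout the lone surrogate 0xD800 comes out as the bytes ED A0 80, which a strict UTF-8 decoder such as Rust's std::str::from_utf8 rejects. So such a buffer could not be handed around as an ordinary UTF-8 string without extra care.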
> Speaking of which, [5]
>
>     Any character that is not a Unicode character, i.e. any isolated
>     surrogate, is a parse error. (These can only find their way into
>     the input stream via script APIs such as document.write().)
>
> I don't see a designated error-recovery behavior, unlike most parse
> errors in the spec. Is there an implicit behavior that applies to
> input stream preprocessing? Anyway I hope that this means we don't
> need to represent isolated surrogates in the input to a UTF-8
> parser.
>
> [5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream
As far as I understand, a "parse error" in the spec is meant for
conformance checkers (validators), not user agents. There is no error
recovery behavior, because this is not an error.
I’d be surprised if anyone relies on truly isolated surrogates, because
Chrome gets very confused when rendering them:
data:text/html,<script>document.write("a\uD800b")</script>
Unfortunately I’d be less surprised if someone relies on having the two
halves of a surrogate pair in separate document.write() calls, as this
seems more interoperable:
data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>
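If we wanted a UTF-8 parser to cope with that second case, one option might be to hold back a trailing high surrogate until the next write arrives. A rough sketch follows, with a hypothetical Utf16ChunkDecoder type; replacing surrogates that never get paired with U+FFFD is an assumption on my part, not something the spec prescribes.

```rust
/// Hypothetical helper: decodes UTF-16 chunks (one per document.write()
/// call) and carries a trailing high surrogate over to the next chunk.
#[derive(Default)]
struct Utf16ChunkDecoder {
    pending_high: Option<u16>,
}

impl Utf16ChunkDecoder {
    /// Decode one chunk and append the result to `out`.
    fn feed(&mut self, chunk: &[u16], out: &mut String) {
        let mut units: Vec<u16> = Vec::with_capacity(chunk.len() + 1);
        // Start with the high surrogate left over from the previous chunk.
        units.extend(self.pending_high.take());
        units.extend_from_slice(chunk);
        // A trailing high surrogate may be completed by the next chunk.
        if let Some(&last) = units.last() {
            if (0xD800..=0xDBFF).contains(&last) {
                self.pending_high = units.pop();
            }
        }
        for unit in std::char::decode_utf16(units) {
            // Assumption: truly isolated surrogates become U+FFFD.
            out.push(unit.unwrap_or('\u{FFFD}'));
        }
    }

    /// Call at end of input: a still-pending high surrogate is isolated.
    fn finish(&mut self, out: &mut String) {
        if self.pending_high.take().is_some() {
            out.push('\u{FFFD}');
        }
    }
}
```

Feeding [0xD83D] and then [0xDCA9] yields the single character U+1F4A9, while the "a\uD800b" case from the first example loses the surrogate to the replacement character.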
--
Simon Sapin