On 10/03/2014 23:54, Keegan McAllister wrote:
[...]

Also, should we follow Gecko in representing the input stream as a
queue of UTF-16 buffers?  With UTF-8 we would have about half as much
data to stream through the parser, and (on 64-bit) we could do
case-insensitive ASCII operations 8 characters at a time.
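For illustration, here is a minimal sketch of what "8 characters at a time" could look like on UTF-8 input. The function name `to_ascii_lowercase_x8` is hypothetical and this is not taken from any existing parser; it packs eight bytes into a `u64` and lowercases only the ASCII letters A-Z, leaving non-ASCII bytes untouched.

```rust
/// Lowercase the ASCII uppercase letters in eight packed bytes at once.
/// Hypothetical helper for illustration only.
fn to_ascii_lowercase_x8(word: u64) -> u64 {
    const ONES: u64 = 0x0101_0101_0101_0101;
    // Low 7 bits of each byte, so the additions below cannot carry
    // from one byte into the next.
    let heptets = word & (0x7f * ONES);
    // High bit of each byte is set when that byte's low 7 bits are > 'Z'.
    let is_gt_z = heptets + (0x7f - b'Z' as u64) * ONES;
    // High bit of each byte is set when that byte's low 7 bits are >= 'A'.
    let is_ge_a = heptets + (0x80 - b'A' as u64) * ONES;
    // High bit of each byte is set when the original byte is ASCII.
    let is_ascii = !word & (0x80 * ONES);
    // 0x80 in every byte that holds 'A'..='Z', 0x00 elsewhere.
    let is_upper = is_ascii & (is_ge_a ^ is_gt_z);
    // Shift 0x80 down to 0x20 and set that bit to reach lowercase.
    word | (is_upper >> 2)
}
```

A caller would load eight bytes with `u64::from_le_bytes`, apply the function, and compare or store the result with `to_le_bytes`. With UTF-16 the same 64-bit word would hold only four code units.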

UTF-8 would be nice, but I think we’re stuck with UTF-16 for the reasons below. (Actually sequences of 16-bit integers containing potentially-invalid UTF-16. Is that called "UCS-2"?)

Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8 without the artificial restriction of not encoding surrogates. But I feel bad just suggesting it. (This restriction, as well as the upper limit of U+10FFFF, exists to align with the value-space of UTF-16. UTF-8’s underlying algorithm is perfectly fine without either.)
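To make "evil UTF-8" concrete, here is a rough sketch of an encoder that applies the usual UTF-8 bit layout to any code point up to U+10FFFF, surrogates included. The name `encode_evil_utf8` is made up for this example, and it is only one way such an encoding might be written.

```rust
/// Encode `cp` with the UTF-8 bit layout, without rejecting surrogates.
/// Hypothetical helper; the U+10FFFF limit is kept here by assumption.
fn encode_evil_utf8(cp: u32) -> Vec<u8> {
    assert!(cp <= 0x10FFFF, "code point out of range");
    let mut out = Vec::with_capacity(4);
    match cp {
        // One byte: 0xxxxxxx
        0..=0x7F => out.push(cp as u8),
        // Two bytes: 110xxxxx 10xxxxxx
        0x80..=0x7FF => {
            out.push(0xC0 | (cp >> 6) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        // (this range includes the surrogates U+D800..U+DFFF)
        0x800..=0xFFFF => {
            out.push(0xE0 | (cp >> 12) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
        // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        _ => {
            out.push(0xF0 | (cp >> 18) as u8);
            out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
            out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
            out.push(0x80 | (cp & 0x3F) as u8);
        }
    }
    out
}
```

For U+D800 this yields the bytes ED A0 80, which a strict UTF-8 decoder such as Rust's `String::from_utf8` rejects. For ordinary code points the output is identical to standard UTF-8.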


Speaking of which, [5]

Any character that is not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into
the input stream via script APIs such as document.write().)

I don't see a designated error-recovery behavior, unlike most parse
errors in the spec.  Is there an implicit behavior that applies to
input stream preprocessing?  Anyway I hope that this means we don't
need to represent isolated surrogates in the input to a UTF-8
parser.

[5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream

As far as I understand, a "parse error" in the spec is meant for conformance checkers (validators), not user agents. There is no error recovery behavior, because this is not an error.


I’d be surprised if anyone relies on truly isolated surrogates, because Chrome gets very confused when rendering them:

data:text/html,<script>document.write("a\uD800b")</script>


Unfortunately I’d be less surprised if someone relies on having the two halves of a surrogate pair in separate document.write() calls, as this seems more interoperable:

data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>
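If the split-pair case has to work with a UTF-8 parser, one possible approach is to hold back a trailing lead surrogate until the next write arrives. The sketch below assumes a hypothetical `WriteBuffer` type that receives each document.write() argument as UTF-16 code units. Replacing a truly isolated surrogate with U+FFFD is my assumption for the example, not something the spec prescribes.

```rust
use std::char::{decode_utf16, REPLACEMENT_CHARACTER};

/// Hypothetical adapter between script-provided UTF-16 and a UTF-8 parser.
#[derive(Default)]
struct WriteBuffer {
    /// A lead surrogate seen at the end of the previous write, if any.
    pending_lead: Option<u16>,
}

impl WriteBuffer {
    /// Accept one write() argument and return the text that is ready to parse.
    fn write(&mut self, chunk: &[u16]) -> String {
        // Start with the lead surrogate held back from the previous call.
        let mut units: Vec<u16> = self.pending_lead.take().into_iter().collect();
        units.extend_from_slice(chunk);
        // Hold back a trailing lead surrogate: its other half may arrive
        // in the next call.
        if let Some(&last) = units.last() {
            if (0xD800..=0xDBFF).contains(&last) {
                self.pending_lead = units.pop();
            }
        }
        // Assumption: any surrogate that is still unpaired becomes U+FFFD.
        decode_utf16(units)
            .map(|r| r.unwrap_or(REPLACEMENT_CHARACTER))
            .collect()
    }

    /// Flush at end of input: a lead surrogate that never got its pair.
    fn finish(&mut self) -> String {
        match self.pending_lead.take() {
            Some(_) => REPLACEMENT_CHARACTER.to_string(),
            None => String::new(),
        }
    }
}
```

With this, writing 0xD83D and then 0xDCA9 in two calls produces nothing for the first call and U+1F4A9 for the second, while "a\uD800b" in a single call comes out as "a", U+FFFD, "b".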

--
Simon Sapin
_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
