Re: [dev-servo] character encoding in the HTML parser

Simon Sapin Sun, 30 Mar 2014 09:19:24 -0700

On 29/03/2014 22:56, Simon Sapin wrote:

On 10/03/2014 23:54, Keegan McAllister wrote:

Speaking of which, [5]

Any character that is not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into
the input stream via script APIs such as document.write().)


I don't see a designated error-recovery behavior, unlike most parse
errors in the spec.  Is there an implicit behavior that applies to
input stream preprocessing?  Anyway I hope that this means we don't
need to represent isolated surrogates in the input to a UTF-8
parser.

[5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream
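
To illustrate the representation question, here is a minimal Rust sketch. It assumes that script input arrives as UTF-16 code units and that the parser wants well-formed UTF-8; the function names are hypothetical and this is not Servo's actual code. A Rust `String` cannot hold an isolated surrogate, so the conversion has to either fail or substitute U+FFFD:

```rust
/// Hypothetical helper: convert UTF-16 code units (as a script API such as
/// document.write() might supply them) to a Rust `String`, which is always
/// well-formed UTF-8. Returns `None` if the input contains an isolated
/// surrogate, since that cannot be represented.
fn utf16_to_utf8_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Hypothetical helper: lossy variant. Each isolated surrogate is replaced
/// with U+FFFD REPLACEMENT CHARACTER; valid surrogate pairs are decoded.
fn utf16_to_utf8_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

With the input "a\uD800b", the strict version returns `None` and the lossy version returns "a\u{FFFD}b". Whether replacement is acceptable behavior for the parser is exactly the open question here.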


As far as I understand, a "parse error" in the spec is meant for
conformance checkers (validators), not user agents. There is no error
recovery behavior because, for a user agent, this is not an error.


I’d be surprised if anyone relies on truly isolated surrogates, because
Chrome gets very confused when rendering them:

data:text/html,<script>document.write("a\uD800b")</script>


Unfortunately I’d be less surprised if someone relies on having the two
halves of a surrogate pair in separate document.write() calls, as this
seems more interoperable:

data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>
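
One way a UTF-8 based parser could still support this case is to hold back a trailing high surrogate until the next write. The Rust sketch below is only an illustration under that assumption; `Utf16Writer` and its methods are hypothetical names, not an existing Servo or spec-defined API, and no browser is claimed to work this way:

```rust
/// Hypothetical adapter between UTF-16 script input and a UTF-8 parser.
/// It remembers a high surrogate left dangling at the end of one write
/// so that it can pair up with a low surrogate starting the next write.
#[derive(Default)]
struct Utf16Writer {
    pending_high: Option<u16>,
}

impl Utf16Writer {
    /// Returns the UTF-8 text that can be handed to the parser now.
    fn write(&mut self, units: &[u16]) -> String {
        let mut buf: Vec<u16> = Vec::with_capacity(units.len() + 1);
        // Prepend the high surrogate held back by the previous call, if any.
        buf.extend(self.pending_high.take());
        buf.extend_from_slice(units);

        // If the buffer ends with a high surrogate, hold it back again.
        let ends_with_high =
            matches!(buf.last(), Some(&u) if (0xD800..=0xDBFF).contains(&u));
        if ends_with_high {
            self.pending_high = buf.pop();
        }

        // Any surrogate that is still isolated becomes U+FFFD.
        String::from_utf16_lossy(&buf)
    }

    /// Call at end of input: a surrogate still pending is truly isolated.
    fn finish(&mut self) -> String {
        match self.pending_high.take() {
            Some(_) => "\u{FFFD}".to_string(),
            None => String::new(),
        }
    }
}
```

Writing [0xD83D] and then [0xDCA9] yields an empty string followed by U+1F4A9, while a high surrogate that never receives its other half is replaced with U+FFFD in `finish()`.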


More on this:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=11298#c2
https://github.com/html5lib/html5lib-tests/issues/19

(Thanks Geoffrey Sneddon for doing the archaeology work.)


--
Simon Sapin
_______________________________________________
dev-servo mailing list
https://lists.mozilla.org/listinfo/dev-servo
