Re: [dev-servo] character encoding in the HTML parser
On 29/03/2014 23:15, Boris Zbarsky wrote:
> On 3/29/14 6:56 PM, Simon Sapin wrote:
>> Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
>> without the artificial restriction of not encoding surrogates.
>
> http://en.wikipedia.org/wiki/CESU-8

CESU-8 is evil too, but it’s not what I had in mind. Its main
characteristic is encoding non-BMP characters as surrogate pairs, which
does not change the value space.

But http://www.unicode.org/reports/tr26/ is unclear on whether CESU-8
allows unpaired surrogates (which was the issue in the previous
message). I suppose it does not, by virtue of valid UTF-16 not allowing
them either.

-- 
Simon Sapin
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
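The difference between the two encodings discussed above can be made concrete with a small sketch. It assumes Python 3 and uses the standard `surrogatepass` error handler to stand in for "evil UTF-8". The function names `generalized_utf8` and `cesu8` are hypothetical, chosen for illustration only, and the CESU-8 routine is a reading of TR26 rather than a reference implementation.

```python
def generalized_utf8(s: str) -> bytes:
    """UTF-8 without the restriction on surrogates ("evil UTF-8").

    The "surrogatepass" handler makes the codec emit U+D800..U+DFFF as
    ordinary three-byte sequences instead of raising an error.
    """
    return s.encode("utf-8", "surrogatepass")


def cesu8(s: str) -> bytes:
    """CESU-8 as sketched here: encode each UTF-16 code unit separately.

    Non-BMP characters become two three-byte sequences (one per
    surrogate). Assumption: lone surrogates are passed through, which
    TR26 may not permit; that is the open question in the message above.
    """
    utf16 = s.encode("utf-16-be", "surrogatepass")
    units = [int.from_bytes(utf16[i:i + 2], "big")
             for i in range(0, len(utf16), 2)]
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)


# An isolated surrogate is representable in "evil UTF-8":
#   generalized_utf8("a\ud800b")   == b"a\xed\xa0\x80b"
# A non-BMP character keeps its normal four-byte form there:
#   generalized_utf8("\U0001F4A9") == b"\xf0\x9f\x92\xa9"
# CESU-8 instead writes it as an encoded surrogate pair (six bytes):
#   cesu8("\U0001F4A9")            == b"\xed\xa0\xbd\xed\xb2\xa9"
```

The two differ in opposite directions: "evil UTF-8" enlarges the set of encodable values while leaving well-formed text unchanged, whereas CESU-8 changes the byte form of non-BMP text without (on the reading above) adding any new values.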
Re: [dev-servo] character encoding in the HTML parser
On 29/03/2014 22:56, Simon Sapin wrote:
> On 10/03/2014 23:54, Keegan McAllister wrote:
>> Speaking of which, [5]
>>
>>> Any character that is not a Unicode character, i.e. any isolated
>>> surrogate, is a parse error. (These can only find their way into
>>> the input stream via script APIs such as document.write().)
>>
>> I don't see a designated error-recovery behavior, unlike most parse
>> errors in the spec. Is there an implicit behavior that applies to
>> input stream preprocessing? Anyway I hope that this means we don't
>> need to represent isolated surrogates in the input to a UTF-8 parser.
>>
>> [5] http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream
>
> As far as I understand, a "parse error" in the spec is meant for
> conformance checkers (validators), not user agents. There is no error
> recovery behavior, because this is not an error.
>
> I’d be surprised if anyone relies on truly isolated surrogates,
> because Chrome gets very confused when rendering them:
>
>   data:text/html,document.write("a\uD800b")
>
> Unfortunately I’d be less surprised if someone relies on having the
> two halves of a surrogate pair in separate document.write() calls, as
> this seems more interoperable:
>
>   data:text/html,document.write("\uD83D");document.write("\uDCA9")

More on this:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=11298#c2
https://github.com/html5lib/html5lib-tests/issues/19

(Thanks Geoffrey Sneddon for doing the archaeology work.)

-- 
Simon Sapin
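One way to handle the split-pair case above without representing surrogates in the parser input is to hold back a trailing lead surrogate until the next write arrives. The following is a minimal sketch in Python, not Servo's implementation. The class name `SurrogateJoiner` and its methods are hypothetical, and replacing truly isolated surrogates with U+FFFD is an assumption made here; the spec text quoted above does not prescribe a recovery behavior.

```python
class SurrogateJoiner:
    """Joins surrogate pairs that are split across successive writes.

    Input chunks are sequences of UTF-16 code units (ints), as a
    script API such as document.write() would supply them.
    """

    def __init__(self):
        self.pending = None  # lead surrogate held back from an earlier chunk

    def feed(self, units):
        out = []
        for u in units:
            if self.pending is not None:
                lead, self.pending = self.pending, None
                if 0xDC00 <= u <= 0xDFFF:
                    # Lead + trail: combine into one non-BMP code point.
                    cp = 0x10000 + ((lead - 0xD800) << 10) + (u - 0xDC00)
                    out.append(chr(cp))
                    continue
                # Assumption: an unpaired lead becomes U+FFFD.
                out.append("\ufffd")
            if 0xD800 <= u <= 0xDBFF:
                self.pending = u          # wait for a possible trail unit
            elif 0xDC00 <= u <= 0xDFFF:
                out.append("\ufffd")      # trail with no preceding lead
            else:
                out.append(chr(u))
        return "".join(out)

    def finish(self):
        """Flush at end of input; a held-back lead is isolated."""
        if self.pending is not None:
            self.pending = None
            return "\ufffd"
        return ""


# The two-call case from the message above:
joiner = SurrogateJoiner()
first = joiner.feed([0xD83D])    # "" (lead surrogate is held back)
second = joiner.feed([0xDCA9])   # "\U0001F4A9"
```

With this design the text handed to the parser never contains a surrogate, so it can be stored as ordinary UTF-8, at the cost of delaying one code unit per write boundary.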
Re: [dev-servo] Crazy idea: CSS selector JITting at parse time
On 3/29/14 7:22 PM, Patrick Walton wrote:
> On a related note, I have been tossing around ideas today for using
> SIMD to match multiple selectors that have the same "shape" in
> parallel. For example, if we have ".foo #a" and ".bar #a" it may be
> possible to use the packed comparison instructions in SSE4 to match
> both at the same time. Obviously, this adds significant complexity and
> correspondingly increased maintenance burden, and its effectiveness
> depends on how often selectors have the same shape in the wild (if it
> works at all). So I'm filing it into the "potentially interesting
> project, not a high priority" mental bin. Could be neat though.

Just for fun, I tried some experiments with using SSE4 SIMD instructions
to match the four selectors `.class0 #foo`/`.class1 #foo`/`.class2
#foo`/`.class3 #foo` in parallel on some random DOMs (500,000 DOM nodes
with rand(0..8) random classes per node, assuming 16 classes in the
stylesheet). I observed a 27% speedup on my Core i7.

This is not amazing, and I suspect the problem is that the effectiveness
of the increased parallelism provided by the vector instructions is
offset by the increased number of memory accesses that the SIMD
instructions force you into. Of course, I should try on a snapshot of a
real Web page (the HTML5 spec, perhaps), but I don't expect to do much
better. 27% is not bad, but there are obviously much higher priority
things to try first (e.g. multithreading or GPUs, both of which win by a
lot more).

Patrick
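The "same shape" idea above can be illustrated with a scalar sketch in Python. This is not the experiment described in the message and says nothing about its performance. It only shows the data flow: one ancestor walk serves all four selectors, and each ancestor class is compared against four lanes at once. `Node`, `packed_eq`, and `match_four` are hypothetical names, class names are assumed to be interned as small integers, and the list comparison stands in for a single packed-compare instruction.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    id: Optional[str] = None
    classes: Tuple[int, ...] = ()     # interned class ids (assumption)
    parent: Optional["Node"] = None


def packed_eq(lanes: List[int], value: int) -> List[bool]:
    """Emulates a packed compare: `value` broadcast against every lane."""
    return [lane == value for lane in lanes]


def match_four(node: Node, target_id: str, lane_classes: List[int]) -> List[bool]:
    """Matches `.cN #target_id` for every cN in lane_classes in one walk.

    Returns one boolean per lane: True if `node` has id `target_id`
    and some ancestor carries that lane's class.
    """
    result = [False] * len(lane_classes)
    if node.id != target_id:
        return result                  # the shared `#id` part fails for all lanes
    ancestor = node.parent
    while ancestor is not None and not all(result):
        for cls in ancestor.classes:
            mask = packed_eq(lane_classes, cls)
            result = [r or m for r, m in zip(result, mask)]  # lane-wise OR
        ancestor = ancestor.parent
    return result


# A three-level tree: root has class 0, middle has classes 2 and 5.
root = Node(classes=(0,))
middle = Node(classes=(2, 5), parent=root)
leaf = Node(id="foo", parent=middle)
matches = match_four(leaf, "foo", [0, 1, 2, 3])   # [True, False, True, False]
```

Each `packed_eq` call still reads one class id from each ancestor, which is consistent with the suspicion above that memory accesses, rather than comparisons, limit the gain.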