date:20140330

Re: [dev-servo] character encoding in the HTML parser

2014-03-30 Thread Simon Sapin


On 29/03/2014 23:15, Boris Zbarsky wrote:

On 3/29/14 6:56 PM, Simon Sapin wrote:

Or I guess we could use what I’ll call "evil UTF-8", which is UTF-8
without the artificial restriction of not encoding surrogates.

http://en.wikipedia.org/wiki/CESU-8


CESU-8 is evil too, but it’s not what I had in mind. Its main 
characteristic is encoding non-BMP characters as surrogate pairs, which 
does not change the value space.


But http://www.unicode.org/reports/tr26/ is unclear on whether CESU-8 
allows unpaired surrogates (which was the issue in the previous 
message). I suppose it does not, by virtue of valid UTF-16 not allowing 
them either.
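
To make the difference concrete, here is a rough sketch in Rust. The function names are made up for illustration and are not from any existing library. `encode_generalized_utf8` applies the usual UTF-8 bit layout without rejecting surrogates (the "evil UTF-8" idea), so its value space grows to include U+D800..U+DFFF. `encode_cesu8` takes a Rust `char`, which can never be a surrogate, and only changes how non-BMP characters are spelled, on the assumption that CESU-8 does not allow unpaired surrogates.

```rust
/// Encode one code point (0..=0x10FFFF) using the UTF-8 bit layout,
/// deliberately NOT rejecting surrogates (0xD800..=0xDFFF).
/// Hypothetical helper, written for this illustration only.
fn encode_generalized_utf8(cp: u32, out: &mut Vec<u8>) {
    assert!(cp <= 0x10FFFF, "not a code point");
    if cp < 0x80 {
        out.push(cp as u8);
    } else if cp < 0x800 {
        out.push(0xC0 | (cp >> 6) as u8);
        out.push(0x80 | (cp & 0x3F) as u8);
    } else if cp < 0x10000 {
        // Surrogates fall in this branch and are encoded like any
        // other three-byte code point.
        out.push(0xE0 | (cp >> 12) as u8);
        out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
        out.push(0x80 | (cp & 0x3F) as u8);
    } else {
        out.push(0xF0 | (cp >> 18) as u8);
        out.push(0x80 | ((cp >> 12) & 0x3F) as u8);
        out.push(0x80 | ((cp >> 6) & 0x3F) as u8);
        out.push(0x80 | (cp & 0x3F) as u8);
    }
}

/// CESU-8 style encoding: a non-BMP character is first split into a
/// UTF-16 surrogate pair, then each half is encoded as three bytes.
/// The input is a `char`, so the set of representable values is the
/// same as for UTF-8 (assumption: no unpaired surrogates in CESU-8).
fn encode_cesu8(c: char, out: &mut Vec<u8>) {
    let cp = c as u32;
    if cp >= 0x10000 {
        let v = cp - 0x10000;
        encode_generalized_utf8(0xD800 + (v >> 10), out);
        encode_generalized_utf8(0xDC00 + (v & 0x3FF), out);
    } else {
        encode_generalized_utf8(cp, out);
    }
}
```

With this sketch, the isolated surrogate U+D800 becomes the bytes ED A0 80 under the generalized encoding, while U+1F4A9 is F0 9F 92 A9 in the generalized encoding (same as UTF-8) and ED A0 BD ED B2 A9 in the CESU-8 style.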


--
Simon Sapin
___
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo

Re: [dev-servo] character encoding in the HTML parser

2014-03-30 Thread Simon Sapin


On 29/03/2014 22:56, Simon Sapin wrote:

On 10/03/2014 23:54, Keegan McAllister wrote:

Speaking of which, [5]


Any character that is not a Unicode character, i.e. any isolated
surrogate, is a parse error. (These can only find their way into
the input stream via script APIs such as document.write().)


I don't see a designated error-recovery behavior, unlike most parse
errors in the spec.  Is there an implicit behavior that applies to
input stream preprocessing?  Anyway I hope that this means we don't
need to represent isolated surrogates in the input to a UTF-8
parser.

[5] 
http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#input-stream


As far as I understand, a "parse error" in the spec is meant for
conformance checkers (validators), not user agents. There is no error
recovery behavior, because this is not an error.


I’d be surprised if anyone relies on truly isolated surrogates, because
Chrome gets very confused when rendering them:

data:text/html,<script>document.write("a\uD800b")</script>


Unfortunately I’d be less surprised if someone relies on having the two
halves of a surrogate pair in separate document.write() calls, as this
seems more interoperable:

data:text/html,<script>document.write("\uD83D");document.write("\uDCA9")</script>
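
One way a parser could cope with the split-pair case is to hold back a lead surrogate that ends one write and join it with a trail surrogate that starts the next. The following is only a sketch of that idea in Rust: `WriteBuffer` and its methods are hypothetical names, and passing truly isolated surrogates through as raw code point values is an assumption, not something the spec or any browser is known to mandate.

```rust
/// Hypothetical buffer sitting between document.write() and the
/// tokenizer. Input arrives as UTF-16 code units, output is code
/// point values (u32, because a lone surrogate is not a `char`).
struct WriteBuffer {
    /// A lead surrogate seen at the end of the data so far, if any.
    pending_lead: Option<u16>,
}

impl WriteBuffer {
    fn new() -> Self {
        WriteBuffer { pending_lead: None }
    }

    /// Feed the code units of one document.write() call.
    fn write(&mut self, units: &[u16], out: &mut Vec<u32>) {
        for &u in units {
            if let Some(lead) = self.pending_lead.take() {
                if (0xDC00..=0xDFFF).contains(&u) {
                    // Lead + trail: combine into one non-BMP code point.
                    let hi = (lead as u32 - 0xD800) << 10;
                    let lo = u as u32 - 0xDC00;
                    out.push(0x10000 + hi + lo);
                    continue;
                }
                // The held lead was isolated after all; pass it on
                // unchanged (assumed policy, see above).
                out.push(lead as u32);
            }
            if (0xD800..=0xDBFF).contains(&u) {
                // Might be completed by this call or by a later one.
                self.pending_lead = Some(u);
            } else {
                out.push(u as u32);
            }
        }
    }

    /// Call at end of input to flush a lead surrogate still held.
    fn finish(&mut self, out: &mut Vec<u32>) {
        if let Some(lead) = self.pending_lead.take() {
            out.push(lead as u32);
        }
    }
}
```

Feeding `[0xD83D]` and then `[0xDCA9]` in two separate `write` calls yields the single value 0x1F4A9, while `[0x61, 0xD800, 0x62]` comes out unchanged. Note that holding a lead surrogate delays its output until more data arrives or `finish` is called.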


More on this:

https://www.w3.org/Bugs/Public/show_bug.cgi?id=11298#c2
https://github.com/html5lib/html5lib-tests/issues/19

(Thanks Geoffrey Sneddon for doing the archaeology work.)


--
Simon Sapin

Re: [dev-servo] Crazy idea: CSS selector JITting at parse time

2014-03-30 Thread Patrick Walton


On 3/29/14 7:22 PM, Patrick Walton wrote:

On a related note, I have been tossing around ideas today for using
SIMD to match multiple selectors that have the same "shape" in
parallel. For example, if we have ".foo #a" and ".bar #a" it may be
possible to use the packed comparison instructions in SSE4 to match
both at the same time. Obviously, this adds significant complexity
and correspondingly increased maintenance burden, and its
effectiveness depends on how often selectors have the same shape in
the wild (if it works at all). So I'm filing it into the "potentially
interesting project, not a high priority" mental bin. Could be neat
though.


Just for fun, I tried some experiments with using SSE4 SIMD instructions 
to match the four selectors `.class0 #foo`/`.class1 #foo`/`.class2 
#foo`/`.class3 #foo` in parallel on some random DOMs (500,000 DOM nodes 
with rand(0..8) random classes per node, assuming 16 classes in the 
stylesheet). I observed a 27% speedup on my Core i7. This is not 
amazing, and I suspect the problem is that the effectiveness of the 
increased parallelism provided by the vector instructions is offset by 
the increased number of memory accesses that the SIMD instructions force 
you into.
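
For readers who want to picture the "same shape" matching, here is a rough Rust sketch. It is not the code used in the experiment above: it emulates the four SIMD lanes with plain arrays instead of SSE intrinsics, and the `Node` layout, the integer class and ID values, and the function names are all assumptions made for illustration.

```rust
/// Four class IDs, one per selector `.classN #foo` (one per "lane").
type Lanes = [u32; 4];

/// Scalar stand-in for a packed 32-bit equality compare: lane i is
/// true when needles[i] equals the broadcast `class` value.
fn packed_eq(needles: Lanes, class: u32) -> [bool; 4] {
    [
        needles[0] == class,
        needles[1] == class,
        needles[2] == class,
        needles[3] == class,
    ]
}

/// Hypothetical DOM node: interned ID and class values as integers,
/// parent given as an index into the node slice.
struct Node {
    id: u32,
    classes: Vec<u32>,
    parent: Option<usize>,
}

/// For the node at `index`, report which of the four selectors
/// `.needles[i] #target_id` match, testing all four lanes together.
fn match_four(nodes: &[Node], index: usize, target_id: u32, needles: Lanes) -> [bool; 4] {
    let mut matched = [false; 4];
    // Rightmost compound selector: the node itself must have the ID.
    if nodes[index].id != target_id {
        return matched;
    }
    // Descendant combinator: some ancestor must carry the class.
    let mut cur = nodes[index].parent;
    while let Some(i) = cur {
        for &class in &nodes[i].classes {
            let eq = packed_eq(needles, class);
            for lane in 0..4 {
                matched[lane] |= eq[lane];
            }
        }
        if matched == [true; 4] {
            break; // every lane already matched, stop walking up
        }
        cur = nodes[i].parent;
    }
    matched
}
```

Every ancestor's class list is read once for all four selectors, which is where the parallelism would come from. The walk up the tree and the class list loads are still serial memory accesses.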


Of course, I should try on a snapshot of a real Web page (the HTML5 
spec, perhaps), but I don't expect to do much better. 27% is not bad, 
but there are obviously much higher priority things to try first (e.g. 
multithreading or GPUs, both of which win by a lot more).


Patrick

