Hi everyone,
Over the weekend I created a small repo to play around with selector
matching on the GPU:
https://github.com/pcwalton/selectron
There's a rough prototype of selector matching in there, with CPU,
OpenCL (GPU and CPU), and CUDA versions. I've only tried it on my MBP's
GeForce GT 650M.
So far the numbers have not been particularly good: 10%-100% slower than
the parallel CPU numbers, depending on the workload size, even without
counting memory transfer. (It is assumed that anywhere we would want to
deploy this would have zero-copy operation.)
From my profiling, it seems as though the branchiness of selector
matching is not the problem: selector matching is surprisingly
straight-line if the hash tables are implemented properly. Rather, the
issue is that my GPU, at least, really wants to read multiple DOM nodes
from the same 128-byte cache line. Because it's not realistic to expect
more than a couple of DOM nodes to fit in a 128-byte cache line, 89%
(!!) of instructions end up being replayed because of memory reads.
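To make the access pattern concrete, here is a rough CUDA-style sketch of
the kind of kernel I mean. The Node fields and the id-hash matching are
made-up illustrations, not selectron's actual layout or code: one thread
per node, array-of-structures style, so each thread's load lands in a
different 128-byte cache line and the warp's reads can't coalesce.

    #include <cstdint>

    // Assumed layout for illustration only: a "DOM node" that is itself
    // a full 128 bytes, so consecutive nodes sit in different cache lines.
    struct Node {
        uint32_t id_hash;          // hashed id attribute
        uint32_t class_hashes[8];  // hashed class list
        uint32_t local_name;       // hashed tag name
        uint32_t parent;           // index of parent node
        uint32_t other[21];        // attributes, flags, etc.; sizeof(Node) == 128
    };

    __global__ void match_ids(const Node* nodes,
                              const uint32_t* selector_id_hashes,
                              uint32_t n_nodes, uint32_t n_selectors,
                              uint8_t* out_matched) {
        uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_nodes) return;
        // Thread i reads nodes[i]; with 128-byte nodes, every thread in the
        // warp touches a different cache line for this single load.
        uint32_t id = nodes[i].id_hash;
        uint8_t matched = 0;
        for (uint32_t s = 0; s < n_selectors; ++s)
            matched |= (id == selector_id_hashes[s]);
        out_matched[i] = matched;
    }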
Artificially shrinking DOM nodes to unrealistic sizes and packing them
together (cheating, as far as I'm concerned) brings the performance back
up. But I don't know how to make that work in the face of a dynamically
changing DOM.
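For contrast, here is what the packed ("cheating") variant looks like,
again with made-up names: structure-of-arrays, so a warp's 32 id-hash
loads land in a single 128-byte line. The catch, as above, is keeping
arrays like these dense and current as nodes are inserted and removed.

    #include <cstdint>

    // Assumed layout for illustration only: the fields needed for matching
    // are split out into densely packed arrays.
    struct PackedNodes {
        const uint32_t* id_hashes;    // one entry per node, contiguous
        const uint32_t* local_names;  // likewise
    };

    __global__ void match_ids_packed(PackedNodes nodes,
                                     const uint32_t* selector_id_hashes,
                                     uint32_t n_nodes, uint32_t n_selectors,
                                     uint8_t* out_matched) {
        uint32_t i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n_nodes) return;
        // Threads 0..31 of a warp read id_hashes[0..31]: one 128-byte
        // transaction instead of 32.
        uint32_t id = nodes.id_hashes[i];
        uint8_t matched = 0;
        for (uint32_t s = 0; s < n_selectors; ++s)
            matched |= (id == selector_id_hashes[s]);
        out_matched[i] = matched;
    }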
I'll try it on Kaveri soon, but the indications are that we'll have some
hurdles to overcome.
Patrick