Hi everyone,

Over the weekend I created a small repo to play around with selector matching on the GPU:

https://github.com/pcwalton/selectron

There's a rough prototype of selector matching in there, with CPU, OpenCL (GPU and CPU), and CUDA versions. I've only tried it on my MBP's GeForce GT 650M.

So far the numbers have not been particularly good: 10%-100% slower than the parallel CPU version, depending on workload size, even without counting memory transfer. (I'm assuming that anywhere we would want to deploy this would have zero-copy operation.)

From my profiling, the branchiness of selector matching does not seem to be the problem: matching is surprisingly straight-line if the hash tables are implemented properly. Rather, the issue is that my GPU, at least, really wants to read multiple DOM nodes out of the same 128-byte cache line. Since it isn't realistic for more than a couple of DOM nodes to fit in a 128-byte line, 89% (!!) of instructions end up being replayed because of memory reads. Artificially shrinking DOM nodes to unrealistic sizes and packing them together (which I consider cheating) brings the performance back up, but I don't know how to make that layout work in the face of a dynamically changing DOM.
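For anyone who hasn't run into this before, here's a minimal CUDA sketch of the access-pattern difference I mean. This is not code from the selectron repo; the struct layout, field names, and sizes are made up for illustration. The point is just that with one thread per node and realistically sized node records, each lane in a warp pulls from a different 128-byte line, whereas packing the matching-relevant field into its own array lets a warp's reads coalesce:

    // Hypothetical layouts, not the selectron code.
    #include <cstdint>
    #include <cstdio>
    #include <cuda_runtime.h>

    // Array-of-structs: each node is a realistically large record, so the one
    // field a warp needs is strided 256 bytes apart across lanes -- every lane
    // touches a different 128-byte line and the loads get replayed.
    struct NodeAoS {
        uint32_t tag_id;
        uint32_t element_id;
        uint32_t class_hash;
        uint8_t  rest[244];   // stand-in for the remainder of a real DOM node
    };

    __global__ void match_aos(const NodeAoS* nodes, uint32_t target_tag,
                              uint8_t* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = (nodes[i].tag_id == target_tag);  // 256-byte stride per lane
    }

    // Struct-of-arrays: the field needed for this check is packed contiguously,
    // so 32 lanes read 32 adjacent words -- a couple of 128-byte lines per warp.
    __global__ void match_soa(const uint32_t* tag_ids, uint32_t target_tag,
                              uint8_t* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = (tag_ids[i] == target_tag);       // 4-byte stride per lane
    }

    int main() {
        const int n = 1 << 20;
        NodeAoS*  nodes;
        uint32_t* tag_ids;
        uint8_t*  out;
        cudaMallocManaged(&nodes,   n * sizeof(NodeAoS));
        cudaMallocManaged(&tag_ids, n * sizeof(uint32_t));
        cudaMallocManaged(&out,     n);
        for (int i = 0; i < n; i++) {
            nodes[i].tag_id = i % 64;
            tag_ids[i]      = i % 64;
        }

        const int block = 256, grid = (n + block - 1) / block;
        match_aos<<<grid, block>>>(nodes, 7, out, n);
        cudaDeviceSynchronize();
        match_soa<<<grid, block>>>(tag_ids, 7, out, n);
        cudaDeviceSynchronize();

        printf("done (time the two kernels with nvprof to see the gap)\n");
        cudaFree(nodes); cudaFree(tag_ids); cudaFree(out);
        return 0;
    }

The second layout is roughly what the "cheating" experiment approximates; keeping something like it up to date under DOM mutation is the part I don't have an answer for.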

I'll try soon on Kaveri, but indications are that we'll have some hurdles to overcome.

Patrick