As a follow-up, I profiled a synthetic approximation of selector matching on the AMD Kaveri (AMD A10-7850K 3.70 GHz quad-core CPU + Radeon R7 8-compute-unit GPU). This is a new "APU", released in Q1 this year, which integrates the CPU and GPU on the same die and provides cache-coherent memory accesses between the two. Since CSS layout is so memory bound, this seems of particular interest to us.

Here are the results. The first section of the table represents the devices in my MacBook Pro; the second represents the devices on the Kaveri APU.

Number of DOM nodes: 102,400
Max DOM depth: 10 nodes
CSS styles: 32 32-bit values
DOM node size: 136 bytes
Number of CSS rules: 32 (25 ID rules + 8 tag rules)
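
For a sense of scale, here's one node layout that is consistent with those numbers. This is purely illustrative; it's not necessarily how the harness actually packs nodes:

    /* One possible packing consistent with the parameters above:
       two 32-bit words of metadata plus 32 32-bit style slots
       comes to 8 + 128 = 136 bytes per node. Illustrative only. */
    #include <stdint.h>

    struct dom_node {
        uint32_t id;          /* interned element id    */
        uint32_t tag;         /* interned tag name      */
        uint32_t style[32];   /* 32 32-bit style values */
    };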

 Device               | Memory          | Copy/map time | Execution time
----------------------+-----------------+---------------+----------------
 GPU: GeForce GT 650M | Device, copying | 8.85 ms       | 3.44 ms
 GPU: GeForce GT 650M | Device, mapped  | 0.04 ms       | 7.04 ms
 CPU: Core i7 2.70GHz | Direct          | 0.02 ms       | 5.35 ms
----------------------+-----------------+---------------+----------------
 GPU: A10-7850K APU   | Device, copying | 29.81 ms      | 3.31 ms
 GPU: A10-7850K APU   | Device, mapped  | 0.02 ms       | 2.42 ms
 GPU: A10-7850K APU   | Host, shared    | 0.00 ms       | 19.63 ms
 CPU: A10-7850K APU   | Direct          | 0.02 ms       | 7.62 ms
----------------------+-----------------+---------------+----------------
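
To make the "Memory" column concrete, here is roughly what the three GPU strategies look like in OpenCL terms. This is a simplified sketch, not the actual harness code; the names are placeholders and the exact flags depend on the platform:

    #include <string.h>
    #include <CL/cl.h>

    /* Sketch of the three GPU upload strategies in the table above.
       Error handling and setup omitted; ctx, queue, dom, and dom_size
       stand in for whatever the real harness uses. */
    static void upload_dom(cl_context ctx, cl_command_queue queue,
                           void *dom, size_t dom_size) {
        cl_int err;

        /* "Device, copying": device-resident buffer, data DMA'd across. */
        cl_mem dev_copy = clCreateBuffer(ctx, CL_MEM_READ_ONLY, dom_size,
                                         NULL, &err);
        clEnqueueWriteBuffer(queue, dev_copy, CL_TRUE, 0, dom_size, dom,
                             0, NULL, NULL);

        /* "Device, mapped": device-resident buffer, mapped into the host
           address space and filled in place. */
        cl_mem dev_map = clCreateBuffer(ctx, CL_MEM_READ_ONLY, dom_size,
                                        NULL, &err);
        void *p = clEnqueueMapBuffer(queue, dev_map, CL_TRUE, CL_MAP_WRITE,
                                     0, dom_size, 0, NULL, NULL, &err);
        memcpy(p, dom, dom_size);
        clEnqueueUnmapMemObject(queue, dev_map, p, 0, NULL, NULL);

        /* "Host, shared": the GPU works directly out of the existing host
           allocation, relying on the coherent bus on the APU for zero-copy
           access. (CL_MEM_ALLOC_HOST_PTR is another common way to get
           zero-copy host memory.) */
        cl_mem host_shared = clCreateBuffer(ctx,
                                            CL_MEM_READ_ONLY |
                                                CL_MEM_USE_HOST_PTR,
                                            dom_size, dom, &err);

        (void)dev_copy; (void)host_shared;
    }

The CPU "Direct" rows run over the original allocation, with no corresponding upload step.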

Some interesting conclusions that I've tentatively drawn from this:

* The performance of host shared memory (the cache-coherent stuff) is disappointing for our use case. AMD claims in the HSA manual that the shared memory bus operates at about 50% of the available bandwidth. That significantly hurts CSS selector matching, which is quite memory bound.

* On the other hand, the cache coherent memory might be useful as a way for the GPU to "hand off" work items to the CPU, even if we don't store the whole DOM/frame tree in it.

* The AMD GPUs, at least on the tightly integrated Kaveri APU, seem to be much happier with strided memory accesses. This is good news for us: it means that using the GPU for selector matching on a dynamic DOM now appears more viable. (See the kernel sketch after this list for what "strided" means here.)

* DMA bandwidth is really high; transfers between the CPU and GPU aren't that costly on desktops when DMA is used, at least compared to the cost of selector matching as a whole.

* As far as I can tell from the docs, AMD states that on HSA hardware, device memory can be read from the host in a zero-copy manner (with some penalty), which could explain why the "device, mapped" row is so cheap on that system. (On the other hand, it could just be that DMA is so fast when both components are on the same die that I don't even notice it happening.)

* At least on the Kaveri APU, it tentatively seems that allocating the DOM and style sheets on device memory and using the GPU is faster than using the CPU for selector matching. The GPU even beats a quad-core Core i7, which is pretty neat.

* I have not tested the dynamic scheduling stuff -- i.e. following links throughout the DOM tree -- because the driver support is not yet there for that. (AMD is planning on a Q2 release of the necessary drivers.) For this and many other reasons, take these numbers with a grain of salt.
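
To illustrate what "strided" means here: with one work-item per DOM node and 136-byte nodes, neighbouring work-items read addresses 136 bytes apart rather than making nicely coalesced accesses. The kernel below is only a sketch of that access pattern (hypothetical names, ID rules only), not the real matching kernel:

    /* Illustrative OpenCL kernel: one work-item per DOM node. With
       136-byte nodes, neighbouring work-items touch addresses 136 bytes
       apart, i.e. a strided rather than coalesced pattern. */
    __kernel void match_ids(__global const uint *nodes,    /* packed node array   */
                            __global const uint *rule_ids, /* ID selectors to try */
                            uint num_rules,
                            __global uchar *matched) {
        const uint NODE_WORDS = 34;              /* 136 bytes / 4 */
        uint i = (uint)get_global_id(0);
        uint id = nodes[i * NODE_WORDS];         /* strided load */
        uchar hit = 0;
        for (uint r = 0; r < num_rules; r++)
            hit |= (uchar)(id == rule_ids[r]);
        matched[i] = hit;
    }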

I updated my repo with the newest code.

Patrick

