As a follow-up, I profiled a synthetic approximation of selector
matching on the AMD Kaveri (AMD A10-7850K 3.70 GHz quad-core CPU +
Radeon R7 8-compute-unit GPU). This is a new "APU", released in Q1 this
year, which integrates the CPU and GPU on the same die and provides
cache-coherent memory access between the two. Since CSS layout is so
memory-bound, this seems relevant to us.
Here are the results. The first section of the table represents the
devices in my MacBook Pro; the second represents the devices on the
Kaveri APU.
Number of DOM nodes: 102,400
Max DOM depth: 10 nodes
CSS styles: 32 32-bit values
DOM node size: 136 bytes
Number of CSS rules: 32 (25 ID rules + 8 tag rules)
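As a rough size check on the parameters above, this sketch computes the
footprint of the synthetic DOM. It assumes the 136-byte node is the 32
four-byte style values plus 8 bytes of other per-node data; that split is
a guess from the arithmetic, not a statement about the actual struct
layout in the benchmark.

```python
# Rough footprint of the synthetic DOM, using the parameters listed above.
NUM_NODES = 102_400
NODE_SIZE_BYTES = 136
STYLE_VALUES = 32
STYLE_VALUE_BYTES = 4  # each style value is 32 bits

style_bytes = STYLE_VALUES * STYLE_VALUE_BYTES  # style payload per node
# Assumption: whatever is left over holds IDs, tags, links, and so on.
other_bytes = NODE_SIZE_BYTES - style_bytes
dom_bytes = NUM_NODES * NODE_SIZE_BYTES         # whole DOM
dom_mib = dom_bytes / (1024 * 1024)

print(f"per node: {style_bytes} style bytes + {other_bytes} other bytes")
print(f"whole DOM: {dom_bytes} bytes ({dom_mib:.2f} MiB)")
```

So the working set is about 13.3 MiB, which is well beyond the CPU caches
on either machine.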
Device                | Memory          | Copy/map time | Execution time
----------------------+-----------------+---------------+----------------
GPU: GeForce GT 650M  | Device, copying |       8.85 ms |        3.44 ms
GPU: GeForce GT 650M  | Device, mapped  |       0.04 ms |        7.04 ms
CPU: Core i7 2.70 GHz | Direct          |       0.02 ms |        5.35 ms
----------------------+-----------------+---------------+----------------
GPU: A10-7850K APU    | Device, copying |      29.81 ms |        3.31 ms
GPU: A10-7850K APU    | Device, mapped  |       0.02 ms |        2.42 ms
GPU: A10-7850K APU    | Host, shared    |       0.00 ms |       19.63 ms
CPU: A10-7850K APU    | Direct          |       0.02 ms |        7.62 ms
----------------------+-----------------+---------------+----------------
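To compare configurations end to end, it helps to add the copy/map time
to the execution time. The sketch below does that for the rows of the
table and ranks them. Treating the simple sum as the total cost is an
assumption (it ignores any overlap between transfer and execution), and
the short device labels are just for illustration.

```python
# (device, memory mode, copy/map ms, execution ms), copied from the table.
RESULTS = [
    ("GeForce GT 650M GPU", "Device, copying", 8.85, 3.44),
    ("GeForce GT 650M GPU", "Device, mapped", 0.04, 7.04),
    ("Core i7 CPU", "Direct", 0.02, 5.35),
    ("A10-7850K GPU", "Device, copying", 29.81, 3.31),
    ("A10-7850K GPU", "Device, mapped", 0.02, 2.42),
    ("A10-7850K GPU", "Host, shared", 0.00, 19.63),
    ("A10-7850K CPU", "Direct", 0.02, 7.62),
]


def total_ms(row):
    """Copy/map time plus execution time, assuming no overlap."""
    _, _, copy_ms, exec_ms = row
    return copy_ms + exec_ms


# Fastest configuration first.
ranked = sorted(RESULTS, key=total_ms)
for row in ranked:
    device, memory, _, _ = row
    print(f"{total_ms(row):6.2f} ms  {device:<20} {memory}")
```

By this measure the Kaveri GPU with mapped device memory comes out first
at about 2.44 ms, ahead of the Core i7 at about 5.37 ms, and the two
"Device, copying" rows fall well behind once the copy is counted.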
Some interesting conclusions that I've tentatively drawn from this:
* The performance of host shared memory (the cache-coherent stuff) is
disappointing for our use case: 19.63 ms of execution time versus 2.42 ms
with mapped device memory. AMD claims in the HSA manual that the
shared memory bus operates at about 50% of the bandwidth available. This
significantly hurts CSS selector matching, which is quite memory-bound.
* On the other hand, the cache-coherent memory might be useful as a way
for the GPU to "hand off" work items to the CPU, even if we don't store
the whole DOM/frame tree in it.
* The AMD GPUs, at least on the tightly integrated Kaveri APU, seem to
be much happier with strided memory accesses. This is good news for
us: it means that using the GPU for selector matching on a dynamic DOM
now appears more viable.
* DMA bandwidth is really high; transfers between CPU and GPU are not
really that costly on desktops if DMA is used, at least compared to the
cost of the selector matching as a whole.
* As far as I can tell from the docs, AMD states that under HSA, device
memory can be read from the host in a zero-copy manner (with some
penalty), which could explain why the "Device, mapped" row is so
cheap on that system. (On the other hand, it could just be that the DMA
is so fast when both components are on the same die that I don't even
notice it happening.)
* At least on the Kaveri APU, it tentatively seems that allocating the
DOM and style sheets in device memory and using the GPU is faster than
using the CPU for selector matching. The GPU even beats a quad-core Core
i7, which is pretty neat.
* I have not tested the dynamic scheduling stuff -- i.e. following links
throughout the DOM tree -- because the driver support is not yet there
for that. (AMD is planning on a Q2 release of the necessary drivers.)
For this and many other reasons, take these numbers with a grain of salt.
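For readers who have not looked at the repo, here is a minimal sketch of
the kind of per-node work a synthetic selector matcher does with ID rules
and tag rules. It is not the benchmark code: the Node and Rule types, the
index_rules helper, and the "one rule sets one style slot" model are all
hypothetical simplifications. The point it illustrates is that each node
is matched independently, which is what makes the problem map onto one
GPU work item per node.

```python
from collections import defaultdict
from dataclasses import dataclass, field

NUM_STYLE_VALUES = 32  # matches "CSS styles: 32 32-bit values"


@dataclass
class Rule:
    key: int    # the ID or tag this rule selects on (hypothetical encoding)
    prop: int   # index of the style value the rule sets
    value: int  # value to store


@dataclass
class Node:
    node_id: int
    tag: int
    style: list = field(default_factory=lambda: [0] * NUM_STYLE_VALUES)


def index_rules(rules):
    """Group rules by key so each node does a lookup, not a full scan."""
    table = defaultdict(list)
    for rule in rules:
        table[rule.key].append(rule)
    return table


def match_node(node, id_rules, tag_rules):
    """Apply tag rules first, then ID rules, so the more specific ID rules win."""
    for rule in tag_rules.get(node.tag, ()):
        node.style[rule.prop] = rule.value
    for rule in id_rules.get(node.node_id, ()):
        node.style[rule.prop] = rule.value


def match_all(nodes, id_rules, tag_rules):
    # Each iteration touches only its own node; a GPU version would run
    # one work item per node instead of this loop.
    for node in nodes:
        match_node(node, id_rules, tag_rules)


nodes = [Node(node_id=i, tag=i % 4) for i in range(8)]
tag_rules = index_rules([Rule(key=1, prop=0, value=10)])
id_rules = index_rules([Rule(key=5, prop=0, value=99)])
match_all(nodes, id_rules, tag_rules)
print([n.style[0] for n in nodes])
```

Nodes 1 and 5 both carry tag 1, so both pick up the tag rule, and node 5
is then overridden by its ID rule. Every node reads and writes a full
style array, which is why memory bandwidth dominates at 102,400 nodes.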
I updated my repo with the newest code.
Patrick