Re: [dev-servo] Exploratory work for layout on the GPU

Patrick Walton Mon, 10 Mar 2014 21:16:30 -0700

As a follow-up, I profiled a synthetic approximation of selector matching on the AMD Kaveri (AMD A10-7850K 3.70 GHz quad-core CPU + Radeon R7 8-compute-unit GPU). This is a new "APU", released in Q1 this year, which has an integrated CPU and GPU on the same die and cache-coherent memory accesses between the two. Since CSS layout is so memory bound, this seems of interest to us.

Here are the results. The first section of the table represents the devices in my MacBook Pro; the second represents the devices on the Kaveri APU.


Number of DOM nodes: 102,400
Max DOM depth: 10 nodes
CSS styles: 32 32-bit values
DOM node size: 136 bytes
Number of CSS rules: 32 (25 ID rules + 8 tag rules)

 Device               | Memory          | Copy/map time | Execution time
----------------------+-----------------+---------------+----------------
 GPU: GeForce GT 650M | Device, copying | 8.85 ms       | 3.44 ms
 GPU: GeForce GT 650M | Device, mapped  | 0.04 ms       | 7.04 ms
 CPU: Core i7 2.70GHz | Direct          | 0.02 ms       | 5.35 ms
----------------------+-----------------+---------------+----------------
 GPU: A10-7850K APU   | Device, copying | 29.81 ms      | 3.31 ms
 GPU: A10-7850K APU   | Device, mapped  | 0.02 ms       | 2.42 ms
 GPU: A10-7850K APU   | Host, shared    | 0.00 ms       | 19.63 ms
 CPU: A10-7850K APU   | Direct          | 0.02 ms       | 7.62 ms
----------------------+-----------------+---------------+----------------
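To compare configurations end to end, the two timing columns can be added together. The following is a minimal sketch, not part of the original benchmark: it assumes the copy/map cost is paid once per run and does not overlap with execution, and the name `total_times` is made up for illustration. The numbers are copied from the table above.

```python
# Rows from the table above: (device, memory mode, copy/map ms, execution ms).
RESULTS = [
    ("GPU: GeForce GT 650M", "Device, copying", 8.85, 3.44),
    ("GPU: GeForce GT 650M", "Device, mapped", 0.04, 7.04),
    ("CPU: Core i7 2.70GHz", "Direct", 0.02, 5.35),
    ("GPU: A10-7850K APU", "Device, copying", 29.81, 3.31),
    ("GPU: A10-7850K APU", "Device, mapped", 0.02, 2.42),
    ("GPU: A10-7850K APU", "Host, shared", 0.00, 19.63),
    ("CPU: A10-7850K APU", "Direct", 0.02, 7.62),
]


def total_times(rows):
    """Return (total ms, device, memory mode) tuples, fastest first.

    Assumption: copy/map time and execution time simply add up
    (no overlap between the transfer and the kernel).
    """
    return sorted(
        (round(copy_ms + exec_ms, 2), device, mode)
        for device, mode, copy_ms, exec_ms in rows
    )


for total, device, mode in total_times(RESULTS):
    print(f"{total:6.2f} ms  {device:<22} {mode}")
```

Under that assumption, the mapped device memory configuration on the Kaveri GPU has the lowest combined time, and the copying configuration on the same GPU has the highest.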

Some interesting conclusions that I've tentatively drawn from this:

* The performance of host shared memory (the cache coherent stuff) is disappointing for our use case. AMD claims in the HSA manual that the shared memory bus operates at about 50% of the bandwidth available. This significantly hurts CSS selector matching, which is quite memory bound.

* On the other hand, the cache coherent memory might be useful as a way for the GPU to "hand off" work items to the CPU, even if we don't store the whole DOM/frame tree in it.

* The AMD GPUs, at least on the tightly integrated Kaveri APU, seem to be much happier with strided memory accesses. This is good news for us: it means that using the GPU for selector matching on a dynamic DOM now appears more viable.

* DMA bandwidth is really high; transfers between CPU and GPU are not really that costly on desktops if DMA is used, at least compared to the cost of the selector matching as a whole.

* As far as I interpret the docs, AMD states that on the HSA, device memory can be read from the host in a zero-copy manner (with some penalty), which could explain why the "device, mapped" row is so cheap on that system. (On the other hand, it could just be that the DMA is so fast when both components are on the same die that I don't even notice it happening.)

* At least on the Kaveri APU, it tentatively seems that allocating the DOM and style sheets on device memory and using the GPU is faster than using the CPU for selector matching. The GPU even beats a quad-core Core i7, which is pretty neat.

* I have not tested the dynamic scheduling stuff -- i.e. following links throughout the DOM tree -- because the driver support is not yet there for that. (AMD is planning on a Q2 release of the necessary drivers.) For this and many other reasons, take these numbers with a grain of salt.
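For readers who want a concrete picture of the kind of workload being measured, here is a rough sketch of selector matching with only ID rules and tag rules, as in the parameters listed above. It is an illustration under stated assumptions, not the benchmark code from the repo: the `Node` and `Rule` layouts, the integer stand-ins for IDs and tags, and the name `match_selectors` are all hypothetical. The point it shows is that each node is styled independently of the others, which is what makes the problem a candidate for one GPU work item per node.

```python
from dataclasses import dataclass, field

NUM_STYLE_PROPERTIES = 32  # "CSS styles: 32 32-bit values" in the parameters


@dataclass
class Node:
    node_id: int  # hypothetical integer stand-in for the element's id
    tag: int      # hypothetical integer stand-in for the tag name
    style: list = field(default_factory=lambda: [0] * NUM_STYLE_PROPERTIES)


@dataclass(frozen=True)
class Rule:
    kind: str   # "id" or "tag"
    key: int    # the id or tag value this selector matches
    prop: int   # index of the style property the rule sets
    value: int  # value written to that property


def match_selectors(nodes, rules):
    """Apply every matching rule to every node (sequential CPU sketch)."""
    by_id, by_tag = {}, {}
    for rule in rules:
        table = by_id if rule.kind == "id" else by_tag
        table.setdefault(rule.key, []).append(rule)

    # No node reads another node's data, so this loop is data parallel.
    for node in nodes:
        # Tag rules are applied first so that ID rules, which have higher
        # specificity in CSS, overwrite them.
        for rule in by_tag.get(node.tag, []) + by_id.get(node.node_id, []):
            node.style[rule.prop] = rule.value


rules = [Rule("tag", 1, 0, 10), Rule("id", 7, 0, 20), Rule("tag", 2, 1, 30)]
nodes = [Node(node_id=7, tag=1), Node(node_id=8, tag=2), Node(node_id=9, tag=3)]
match_selectors(nodes, rules)
```

In this toy run, the first node matches both a tag rule and an ID rule for property 0 and ends up with the ID rule's value; the third node matches nothing and keeps its default style. The sketch does not follow parent or sibling links, which is the untested dynamic scheduling case mentioned in the last bullet.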


I updated my repo with the newest code.

Patrick

_______________________________________________
dev-servo mailing list
dev-servo@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-servo
