Jeff Layton wrote:
> offhand, I'd guess that adaptive grids will be substantially harder
> to run efficiently on a GPU than a uniform grid.
>
> One key thing is that unstructured grid codes don't work as well.
> The problem is the indirect addressing.

Bingo. GPUs are still GPUs, and are still heavily optimized for coherent data access patterns. If cell (x, y) depends on data at (x, y), then cell (x + 1, y) had better depend on data at (x + 1, y), or performance will suffer terribly. In C-speak:
 x += C[i][j];
is good, and
 x += C[Idx[i][j]];
is bad. Similarly bad is non-coherent branching: threads execute in groups, so when threads in the same group take different branches, the divergent paths get serialized.
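
To make that concrete, here is a minimal CUDA sketch (the kernel names and signatures are mine, not from any particular code) contrasting the two access patterns; the second kernel gathers through an index array, so neighbouring threads can hit arbitrary addresses and the loads cannot be coalesced:

 // Coalesced: thread k reads C[k]; neighbouring threads read
 // neighbouring addresses, so the hardware can merge the loads.
 __global__ void sum_direct(float *x, const float *C, int n)
 {
     int k = blockIdx.x * blockDim.x + threadIdx.x;
     if (k < n)
         x[k] += C[k];
 }

 // Gathered: thread k reads C[Idx[k]]; each load can become its own
 // memory transaction -- the indirect-addressing problem above.
 __global__ void sum_indirect(float *x, const float *C,
                              const int *Idx, int n)
 {
     int k = blockIdx.x * blockDim.x + threadIdx.x;
     if (k < n)
         x[k] += C[Idx[k]];
 }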

The ideal workload is one that has minimal or no branching and can be mapped onto a computational model with a 1-, 2-, or 3-dimensional arrangement of cells, where the computation (including the relative position of any data lookups) is the same for every cell. IME, as soon as you depart significantly from this workload, you often see order-of-magnitude drops in performance.

Additionally, the round-trip CPU->GPU->CPU latency is horrific (on the order of 1 ms on my 8800GTX on Vista, though I'm not sure about the newer cards or other OSes), so unless you can get a good pipeline going, bouncing computation between the CPU and GPU can wreck the overall performance. This also makes it very hard to scale out to more than one card.
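
As an illustration of that ideal case (my own sketch, not taken from any production code): the kind of kernel that maps well is a fixed-offset stencil over a regular grid, where every thread does identical arithmetic and every data lookup sits at the same relative position, so there is no divergence and the loads coalesce:

 // 5-point stencil on an nx-by-ny grid (interior points only).
 __global__ void stencil5(float *out, const float *in, int nx, int ny)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
         int c = j * nx + i;
         out[c] = 0.25f * (in[c - 1] + in[c + 1] +
                           in[c - nx] + in[c + nx]);
     }
 }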

I've spent a fair amount of time tweaking a bit of software that at its core is an RKF45 adaptive integrator over a number of independent entities, with some other GPU-unfriendly code (very branchy, and with PRNGs). The optimal method I've found for this code is to do the integration substeps on the GPU but all other processing on the CPU. The GPU doesn't worry whether the requested substep has excessive error; it just passes the better step size back to the CPU and doesn't update the data. The CPU then notices that the returned "next" step size is smaller than the step size it sent, and handles the rejected step accordingly. Subdividing steps on the GPU (or simply looping around with the smaller step sizes until the error is sufficiently small) is a performance loss.

Additionally, since the entities are essentially independent, I can have multiple sets in progress at once. The peak seems to be to break the work into 4 sets, presumably corresponding to one being sent to the GPU, one being processed on the GPU, one coming back from the GPU, and one being processed on the CPU. The performance gain going from 1 set to 4 is about a factor of 2.5.
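
For what it's worth, the host side of that scheme looks roughly like the sketch below. The kernel here is a trivial stand-in (the real one would shrink h[i] and skip the state update when the embedded error estimate is too large), the single float of state per entity and the stream handling are my own simplifications, and whether copies and kernels actually overlap depends on the card:

 #include <stdlib.h>
 #include <string.h>

 // Stand-in for the real RKF45 substep kernel.
 __global__ void rkf45_substep(float *state, float *h, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         state[i] += h[i];   // placeholder for the actual substep
 }

 // One pass over one set of entities.  Several such sets are kept in
 // flight at once (one uploading, one on the GPU, one downloading, one
 // being worked on by the CPU) to hide the round-trip latency.
 void advance_set(float *state, float *h, float *d_state, float *d_h,
                  int n, cudaStream_t stream)
 {
     float *h_sent = (float *)malloc(n * sizeof(float));
     memcpy(h_sent, h, n * sizeof(float));   /* remember what we asked for */

     cudaMemcpyAsync(d_state, state, n * sizeof(float),
                     cudaMemcpyHostToDevice, stream);
     cudaMemcpyAsync(d_h, h, n * sizeof(float),
                     cudaMemcpyHostToDevice, stream);
     rkf45_substep<<<(n + 255) / 256, 256, 0, stream>>>(d_state, d_h, n);
     cudaMemcpyAsync(state, d_state, n * sizeof(float),
                     cudaMemcpyDeviceToHost, stream);
     cudaMemcpyAsync(h, d_h, n * sizeof(float),
                     cudaMemcpyDeviceToHost, stream);
     cudaStreamSynchronize(stream);

     for (int i = 0; i < n; i++) {
         if (h[i] < h_sent[i]) {
             /* The GPU rejected the substep and did not touch the
                state; resubmit with the smaller step on the next pass
                instead of looping on the GPU. */
         }
     }
     free(h_sent);
 }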


Cheers,
Michael
