Jeff Layton wrote:
> offhand, I'd guess that adaptive grids will be substantially harder
> to run efficiently on a GPU than a uniform grid.
>
> One key thing is that unstructured grid codes don't work as well.
> The problem is the indirect addressing.

Bingo. GPUs are still GPUs, and are still heavily optimized for coherent data access patterns. If cell (x, y) depends on data at (x, y), then cell (x + 1, y) had better depend on data at (x + 1, y), or performance will suffer terribly. In C-speak:
 x += C[i][j];
is good, and
 x += C[Idx[i][j]];
is bad. Similarly bad is non-coherent branching: threads execute in groups, so when threads in the same group take different branches, the divergent paths get serialized.
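
To make that concrete, here is a minimal CUDA sketch (the kernel names and signatures are mine, not from any particular code) contrasting the two access patterns; the second kernel gathers through an index array, so neighbouring threads can hit arbitrary addresses and the loads cannot be coalesced:

 // Coalesced: thread k reads C[k]; neighbouring threads read
 // neighbouring addresses, so the hardware can merge the loads.
 __global__ void sum_direct(float *x, const float *C, int n)
 {
     int k = blockIdx.x * blockDim.x + threadIdx.x;
     if (k < n)
         x[k] += C[k];
 }

 // Gathered: thread k reads C[Idx[k]]; each load can become its own
 // memory transaction -- the indirect-addressing problem above.
 __global__ void sum_indirect(float *x, const float *C,
                              const int *Idx, int n)
 {
     int k = blockIdx.x * blockDim.x + threadIdx.x;
     if (k < n)
         x[k] += C[Idx[k]];
 }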

The ideal workload is one that has minimal or no branching and can be mapped onto a computational model with a 1-, 2-, or 3-dimensional arrangement of cells, where the computation (including the relative position of any data lookups) is the same for every cell. IME, as soon as you depart significantly from this workload, you often see order-of-magnitude drops in performance.

Additionally, the round-trip CPU->GPU->CPU latency is horrific (on the order of 1 ms on my 8800GTX on Vista, though I'm not sure about the newer cards or other OSes), so unless you can get a good pipeline going, bouncing computation between the CPU and GPU can wreck the overall performance. This also makes it very hard to scale out to more than one card.
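
As an illustration of that ideal case (my own sketch, not taken from any production code): the kind of kernel that maps well is a fixed-offset stencil over a regular grid, where every thread does identical arithmetic and every data lookup sits at the same relative position, so there is no divergence and the loads coalesce:

 // 5-point stencil on an nx-by-ny grid (interior points only).
 __global__ void stencil5(float *out, const float *in, int nx, int ny)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     int j = blockIdx.y * blockDim.y + threadIdx.y;
     if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
         int c = j * nx + i;
         out[c] = 0.25f * (in[c - 1] + in[c + 1] +
                           in[c - nx] + in[c + nx]);
     }
 }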

I've spent a fair amount of time tweaking a bit of software that at its core is an RKF45 adaptive integrator over a number of independent entities, with some other GPU-unfriendly code (very branchy, and with PRNGs). The optimal method I've found for this code is to do the integration substeps on the GPU but all other processing on the CPU. The GPU doesn't worry whether the requested substep has excessive error; it just passes the better step size back to the CPU and doesn't update the data. The CPU then notices that the returned "next" step size is smaller than the step size it sent, and handles the rejected step accordingly. Subdividing steps on the GPU (or simply looping around with the smaller step sizes until the error is sufficiently small) is a performance loss.

Additionally, since the entities are essentially independent, I can have multiple sets in progress at once. The peak seems to be to break the work into 4 sets, presumably corresponding to one being sent to the GPU, one being processed on the GPU, one coming back from the GPU, and one being processed on the CPU. The performance gain going from 1 set to 4 is about a factor of 2.5.
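
For what it's worth, the host side of that scheme looks roughly like the sketch below. The kernel here is a trivial stand-in (the real one would shrink h[i] and skip the state update when the embedded error estimate is too large), the single float of state per entity and the stream handling are my own simplifications, and whether copies and kernels actually overlap depends on the card:

 #include <stdlib.h>
 #include <string.h>

 // Stand-in for the real RKF45 substep kernel.
 __global__ void rkf45_substep(float *state, float *h, int n)
 {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)
         state[i] += h[i];   // placeholder for the actual substep
 }

 // One pass over one set of entities.  Several such sets are kept in
 // flight at once (one uploading, one on the GPU, one downloading, one
 // being worked on by the CPU) to hide the round-trip latency.
 void advance_set(float *state, float *h, float *d_state, float *d_h,
                  int n, cudaStream_t stream)
 {
     float *h_sent = (float *)malloc(n * sizeof(float));
     memcpy(h_sent, h, n * sizeof(float));   /* remember what we asked for */

     cudaMemcpyAsync(d_state, state, n * sizeof(float),
                     cudaMemcpyHostToDevice, stream);
     cudaMemcpyAsync(d_h, h, n * sizeof(float),
                     cudaMemcpyHostToDevice, stream);
     rkf45_substep<<<(n + 255) / 256, 256, 0, stream>>>(d_state, d_h, n);
     cudaMemcpyAsync(state, d_state, n * sizeof(float),
                     cudaMemcpyDeviceToHost, stream);
     cudaMemcpyAsync(h, d_h, n * sizeof(float),
                     cudaMemcpyDeviceToHost, stream);
     cudaStreamSynchronize(stream);

     for (int i = 0; i < n; i++) {
         if (h[i] < h_sent[i]) {
             /* The GPU rejected the substep and did not touch the
                state; resubmit with the smaller step on the next pass
                instead of looping on the GPU. */
         }
     }
     free(h_sent);
 }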


Cheers,
Michael
