Andrew Stubbs <a...@codesourcery.com> writes:
> This patch implements a floating-point fold_left_plus vector pattern,
> which gives a significant speed-up in the BabelStream "dot" benchmark.
>
> The GCN architecture can't actually do an in-order vector reduction any
> more efficiently than that equivalent scalar algorithm, so this is a bit
> of a cheat. However, dividing the problem into threads using OpenACC or
> OpenMP has already broken the in-order semantics, so we may as well
> optimize the operation at the vector level too.
>
> If the user has specifically sorted the input data in order to get a
> more correct FP result then using multiple threads is already the wrong
> thing to do. But, if the input data is in no particular numerical order
> then this optimization will give a correct answer much faster, albeit
> possibly a slightly different one each run.

There doesn't seem to be anything GCN-specific here though. If pragmas
say that we can ignore associativity rules, we should apply that in
target-independent code rather than in each individual target.

Thanks,
Richard
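
For anyone following along, here is a minimal sketch of why the in-order
(fold_left) semantics matter for floating point. It is plain C, not code
from the patch: the function names sum_in_order and sum_by_lanes and the
MAX_LANES cap are made up for illustration, and the lane-wise version only
approximates how a vector reduction might regroup the additions.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 64  /* hypothetical cap, chosen only for this sketch */

/* Strict in-order reduction: ((x[0] + x[1]) + x[2]) + ...
   This is the ordering that fold-left semantics require.  */
float
sum_in_order (const float *x, size_t n)
{
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)
    acc += x[i];
  return acc;
}

/* Reassociated reduction: element i is accumulated into lane i % lanes,
   and the per-lane partial sums are combined at the end.  The same
   additions happen, but they are grouped differently.  */
float
sum_by_lanes (const float *x, size_t n, size_t lanes)
{
  float partial[MAX_LANES] = { 0.0f };
  assert (lanes > 0 && lanes <= MAX_LANES);

  for (size_t i = 0; i < n; i++)
    partial[i % lanes] += x[i];

  float acc = 0.0f;
  for (size_t l = 0; l < lanes; l++)
    acc += partial[l];
  return acc;
}
```

With the input { 1e8f, 1.0f, -1e8f, 1.0f }, and assuming IEEE single
precision with no excess intermediate precision, the in-order sum loses the
first 1.0f to rounding, while two lanes keep the small values together, so
the two functions return different results. That regrouping is what GCC
normally refuses to do for floating point unless something like
-fassociative-math permits it.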