Now backported to OG10.
Andrew
On 03/07/2020 11:11, Andrew Stubbs wrote:
This patch implements a floating-point fold_left_plus vector pattern,
which gives a significant speed-up in the BabelStream "dot" benchmark.
The GCN architecture can't actually do an in-order vector reduction any
more efficiently than the equivalent scalar algorithm, so this is a bit
of a cheat. However, dividing the problem into threads using OpenACC or
OpenMP has already broken the in-order semantics, so we may as well
optimize the operation at the vector level too.
If the user has specifically sorted the input data in order to get a
more accurate FP result, then using multiple threads is already the wrong
thing to do. But, if the input data is in no particular numerical order
then this optimization will give a correct answer much faster, albeit
possibly a slightly different one each run.
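To illustrate why the summation order matters, here is a minimal sketch in plain C. It is not the code from the patch: the function names sum_in_order and sum_tree are hypothetical, and the 64-element limit is only an assumption chosen to resemble a single wide vector. The first function adds the elements strictly left to right, as in-order semantics require; the second combines them pairwise in a tree, which is the general shape a vector-level reduction tends to take.

```c
#include <stddef.h>

/* Hypothetical helper: strict left-to-right (fold-left) summation.
   Each addition depends on the previous one, so it cannot be
   parallelized without changing the rounding behaviour.  */
static float
sum_in_order (const float *v, size_t n)
{
  float acc = 0.0f;
  for (size_t i = 0; i < n; i++)
    acc += v[i];
  return acc;
}

/* Hypothetical helper: pairwise tree summation.
   Assumes n is a power of two and at most 64 (an illustrative
   limit only).  At each step the upper half of the working buffer
   is added onto the lower half, so the additions within one step
   are independent of each other.  */
static float
sum_tree (const float *v, size_t n)
{
  float tmp[64];

  if (n == 0 || n > 64)
    return 0.0f;

  for (size_t i = 0; i < n; i++)
    tmp[i] = v[i];

  for (size_t stride = n / 2; stride > 0; stride /= 2)
    for (size_t i = 0; i < stride; i++)
      tmp[i] += tmp[i + stride];

  return tmp[0];
}
```

With IEEE single-precision arithmetic and the input { 1e8f, 1.0f, -1e8f, 1.0f }, the two functions return different values, because the in-order version loses the first 1.0f when it is added to 1e8f, whereas the tree version cancels the two large values first. Neither order is more accurate in general; which one comes out closer depends entirely on the data.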
Andrew