Andrew Stubbs <a...@codesourcery.com> writes:
> This patch implements a floating-point fold_left_plus vector pattern, 
> which gives a significant speed-up in the BabelStream "dot" benchmark.
>
> The GCN architecture can't actually do an in-order vector reduction any
> more efficiently than the equivalent scalar algorithm, so this is a bit
> of a cheat.  However, dividing the problem into threads using OpenACC or 
> OpenMP has already broken the in-order semantics, so we may as well 
> optimize the operation at the vector level too.
>
> If the user has specifically sorted the input data in order to get a 
> more correct FP result then using multiple threads is already the wrong 
> thing to do. But, if the input data is in no particular numerical order 
> then this optimization will give a correct answer much faster, albeit 
> possibly a slightly different one each run.

There doesn't seem to be anything GCN-specific here though.
If pragmas say that we can ignore associativity rules, we should apply
that in target-independent code rather than in each individual target.
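
To make the associativity point concrete, here is a minimal sketch in plain C. It is not taken from the patch, and the function names and MAX_LANES bound are hypothetical. It contrasts a strict in-order sum (the fold_left_plus semantics) with a lane-wise reduction of the kind a vectorizer could use once reassociation is allowed. It assumes the code is built without -ffast-math or similar flags, so the compiler does not reorder the additions itself.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 64  /* hypothetical upper bound on the vector width */

/* Strict in-order reduction: (((0 + a[0]) + a[1]) + ...) + a[n-1]. */
float sum_in_order(const float *a, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    return acc;
}

/* Reassociated reduction: each of `lanes` accumulators sums every
   lanes-th element, then the partial sums are combined at the end. */
float sum_by_lanes(const float *a, size_t n, size_t lanes)
{
    assert(lanes > 0 && lanes <= MAX_LANES);
    float partial[MAX_LANES] = { 0.0f };
    for (size_t i = 0; i < n; i++)
        partial[i % lanes] += a[i];
    float acc = 0.0f;
    for (size_t l = 0; l < lanes; l++)
        acc += partial[l];
    return acc;
}
```

For the input {1e30f, 1.0f, -1e30f, 1.0f}, the in-order sum is 1.0f, because the first 1.0f is absorbed by 1e30f. With two lanes the result is 2.0f, because the large values cancel in one lane and the small ones add up in the other. Nothing in that difference depends on the target, only on whether reassociation is permitted.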

Thanks,
Richard
