https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #21 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #20)
> Fixed.

Unfortunately it's only fixed for integer, not FP.  The OpenMP and vanilla
float array-sum functions from the godbolt link in the initial bug report
still use 256b shuffles, including a gratuitous vperm2f128 when the upper
half isn't even used, where vextractf128 would have done the same job in
1 uop on Ryzen instead of 8.

Even on Intel CPUs, they're optimized for code-size, not performance
(vhaddps instead of shuffle / vaddps).  Remember that Intel CPUs with AVX
have only one FP shuffle unit (including Sandy/Ivybridge, which has 2
integer-128 shuffle units).

float sumfloat_autovec(const float arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  float sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}

# gcc 20180113 -mavx2 -ffast-math -O3
# (tune=generic, and even with arch=znver1 no-prefer-avx128)
    ...
    vhaddps    %ymm0, %ymm0, %ymm0
    vhaddps    %ymm0, %ymm0, %ymm1
    vperm2f128 $1, %ymm1, %ymm1, %ymm0     # why not vextractf128?
    vaddps     %ymm1, %ymm0, %ymm0         # gratuitous 256b
    vzeroupper

This bug is still present for FP code: it narrows from 256b to scalar only
in the last step.

Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles, but
it's still 2 uops for port 5, vs. VSHUFPS + VADDPS.  (It also costs an extra
cycle of latency: with only 1 shuffle port, the 2 interleave-shuffles that
feed a vertical-add uop can't run in the same cycle.)  (V)HADDPS with the
same input twice is almost never the best choice for performance.

On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel)
and 7c latency (vs. 3c for ADDPS).  256b VHADDPS ymm is 8 uops, one per 3
cycles throughput, and Agner Fog reports that it's "mixed domain", i.e. some
kind of penalty for ivec / fp domain crossing.  I guess the shuffles it uses
internally are ivec domain?

With multiple threads on the same core, or even with ILP from surrounding
code, uop throughput matters as well as latency, so more uops is worse even
before counting the latency cost.

The sequence I'd recommend (for both Intel and AMD) is (see also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026):

    vextractf128 $1, %ymm0, %xmm1
    vaddps    %xmm1, %xmm0, %xmm0    # narrow to 128b
    vmovshdup %xmm0, %xmm1           # copy high->low within each pair
    vaddps    %xmm1, %xmm0, %xmm0
    vmovhlps  %xmm0, %xmm0, %xmm1    # duplicate high 64b
    vaddps    %xmm1, %xmm0, %xmm0

(An intrinsics version of this sequence is sketched at the end of this
comment.)

The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you
can do a 128b hsum in 4 instructions, with no MOVAPS.

For the horizontal-sum part (my sequence vs. what gcc currently emits):

  Intel:     6 uops total, 3 shuffles   vs.  8 uops total, 5 of them shuffles.
  AMD Ryzen: 6 uops total, 3 shuffles   vs. 26 uops total, 20 of them
             shuffles, and much worse latency, too.

Even just fixing this specific bug (vperm2f128 where vextractf128 would do)
without fixing the rest of the sequence would help AMD *significantly*,
because vextractf128 is very cheap and vhaddps xmm is only half the uops of
the ymm version.  (But the latency still sucks.)

-----

Even for integer, this patch didn't fix the MOVDQA + PSRLDQ we get without
AVX.  PSHUFD or PSHUFLW to copy-and-shuffle is cheaper.  I guess I need to
report that bug separately, because it probably won't get fixed soon: if I
understand correctly, there's no mechanism for the back-end to tell the
auto-vectorizer which shuffles it can do efficiently!
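In case it's useful, here's roughly what I mean for the SSE2 integer case,
written with intrinsics.  This is an untested sketch, and the helper name is
mine, not anything gcc emits; the point is that each _mm_shuffle_epi32 /
_mm_shufflelo_epi16 can compile to a single PSHUFD / PSHUFLW that writes a
fresh register, so no MOVDQA copy is needed the way it is before PSRLDQ.

    #include <emmintrin.h>   /* SSE2 */

    /* Untested sketch; hsum_epi32_sse2 is my name, not gcc output.
       Horizontal sum of four 32-bit ints using copy-and-shuffle
       (PSHUFD / PSHUFLW) instead of MOVDQA + PSRLDQ. */
    static inline int hsum_epi32_sse2(__m128i v)
    {
        __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));       /* pshufd: swap 64b halves */
        __m128i sum64 = _mm_add_epi32(v, hi64);
        __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); /* pshuflw: swap the 2 low dwords */
        __m128i sum32 = _mm_add_epi32(sum64, hi32);
        return _mm_cvtsi128_si32(sum32);                                     /* movd the low element */
    }

Compiled for plain SSE2, that should come out as pshufd / paddd / pshuflw /
paddd / movd, with no register-copy instruction anywhere.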
It usually won't make too much difference, but for very small arrays (like 8 `int` elements) the hsum is a big part of the cost, although it's probably still worth auto-vectorizing *if* you can do it efficiently.
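Back to the FP case for a moment: for reference, the recommended sequence
above maps to intrinsics roughly like this.  Again an untested sketch, and
the function name is mine, not anything gcc emits:

    #include <immintrin.h>

    /* Untested sketch; hsum256_ps_avx is my name, not gcc output.
       This is the vextractf128 / vmovshdup / vmovhlps sequence from above. */
    static inline float hsum256_ps_avx(__m256 v)
    {
        __m128 lo   = _mm256_castps256_ps128(v);    /* no instruction, just the low half */
        __m128 hi   = _mm256_extractf128_ps(v, 1);  /* vextractf128 $1 */
        __m128 sum  = _mm_add_ps(lo, hi);           /* narrow to 128b */
        __m128 shuf = _mm_movehdup_ps(sum);         /* vmovshdup: copy high->low in each pair */
        sum  = _mm_add_ps(sum, shuf);
        shuf = _mm_movehl_ps(sum, sum);             /* vmovhlps: duplicate high 64b */
        sum  = _mm_add_ps(sum, shuf);
        return _mm_cvtss_f32(sum);                  /* low element */
    }

Compiled with AVX enabled, that should produce essentially the 6-instruction
sequence shown above.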