https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|---                         |FIXED
--- Comment #27 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Peter Cordes from comment #21)
> (In reply to Richard Biener from comment #20)
> > Fixed.
>
> Unfortunately only fixed for integer, not FP. The OpenMP and vanilla float
> array sum functions from the godbolt link in the initial bug report still
> use 256b shuffles, including a gratuitous vperm2f128 when the upper half
> isn't used, so vextractf128 would have done the same job in 1 uop on Ryzen
> instead of 8.
>
> Even on Intel CPUs, they're optimized for code-size, not performance
> (vhaddps instead of shuffle / vaddps). Remember that Intel CPUs with AVX
> only have one FP shuffle unit. (Including Sandy/Ivybridge, which has 2
> integer-128 shuffle units)
>
> float sumfloat_autovec(const float arr[]) {
>   arr = __builtin_assume_aligned(arr, 64);
>   float sum=0;
>   for (int i=0 ; i<1024 ; i++)
>     sum = sum + arr[i];
>   return sum;
> }
>
> # gcc 20180113 -mavx2 -ffast-math -O3
> # (tune=generic, and even with arch=znver1 no-prefer-avx128)
> ...
>         vhaddps    %ymm0, %ymm0, %ymm0
>         vhaddps    %ymm0, %ymm0, %ymm1
>         vperm2f128 $1, %ymm1, %ymm1, %ymm0    # why not vextract?
>         vaddps     %ymm1, %ymm0, %ymm0        # gratuitous 256b
>         vzeroupper
>
> This bug is still present for FP code: it narrows from 256b to scalar only
> in the last step.
>
> Every VHADDPS is 2 shuffles + 1 add on Intel. They're in-lane shuffles, but
> it's still 2 uops for port5 vs. VSHUFPS + VADDPS. (Costing an extra cycle
> of latency because with only 1 shuffle port, the 2 interleave-shuffles that
> feed a vertical-add uop can't run in the same cycle.) (V)HADDPS with the
> same input twice is almost never the best choice for performance.
>
> On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel).
> It's also 7c latency (vs. 3 for ADDPS). 256b VHADDPS ymm is 8 uops, one per
> 3 cycle throughput, and Agner Fog reports that it's "mixed domain", i.e.
> some kind of penalty for ivec / fp domain crossing. I guess the shuffles it
> uses internally are ivec domain?
>
> With multiple threads on the same core, or even with ILP with surrounding
> code, uop throughput matters as well as latency, so more uops is worse even
> if it didn't have latency costs.
>
> The sequence I'd recommend (for both Intel and AMD) is:
> (See also http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026)
>
>
>     vextractf128  $1, %ymm0, %xmm1
>     vaddps        %xmm1, %xmm0, %xmm0    # narrow to 128b
>
>     vmovshdup     %xmm0, %xmm1           # copy high->low in each pair
>     vaddps        %xmm1, %xmm0, %xmm0
>
>     vmovhlps      %xmm0, %xmm0, %xmm1    # duplicate high 64b
>     vaddps        %xmm1, %xmm0, %xmm0
>
> The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you
> can do a 128b hsum with 4 instructions, with no MOVAPS.
>
> Intel: 6 uops total, 3 shuffles. vs. 8 total, 5 shuffles
>
> AMD Ryzen: 6 uops, 3 shuffles. vs. 26 total uops, 20 of them shuffles. And
> much worse latency, too.
>
> Even just fixing this specific bug without fixing the rest of the sequence
> would help AMD *significantly*, because vextractf128 is very cheap, and
> vhaddps xmm is only half the uops of ymm. (But the latency still sucks).
Note that this is deliberately left as-is because the target advertises
(cheap) support for horizontal reduction. The vectorizer simply generates
a single statement for the reduction epilogue:
  <bb 4> [local count: 10737418]:
  stmp_sum_11.5_20 = REDUC_PLUS (vect_sum_11.4_6); [tail call]
  return stmp_sum_11.5_20;
so either the target shouldn't tell the vectorizer that it supports this,
or it simply needs to expand to better code. Which means: can you open a
separate bug for this? The backend currently does:
(define_expand "reduc_plus_scal_v8sf"
  [(match_operand:SF 0 "register_operand")
   (match_operand:V8SF 1 "register_operand")]
  "TARGET_AVX"
{
  rtx tmp = gen_reg_rtx (V8SFmode);
  rtx tmp2 = gen_reg_rtx (V8SFmode);
  rtx vec_res = gen_reg_rtx (V8SFmode);
  emit_insn (gen_avx_haddv8sf3 (tmp, operands[1], operands[1]));
  emit_insn (gen_avx_haddv8sf3 (tmp2, tmp, tmp));
  emit_insn (gen_avx_vperm2f128v8sf3 (tmp, tmp2, tmp2, GEN_INT (1)));
  emit_insn (gen_addv8sf3 (vec_res, tmp, tmp2));
  emit_insn (gen_vec_extractv8sfsf (operands[0], vec_res, const0_rtx));
  DONE;
})
Changing it to your sequence shouldn't be too difficult.
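
For illustration, a minimal intrinsics sketch of that sequence (the helper
name hsum_v8sf is made up here; the expander would of course emit the
corresponding insns directly rather than go through intrinsics):

#include <immintrin.h>

/* Sketch of the suggested V8SF reduction epilogue, assuming AVX.
   Roughly one statement per instruction of the proposed
   vextractf128 / vaddps / vmovshdup / vmovhlps sequence.  */
static float
hsum_v8sf (__m256 v)
{
  __m128 hi   = _mm256_extractf128_ps (v, 1);                /* vextractf128 $1 */
  __m128 sum  = _mm_add_ps (_mm256_castps256_ps128 (v), hi); /* narrow to 128b */
  __m128 shuf = _mm_movehdup_ps (sum);                       /* copy high->low in each pair */
  sum  = _mm_add_ps (sum, shuf);
  shuf = _mm_movehl_ps (sum, sum);                           /* duplicate high 64b */
  sum  = _mm_add_ps (sum, shuf);
  return _mm_cvtss_f32 (sum);                                /* scalar result in element 0 */
}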
> -----
>
> Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get
> without AVX. PSHUFD or PSHUFLW to copy+shuffle is cheaper. I guess I need
> to report that bug separately, because it probably won't get fixed soon: if
> I understand correctly, there's no mechanism for the back-end to tell the
> auto-vectorizer what shuffles it can do efficiently!
Also, shuffles on GIMPLE do not expose the target's micro-ops, so this is a
pure target issue. Please report it separately.
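
To make that separate report concrete, here is a sketch of what a non-AVX
128b int32 hsum tail could use instead of MOVDQA + PSRLDQ, relying on
copy-and-shuffle instructions (hsum_epi32 is just an illustrative name, not
something the compiler emits today):

#include <emmintrin.h>

/* Sketch of a 128b int32 horizontal sum using only SSE2.  PSHUFD and
   PSHUFLW copy and shuffle in one instruction, so no separate
   MOVDQA + PSRLDQ pair is needed.  */
static int
hsum_epi32 (__m128i v)
{
  __m128i hi64  = _mm_shuffle_epi32 (v, _MM_SHUFFLE (1, 0, 3, 2));       /* pshufd: swap 64b halves */
  __m128i sum64 = _mm_add_epi32 (v, hi64);
  __m128i hi32  = _mm_shufflelo_epi16 (sum64, _MM_SHUFFLE (1, 0, 3, 2)); /* pshuflw: swap low 32b pair */
  __m128i sum32 = _mm_add_epi32 (sum64, hi32);
  return _mm_cvtsi128_si32 (sum32);                                      /* movd the low element */
}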
>
> It usually won't make too much difference, but for very small arrays (like 8
> `int` elements) the hsum is a big part of the cost, although it's probably
> still worth auto-vectorizing *if* you can do it efficiently.
So I'm closing this bug again.