https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Peter Cordes <peter at cordes dot ca> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |REOPENED
         Resolution|FIXED                       |---

--- Comment #21 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Richard Biener from comment #20)
> Fixed.

Unfortunately it's only fixed for integer, not FP.  The OpenMP and vanilla
float array-sum functions from the godbolt link in the initial bug report
still use 256b shuffles, including a gratuitous vperm2f128 when the upper
half isn't even used, where vextractf128 would have done the same job in
1 uop on Ryzen instead of 8.

Even on Intel CPUs, they're optimized for code-size, not performance
(vhaddps instead of shuffle / vaddps).  Remember that Intel CPUs with AVX
have only one FP shuffle unit (including Sandy/Ivybridge, which has 2
integer-128 shuffle units).

float sumfloat_autovec(const float arr[]) {
  arr = __builtin_assume_aligned(arr, 64);
  float sum=0;
  for (int i=0 ; i<1024 ; i++)
    sum = sum + arr[i];
  return sum;
}

# gcc 20180113 -mavx2 -ffast-math -O3
# (tune=generic, and even with arch=znver1 no-prefer-avx128)
    ...
    vhaddps    %ymm0, %ymm0, %ymm0
    vhaddps    %ymm0, %ymm0, %ymm1
    vperm2f128 $1, %ymm1, %ymm1, %ymm0     # why not vextractf128?
    vaddps     %ymm1, %ymm0, %ymm0         # gratuitous 256b
    vzeroupper

This bug is still present for FP code: it narrows from 256b to scalar only
in the last step.

Every VHADDPS is 2 shuffles + 1 add on Intel.  They're in-lane shuffles, but
it's still 2 uops for port 5, vs. VSHUFPS + VADDPS.  (It also costs an extra
cycle of latency: with only 1 shuffle port, the 2 interleave-shuffles that
feed a vertical-add uop can't run in the same cycle.)  (V)HADDPS with the
same input twice is almost never the best choice for performance.

On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel)
and 7c latency (vs. 3c for ADDPS).  256b VHADDPS ymm is 8 uops, one per 3
cycles throughput, and Agner Fog reports that it's "mixed domain", i.e. some
kind of penalty for ivec / fp domain crossing.  I guess the shuffles it uses
internally are ivec domain?

With multiple threads on the same core, or even with ILP from surrounding
code, uop throughput matters as well as latency, so more uops is worse even
before counting the latency cost.

The sequence I'd recommend (for both Intel and AMD) is (see also
http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026):

    vextractf128 $1, %ymm0, %xmm1
    vaddps    %xmm1, %xmm0, %xmm0    # narrow to 128b
    vmovshdup %xmm0, %xmm1           # copy high->low within each pair
    vaddps    %xmm1, %xmm0, %xmm0
    vmovhlps  %xmm0, %xmm0, %xmm1    # duplicate high 64b
    vaddps    %xmm1, %xmm0, %xmm0

(An intrinsics version of this sequence is sketched at the end of this
comment.)

The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you
can do a 128b hsum in 4 instructions, with no MOVAPS.

For the horizontal-sum part (my sequence vs. what gcc currently emits):

  Intel:     6 uops total, 3 shuffles   vs.  8 uops total, 5 of them shuffles.
  AMD Ryzen: 6 uops total, 3 shuffles   vs. 26 uops total, 20 of them
             shuffles, and much worse latency, too.

Even just fixing this specific bug (vperm2f128 where vextractf128 would do)
without fixing the rest of the sequence would help AMD *significantly*,
because vextractf128 is very cheap and vhaddps xmm is only half the uops of
the ymm version.  (But the latency still sucks.)

-----

Even for integer, this patch didn't fix the MOVDQA + PSRLDQ we get without
AVX.  PSHUFD or PSHUFLW to copy-and-shuffle is cheaper.  I guess I need to
report that bug separately, because it probably won't get fixed soon: if I
understand correctly, there's no mechanism for the back-end to tell the
auto-vectorizer which shuffles it can do efficiently!
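In case it's useful, here's roughly what I mean for the SSE2 integer case,
written with intrinsics.  This is an untested sketch, and the helper name is
mine, not anything gcc emits; the point is that each _mm_shuffle_epi32 /
_mm_shufflelo_epi16 can compile to a single PSHUFD / PSHUFLW that writes a
fresh register, so no MOVDQA copy is needed the way it is before PSRLDQ.

    #include <emmintrin.h>   /* SSE2 */

    /* Untested sketch; hsum_epi32_sse2 is my name, not gcc output.
       Horizontal sum of four 32-bit ints using copy-and-shuffle
       (PSHUFD / PSHUFLW) instead of MOVDQA + PSRLDQ. */
    static inline int hsum_epi32_sse2(__m128i v)
    {
        __m128i hi64  = _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));       /* pshufd: swap 64b halves */
        __m128i sum64 = _mm_add_epi32(v, hi64);
        __m128i hi32  = _mm_shufflelo_epi16(sum64, _MM_SHUFFLE(1, 0, 3, 2)); /* pshuflw: swap the 2 low dwords */
        __m128i sum32 = _mm_add_epi32(sum64, hi32);
        return _mm_cvtsi128_si32(sum32);                                     /* movd the low element */
    }

Compiled for plain SSE2, that should come out as pshufd / paddd / pshuflw /
paddd / movd, with no register-copy instruction anywhere.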
It usually won't make too much difference, but for very small arrays (like 8 `int` elements) the hsum is a big part of the cost, although it's probably still worth auto-vectorizing *if* you can do it efficiently.
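Back to the FP case for a moment: for reference, the recommended sequence
above maps to intrinsics roughly like this.  Again an untested sketch, and
the function name is mine, not anything gcc emits:

    #include <immintrin.h>

    /* Untested sketch; hsum256_ps_avx is my name, not gcc output.
       This is the vextractf128 / vmovshdup / vmovhlps sequence from above. */
    static inline float hsum256_ps_avx(__m256 v)
    {
        __m128 lo   = _mm256_castps256_ps128(v);    /* no instruction, just the low half */
        __m128 hi   = _mm256_extractf128_ps(v, 1);  /* vextractf128 $1 */
        __m128 sum  = _mm_add_ps(lo, hi);           /* narrow to 128b */
        __m128 shuf = _mm_movehdup_ps(sum);         /* vmovshdup: copy high->low in each pair */
        sum  = _mm_add_ps(sum, shuf);
        shuf = _mm_movehl_ps(sum, sum);             /* vmovhlps: duplicate high 64b */
        sum  = _mm_add_ps(sum, shuf);
        return _mm_cvtss_f32(sum);                  /* low element */
    }

Compiled with AVX enabled, that should produce essentially the 6-instruction
sequence shown above.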