https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246
Bug ID: 92246
Summary: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

g++ -O3 -march=skylake-avx512
  (Compiler-Explorer-Build) 10.0.0 20191022 (experimental)

https://godbolt.org/z/LS34w9

...
.L4:
        vmovdqu16       (%rdx), %ymm1
        vmovdqu16       (%rax), %ymm0
        vmovdqa64       %ymm1, %ymm3    # useless copy
        vpermt2w        %ymm1, %ymm2, %ymm3
        vmovdqu16       %ymm3, (%rax)
        vpermt2w        %ymm0, %ymm2, %ymm0
        addq    $32, %rax
        vmovdqu16       %ymm0, (%rcx)
        subq    $32, %rdx
        subq    $32, %rcx   # two tail pointers, PR 92244 is unrelated to this
        cmpq    %rsi, %rax
        jne     .L4

vpermt2w ymm is 3 uops on SKX and CannonLake: 2p5 + p015
(https://www.uops.info/table.html)

Both vpermt2w instructions here use the same data for both table operands, so a single-source shuffle is sufficient. Obviously better would be  vpermw (%rax), %ymm2, %ymm0.

vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus a load if we use a memory source. SKX still bottlenecks on 2p5 for vpermw, losing only the p015 uop, but in general fewer uops is better.

But on CannonLake it runs on p01 + p5 (plus p23 with a memory source).

uops.info doesn't have IceLake-client data yet, but vpermw throughput on IceLake is 1/clock, vs. 1 per 2 clocks for vpermt2w, so this could double throughput on CNL and ICL.

We have exactly the same problem with AVX512VBMI on ICL, where vpermt2b is used instead of vpermb:
g++ -O3 -march=icelake-client -mprefer-vector-width=512
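
To illustrate why the one-source shuffle is enough here, below is a rough scalar sketch (not GCC's code and not real intrinsics) that models the two permutes on one 256-bit vector of 16-bit lanes. The names model_vpermw, model_vpermt2w and reverse_matches are hypothetical helpers made up for this note, and the models assume the lane-selection rule I understand from the ISA reference: the low 4 index bits pick the lane, and for the two-source form bit 4 picks the table. With reverse indices 15..0 and the same data in both tables, the two models should give identical results.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16 };  /* 16-bit lanes in one 256-bit ymm register */

/* Hypothetical scalar model of the one-source permute (vpermw):
   dst[i] = src[idx[i] mod 16]. */
static void model_vpermw(int16_t dst[LANES], const uint16_t idx[LANES],
                         const int16_t src[LANES])
{
    int16_t tmp[LANES];  /* temporary so dst may alias src */
    for (int i = 0; i < LANES; ++i)
        tmp[i] = src[idx[i] & (LANES - 1)];
    memcpy(dst, tmp, sizeof tmp);
}

/* Hypothetical scalar model of the two-source permute (vpermt2w):
   index bit 4 selects table a (0) or table b (1), low 4 bits select the lane. */
static void model_vpermt2w(int16_t dst[LANES], const uint16_t idx[LANES],
                           const int16_t a[LANES], const int16_t b[LANES])
{
    int16_t tmp[LANES];  /* temporary so dst may alias a or b */
    for (int i = 0; i < LANES; ++i) {
        const int16_t *table = (idx[i] & LANES) ? b : a;
        tmp[i] = table[idx[i] & (LANES - 1)];
    }
    memcpy(dst, tmp, sizeof tmp);
}

/* Reverse one vector with both models, using the same data for both tables
   in the two-source case (as in the loop above). Returns 1 if they agree. */
static int reverse_matches(const int16_t v[LANES])
{
    uint16_t idx[LANES];
    int16_t one_src[LANES], two_src[LANES];

    for (int i = 0; i < LANES; ++i)
        idx[i] = (uint16_t)(LANES - 1 - i);  /* 15, 14, ..., 0 */

    model_vpermw(one_src, idx, v);
    model_vpermt2w(two_src, idx, v, v);

    for (int i = 0; i < LANES; ++i)
        assert(one_src[i] == v[LANES - 1 - i]);  /* really a reversal */

    return memcmp(one_src, two_src, sizeof one_src) == 0;
}
```

Calling reverse_matches() on any 16-element array should return 1 under these assumptions, which is the sense in which the second table operand is redundant for this loop.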