[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

peter at cordes dot ca Sun, 27 Oct 2019 17:10:57 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246


            Bug ID: 92246
           Summary: Byte or short array reverse loop auto-vectorized with
                    3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

typedef short swapt;
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

g++ -O3 -march=skylake-avx512
  (Compiler-Explorer-Build) 10.0.0 20191022 (experimental)

https://godbolt.org/z/LS34w9

        ...
.L4:
        vmovdqu16       (%rdx), %ymm1
        vmovdqu16       (%rax), %ymm0
        vmovdqa64       %ymm1, %ymm3        # useless copy
        vpermt2w        %ymm1, %ymm2, %ymm3
        vmovdqu16       %ymm3, (%rax)
        vpermt2w        %ymm0, %ymm2, %ymm0
        addq    $32, %rax
        vmovdqu16       %ymm0, (%rcx)
        subq    $32, %rdx
        subq    $32, %rcx       # two tail pointers, PR 92244 is unrelated to this
        cmpq    %rsi, %rax
        jne     .L4

vpermt2w ymm is 3 uops on SKX and CannonLake:  2p5 + p015
(https://www.uops.info/table.html)

Obviously better would be  vpermw (%rax), %ymm2, %ymm0.
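
To illustrate why the two-table permute is redundant here, below is a minimal scalar sketch in C that models the element selection of the two instructions for 16 word lanes (one ymm register). It is a simplification of the documented semantics, not GCC's code: the function names are hypothetical, and the reversing index vector 15..0 is an assumption, since the report does not show the constant GCC loads into %ymm2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 16  /* number of 16-bit elements in one 256-bit ymm register */

/* Scalar model of vpermw ymm: a single data source, and only the low
   4 bits of each index are used. (Hypothetical helper name.) */
static void model_vpermw(uint16_t dst[LANES], const uint16_t idx[LANES],
                         const uint16_t src[LANES])
{
    uint16_t tmp[LANES];
    for (int i = 0; i < LANES; i++)
        tmp[i] = src[idx[i] & (LANES - 1)];
    memcpy(dst, tmp, sizeof tmp);
}

/* Scalar model of vpermt2w ymm: two tables, where the first table is
   also the destination and gets overwritten. Bit 4 of each index
   selects the table, the low 4 bits select the element. */
static void model_vpermt2w(uint16_t table1_dst[LANES],
                           const uint16_t idx[LANES],
                           const uint16_t table2[LANES])
{
    uint16_t tmp[LANES];
    for (int i = 0; i < LANES; i++) {
        unsigned off = idx[i] & (LANES - 1);
        tmp[i] = (idx[i] & LANES) ? table2[off] : table1_dst[off];
    }
    memcpy(table1_dst, tmp, sizeof tmp);
}

/* Returns 1 if both models produce the same reversed vector. */
int reverse_models_agree(void)
{
    uint16_t idx[LANES], src[LANES], a[LANES], b[LANES];

    for (int i = 0; i < LANES; i++) {
        idx[i] = (uint16_t)(LANES - 1 - i);  /* assumed reversing indices */
        src[i] = (uint16_t)(100 + i);
    }

    model_vpermw(a, idx, src);       /* single-source permute */

    memcpy(b, src, sizeof b);        /* plays the role of the vmovdqa64 copy */
    model_vpermt2w(b, idx, src);     /* both tables hold the same data */

    for (int i = 0; i < LANES; i++)
        assert(a[i] == src[LANES - 1 - i]);  /* vpermw alone reverses */

    return memcmp(a, b, sizeof a) == 0;
}
```

Calling reverse_models_agree() from a test program should return 1. When both tables hold the same data, the table-select bit cannot change the result, so the second table and the register copy that feeds it buy nothing over the single-source form.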

vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus
a load if we use a memory source.  SKX still bottlenecks on 2p5 for vpermw,
losing only the p015 uop, but in general fewer uops is better.

But on CannonLake it runs on p01 + p5 (plus p23 with a memory source).

uops.info doesn't have IceLake-client data yet but vpermw throughput on IceLake
is 1/clock, vs 1 / 2 clocks for vpermt2w, so this could double throughput on
CNL and ICL.

We have exactly the same problem with AVX512VBMI vpermt2b over vpermb with ICL:
g++ -O3 -march=icelake-client -mprefer-vector-width=512
