https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570
Bug ID: 80570
Summary: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

When auto-vectorizing int->double conversion, gcc loads a full-width vector into a register and then unpacks the upper half to feed (v)cvtdq2pd. E.g. with AVX, we get a 256b load and then vextracti128. It's even worse with an unaligned source pointer and -mavx256-split-unaligned-load, where it does vinsertf128 -> vextractf128 without ever doing anything with the full 256b vector!

On Intel SnB-family CPUs, this bottlenecks the loop on port5 throughput, because VCVTDQ2PD reg -> reg needs a port5 uop as well as a port1 uop (and vextracti128 can only run on the shuffle unit on port5). VCVTDQ2PD with a memory source operand doesn't need the shuffle port at all on Intel Haswell and later, just the FP-add unit and a load, so it's a much better choice (throughput of one per clock on Sandybridge and Haswell, 2 per clock on Skylake). It's still 2 fused-domain uops, though (or 3 on SnB), so according to Agner Fog's testing I guess it can't micro-fuse the load.

I'm pretty sure using twice as many half-width memory operands is not worse on other AVX CPUs either (AMD BD-family or Zen, or KNL), vs. max-width loads and extracting the high half.

void cvti32f64_loop(double *dp, int *ip) {
    // ICC avoids the mistake when it doesn't emit a prologue to align the pointers
#ifdef __GNUC__
    dp = __builtin_assume_aligned(dp, 64);
    ip = __builtin_assume_aligned(ip, 64);
#endif
    for (int i = 0; i < 10000; i++) {
        double tmp = ip[i];
        dp[i] = tmp;
    }
}

https://godbolt.org/g/329C3P
gcc.godbolt.org's "gcc7" snapshot: g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental)

gcc -O3 -march=sandybridge:

cvti32f64_loop:
        xorl    %eax, %eax
.L2:
        vmovdqa (%rsi,%rax), %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rdi,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rdi,%rax,2)
        addq    $32, %rax
        cmpq    $40000, %rax
        jne     .L2
        vzeroupper
        ret

gcc does the same thing for -march=haswell, but uses vextracti128. This is obviously really silly.

For comparison, clang 4.0 -O3 -march=sandybridge -fno-unroll-loops emits:

        xorl    %eax, %eax
.LBB0_1:
        vcvtdq2pd       (%rsi,%rax,4), %ymm0
        vmovaps %ymm0, (%rdi,%rax,8)
        addq    $4, %rax
        cmpq    $10000, %rax            # imm = 0x2710
        jne     .LBB0_1
        vzeroupper
        retq

This should come close to one 256b store per clock (on Haswell), even with unrolling disabled.

----

With -march=nehalem, gcc gets away with it for this simple not-unrolled loop (without hurting throughput, I think), but only because this strategy effectively unrolls the loop (doing two stores per add + cmp/jne), and Nehalem can run shuffles on two execution ports (so the pshufd can run on port1 while the cvtdq2pd runs on ports 1+5). So it's 10 fused-domain uops per 2 stores instead of 5 per 1 store. Depending on how the loop buffer handles non-multiple-of-4 uop counts, this might be a wash. (Of course, with any other work in the loop, or with unrolling, the memory-operand strategy is much better.)

CVTDQ2PD's memory operand is only 64 bits, so even the non-AVX version doesn't fault if misaligned.
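For reference, a minimal intrinsics sketch of the codegen this report is asking for: each 128b chunk of ints feeds vcvtdq2pd directly (ideally folded into its memory operand), with no lane-extract shuffle. The function name and the n%4==0 / alignment assumptions are mine, not part of the testcase, and whether the load actually folds into the conversion is up to the compiler:

#include <immintrin.h>
#include <stddef.h>

// Sketch of the desired AVX pattern: a 128b load feeds
// vcvtdq2pd ymm, xmm/m128, so no vextractf128 / shuffle-port uop is needed.
// Assumes n % 4 == 0, 16B-aligned ip, 32B-aligned dp.  Compile with -mavx.
static void cvti32f64_halfwidth(double *dp, const int *ip, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(ip + i)); // 128b load
        __m256d d = _mm256_cvtepi32_pd(v);                     // vcvtdq2pd ymm, xmm/m128
        _mm256_store_pd(dp + i, d);                            // 256b store
    }
}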
------

It's even more horrible without aligned pointers, when the sandybridge version (which splits unaligned 256b loads/stores) uses vinsertf128 to emulate a 256b load, and then does vextractf128 right away:

inner_loop:    # gcc8 -march=sandybridge without __builtin_assume_aligned
        vmovdqu (%r8,%rax), %xmm0
        vinsertf128     $0x1, 16(%r8,%rax), %ymm0, %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rcx,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rcx,%rax,2)

This is obviously really, really bad, and should probably be checked for and avoided in case there are things other than int->double auto-vectorization that could lead to doing this.

---

With -march=skylake-avx512, gcc does the AVX512 version of the same thing: a zmm load and then extracting the upper 256b.
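The same idea extends to the AVX512 case. A hedged sketch (again, the function name and alignment assumptions are mine) of the pattern where a 256b load feeds vcvtdq2pd zmm, ymm/m256 directly, instead of a zmm load followed by extracting the upper 256b:

#include <immintrin.h>
#include <stddef.h>

// Sketch of the analogous AVX512 pattern: a 256b load feeds
// vcvtdq2pd zmm, ymm/m256, avoiding a full 512b load plus a 256b extract.
// Assumes n % 8 == 0, 32B-aligned ip, 64B-aligned dp.  Compile with -mavx512f.
static void cvti32f64_avx512(double *dp, const int *ip, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_load_si256((const __m256i *)(ip + i)); // 256b load
        __m512d d = _mm512_cvtepi32_pd(v);                        // vcvtdq2pd zmm, ymm/m256
        _mm512_store_pd(dp + i, d);                               // 512b store
    }
}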