https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137

            Bug ID: 82137
           Summary: auto-vectorizing shuffles way too much to avoid
                    duplicate work
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

(same code as bug 82136, but with aligned pointers, and discussing the overall
vectorization strategy)

static const int aligned = 1;

void pairs_double(double blocks[]) {
    if(aligned) blocks = __builtin_assume_aligned(blocks, 64);
    for (int i = 0 ; i<10240 ; i+=2) {
        double x = blocks[i];
        double y = blocks[i+1];
        blocks[i] = x*y;
        blocks[i+1] = x+y;
    }
}

https://godbolt.org/g/pTY5iU has this + a uint64_t version (with ^ and &)

i.e. load a pair of 64-bit elements and replace them with 2 different
combinations of the pair.  See
https://stackoverflow.com/questions/46038401/unpack-m128i-m256i-to-m64-mmx-sse2-avx2.

gcc auto-vectorizes this with *way* too much shuffling when AVX / AVX2 are
available, and the same goes for the integer version.  It's especially bad with
AVX1 only, but even the AVX2 code is a mess and will totally bottleneck on
shuffle throughput on Intel and AMD CPUs (Ryzen has good shuffle throughput,
but lane-crossing shuffles are much more expensive than in-lane).

gcc 8.0.8 -O3 -march=haswell

pairs_double:
        leaq    81920(%rdi), %rax
.L2:
        vmovapd (%rdi), %ymm1
        vmovapd 32(%rdi), %ymm0
        addq    $64, %rdi
        vunpcklpd       %ymm0, %ymm1, %ymm2
        vunpckhpd       %ymm0, %ymm1, %ymm0
        vpermpd $216, %ymm2, %ymm2
        vpermpd $216, %ymm0, %ymm0
        vmulpd  %ymm2, %ymm0, %ymm1
        vaddpd  %ymm2, %ymm0, %ymm0
        vpermpd $68, %ymm0, %ymm3
        vpermpd $238, %ymm0, %ymm0
        vpermpd $68, %ymm1, %ymm2
        vpermpd $238, %ymm1, %ymm1
        vshufpd $12, %ymm3, %ymm2, %ymm2
        vshufpd $12, %ymm0, %ymm1, %ymm0
        vmovapd %ymm2, -64(%rdi)
        vmovapd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret

I haven't worked through the constants to see how it's shuffling, but the same
function with uint64_t (and the ^ and & operators) uses 4 shuffles per 2
vectors after the actual work:

        # vpand / vpxor in ymm0 / ymm1
        vpunpcklqdq     %ymm0, %ymm1, %ymm2
        vpunpckhqdq     %ymm0, %ymm1, %ymm0
        vperm2i128      $32, %ymm0, %ymm2, %ymm1
        vperm2i128      $49, %ymm0, %ymm2, %ymm0
        vmovdqa %ymm1, -64(%rdi)
        vmovdqa %ymm0, -32(%rdi)

The integer version uses equivalent shuffles to prepare the inputs for
vpand/vpxor, so vunpcklpd/vunpckhpd + 2x vperm2f128 should be able to do the
same shuffles for the FP version.  Using split stores would help a lot on Intel
and AMD: 2x vmovdqa xmm for the low halves, then 2x vextractf128 to memory for
the high halves (see the sketch below).  On SnB/HSW/SKL, vextractf128 to memory
is 2 fused-domain uops (Agner Fog's tables) but there's no ALU uop; it's just a
non-micro-fused store.  Since gcc's code horribly bottlenecks on shuffles,
split stores would be a win here.

It's a huge win on Ryzen, where vperm2f128 is 8 uops with 3c throughput. 
(-march=znver1 enables -mprefer-avx128, but -march=haswell running on Ryzen is
probably a relevant use-case.  When all else is equal for Intel, do something
that doesn't suck on Ryzen.)
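
For reference, here's a minimal intrinsics sketch of the split-store idea for
the uint64_t version (the function name and operand order are mine, and it
assumes the ^ and & results are already de-interleaved into one vector of each,
the way gcc has them before its merge shuffles):

#include <immintrin.h>
#include <stdint.h>

/* Hypothetical helper: interleave the de-interleaved ^ / & result vectors
 * (4 pairs' worth) and store them as 128-bit halves instead of using 2x
 * vperm2i128.  Assumes dst is at least 16-byte aligned (the testcase
 * promises 64), and that the ^ result goes in the even element of each pair. */
static inline void store_pairs_split(uint64_t *dst, __m256i vxor, __m256i vand)
{
    __m256i lo = _mm256_unpacklo_epi64(vxor, vand); /* x0^y0 x0&y0 | x2^y2 x2&y2 */
    __m256i hi = _mm256_unpackhi_epi64(vxor, vand); /* x1^y1 x1&y1 | x3^y3 x3&y3 */

    /* split 128-bit stores put the lanes back in order with no
     * lane-crossing shuffle uops */
    _mm_store_si128((__m128i*)(dst + 0), _mm256_castsi256_si128(lo));
    _mm_store_si128((__m128i*)(dst + 2), _mm256_castsi256_si128(hi));
    _mm_store_si128((__m128i*)(dst + 4), _mm256_extracti128_si256(lo, 1));
    _mm_store_si128((__m128i*)(dst + 6), _mm256_extracti128_si256(hi, 1));
}

That's 2 in-lane shuffles + 4 stores per 64 bytes of output instead of 2
in-lane shuffles + 2x vperm2i128 + 2 stores, i.e. it trades lane-crossing
shuffle uops for extra store uops.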

----

However, a different strategy entirely is *MUCH* better for most CPUs, and as a
bonus it only needs AVX1 and no lane-crossing shuffles:

Do far less shuffling in exchange for doing more mul/add (or whatever), by
allowing duplicates.  Doing the same operation twice is allowed, even when it
can raise FP exceptions (in C and C++).

    b0     b1       |    b2       b3       # load 256b
    b1     b0       |    b3       b2       # vshufpd or vpermilpd for in-lane swap

    b0*b1  b1*b0    |    b2*b3    b3*b2    # vmulpd
    b0+b1  b1+b0    |    b2+b3    b3+b2    # vaddpd

    b0*b1  b0+b1    |    b2*b3    b2+b3    # vblendpd
                                           # store 256b

Per two 128b pairs, that's 1 load, 1 shuffle, 2x FP math, 1 immediate-blend, 1
store.  No lane-crossing shuffles is a huge plus for Ryzen (vperm2f128 is very
slow).
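
As a sketch of the asm I'd hope for, here's that strategy written with
intrinsics (the function name is mine; it assumes a 32-byte-aligned pointer and
a multiple-of-4 element count, like the testcase):

#include <immintrin.h>

void pairs_double_swap(double *blocks, long n)
{
    for (long i = 0; i < n; i += 4) {
        __m256d v    = _mm256_load_pd(blocks + i);     /* b0 b1 | b2 b3 */
        __m256d swap = _mm256_permute_pd(v, 0x5);      /* b1 b0 | b3 b2 (vpermilpd, in-lane) */
        __m256d mul  = _mm256_mul_pd(v, swap);         /* b0*b1 b1*b0 | b2*b3 b3*b2 */
        __m256d add  = _mm256_add_pd(v, swap);         /* b0+b1 b1+b0 | b2+b3 b3+b2 */
        /* immediate blend takes the odd elements from the add result */
        __m256d res  = _mm256_blend_pd(mul, add, 0xA); /* b0*b1 b0+b1 | b2*b3 b2+b3 */
        _mm256_store_pd(blocks + i, res);
    }
}

That's exactly the 1 load / 1 in-lane shuffle / 2x FP math / 1 blend / 1 store
per 256b vector counted above.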

That's 6 fused-domain uops on Intel, so with the front-end issuing at most 4
fused-domain uops per clock it should easily run at 1 iter per 1.5 cycles (+
loop overhead), without bottlenecking on any ports.  The bottleneck is the
front-end, so unrolling helps.

That's 0.75c per 128b pair.  (Slower on SnB/IvB, bottlenecking on load/store
throughput.)

Immediate-blend can run on any port on HSW+, or p0/p5 on SnB (and good
throughput on BD/Ryzen) so it's much more throughput-friendly than using
vunpcklpd to mix the mul and add result vectors.  If the operation hadn't been
commutative (e.g. sub instead of add), you might have to use vpunpck.
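
For the non-commutative case, a hypothetical x*y / x-y sketch (again my code,
not something gcc emits): a blend would pick up b1-b0 in the odd elements, so
vunpcklpd grabs element 0 of each lane from both results instead.

#include <immintrin.h>

void pairs_mul_sub(double *blocks, long n)
{
    for (long i = 0; i < n; i += 4) {
        __m256d v    = _mm256_load_pd(blocks + i);   /* b0 b1 | b2 b3 */
        __m256d swap = _mm256_permute_pd(v, 0x5);    /* b1 b0 | b3 b2 */
        __m256d mul  = _mm256_mul_pd(v, swap);
        __m256d sub  = _mm256_sub_pd(v, swap);       /* b0-b1 b1-b0 | b2-b3 b3-b2 */
        /* vunpcklpd: mul[0] sub[0] | mul[2] sub[2] = b0*b1 b0-b1 | b2*b3 b2-b3 */
        _mm256_store_pd(blocks + i, _mm256_unpacklo_pd(mul, sub));
    }
}

vunpcklpd is a port-5 shuffle on Intel, so that costs some of the shuffle
throughput the blend version avoided.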

This same strategy is also optimal for integer booleans with AVX2.  (Or
possibly even with AVX1, but using VXORPS / VANDPS will bottleneck on port 5 on
SnB through Broadwell, since it wasn't until Skylake that FP booleans could run
on p015.)
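
The AVX2 integer version of the same idea, as a sketch (my code and naming;
which of ^ / & lands in the even vs. odd element depends on the source and
doesn't change the shuffle count):

#include <immintrin.h>
#include <stdint.h>

void pairs_u64(uint64_t *blocks, long n)
{
    for (long i = 0; i < n; i += 4) {
        __m256i v    = _mm256_load_si256((const __m256i*)(blocks + i));
        __m256i swap = _mm256_shuffle_epi32(v, 0x4E);  /* vpshufd: swap the qwords in each lane */
        __m256i vx   = _mm256_xor_si256(v, swap);      /* x^y duplicated across each pair */
        __m256i va   = _mm256_and_si256(v, swap);      /* x&y likewise */
        /* vpblendd: even qwords from the ^ result, odd qwords from the & result */
        __m256i res  = _mm256_blend_epi32(vx, va, 0xCC);
        _mm256_store_si256((__m256i*)(blocks + i), res);
    }
}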
