https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68928
            Bug ID: 68928
           Summary: AVX loops on unaligned arrays could generate more
                    efficient startup/cleanup code when peeling
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86-64-*-*

I have some suggestions for better code that gcc could use for the
prologue/epilogue when vectorizing loops over unaligned buffers.  I haven't
looked at gcc's code, just the output, so IDK how one might get gcc to
implement these.

---------

Consider the following code:

#include <immintrin.h>

typedef float float_align32 __attribute__ ((aligned (32)));
void floatmul_aligned(float_align32 *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

void floatmul(float *a) {
  for (int i=0; i<1024 ; i++)
    a[i] *= 2;
}

g++ 5.3.0 -O3 -march=sandybridge emits what you'd expect for the aligned
version:

floatmul_aligned(float*):
        leaq    4096(%rdi), %rax
.L2:
        vmovaps (%rdi), %ymm0
        addq    $32, %rdi
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L2
        vzeroupper
        ret

*** off-topic ***

It unfortunately uses 5 uops in the loop, meaning it can only issue one
iteration per 2 clocks.  Other than unrolling, it would probably be more
efficient to broadcast 2.0f into %ymm1 and use vmulps (%rdi), %ymm1, %ymm0,
avoiding the separate load.  Doing the loop in reverse order, with an indexed
addressing mode counting an index down to zero, would also keep the loop
overhead down to one decrement-and-branch uop.  I know compilers are allowed
to re-order memory accesses, so I assume this would be allowed.  However,
this wouldn't actually help on Sandybridge, since it seems that two-register
addressing modes might not micro-fuse on SnB-family CPUs
(http://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes;
Agner Fog says he tested and found that 2-reg addressing modes do micro-fuse,
and he's probably right, but IDK what's wrong with my experiment using perf
counters).  That would make the store 2 uops.

*** back on topic ***

Anyway, that wasn't even what I meant to report.  The unaligned case peels
off the potentially-unaligned start/end iterations and unrolls them into a
giant amount of code.  This is unlikely to be optimal outside of
microbenchmarks, since CPUs with a uop cache suffer from excessive unrolling.

floatmul(float*):
        movq    %rdi, %rax
        andl    $31, %eax
        shrq    $2, %rax
        negq    %rax
        andl    $7, %eax
        je      .L12
        vmovss  (%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, (%rdi)
        cmpl    $1, %eax
        je      .L13
        vmovss  4(%rdi), %xmm0
        vaddss  %xmm0, %xmm0, %xmm0
        vmovss  %xmm0, 4(%rdi)
        cmpl    $2, %eax
        je      .L14
        vmovss  8(%rdi), %xmm0
        ... repeated up to cmpl $6, %eax ...
        ... some loop setup ...
.L9:
        vmovaps (%rcx,%rax), %ymm0
        addl    $1, %edx
        vaddps  %ymm0, %ymm0, %ymm0
        vmovaps %ymm0, (%rcx,%rax)
        addq    $32, %rax
        cmpl    %esi, %edx
        jb      .L9
        ... another fully-unrolled up-to-7-iteration cleanup loop ...

Notice that the vectorized part of the loop now has 6 uops (or 7, if the
store can't micro-fuse), so gcc is even farther from getting this loop to
run at one cycle per iteration.  (That should be possible on Haswell; on
SnB/IvB (and AMD Bulldozer-family), a 256b store takes two cycles anyway.)
Is there any experimental evidence that fully unrolling to make this much
code is beneficial?

The most obvious way to improve on this would be to use a 128b xmm vector
for the first 4 iterations of the prologue/epilogue loops.  Even simply not
unrolling the 7-iteration alignment loops might be a win.
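To make that suggestion concrete, here is a rough sketch in source form (my
own illustrative code, not anything gcc emits): the peel count is computed
the same way as the andl/shrq/negq/andl sequence above, one unaligned 128b
operation covers 4 of the peeled elements when there are at least 4, and at
most 3 scalar iterations remain.

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Sketch only: a prologue that peels with one unaligned 128b op plus at
       most 3 scalar iterations, instead of a fully-unrolled 7-iteration
       chain.  The peel-count math mirrors gcc's prologue above; everything
       else (name, structure) is illustrative. */
    static float *peel_prologue(float *a)
    {
        size_t peel = (-((uintptr_t)a >> 2)) & 7;  /* floats until a is 32B-aligned */
        size_t i = 0;
        if (peel >= 4) {                 /* one unaligned 128b op handles 4 of them */
            __m128 v = _mm_loadu_ps(a);
            _mm_storeu_ps(a, _mm_add_ps(v, v));
            i = 4;
        }
        for (; i < peel; i++)            /* at most 3 scalar iterations remain */
            a[i] *= 2;
        return a + peel;                 /* now 32B-aligned; main loop starts here */
    }

Whether this actually beats the fully-unrolled version would need measuring;
the point is just that the long branchy unrolled chain isn't the only option.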
Every unrolled iteration still has a compare-and-branch, so a rolled loop
counting down to zero would have the same per-iteration overhead.  All that
changes is branch prediction (one taken branch and many not-taken, vs. a
single loop branch taken n times).

AVX introduces a completely different way to handle this, though: VMASKMOVPS
is usable now, since it doesn't have the non-temporal hint that makes the
SSE version of it nearly useless.  According to Agner Fog's insn tables,
vpmaskmov %ymm, %ymm, m256 is only 4 uops, and has a throughput of one per 2
cycles (SnB/IvB/Haswell).  It's quite slow (as a store) on AMD
Bulldozer-family CPUs, though, so this might only be appropriate with
-mtune= targets other than AMD.

The trouble is turning a misalignment count into a mask.  Most of the useful
instructions (like PSRLDQ to use on a vector of all-ones) are only available
with immediate counts.  Keeping an array of 7 256b masks seems like a big
waste, and having a switch() to run the byte-shift instruction with one of 7
different immediate operands also sucks.  (And it doesn't work anyway,
because the 256b version shifts within each 128b lane rather than across the
whole register.)  PSLLQ can take a shift count in the low qword of an xmm
register, but I'm not sure it helps.

My best idea for generating a mask for VMASKMOVPS requires AVX2: broadcast
the misalignment count to all bytes of a ymm register (VPBROADCASTB), then
use VPCMPGTB with another 256b constant (LSB first):

  { 7,7,7,7, 6,6,6,6, 5,5,5,5, 4,4,4,4, 3,3,3,3, 2,2,2,2, 1,1,1,1, 0,0,0,0 }

The least significant 4B will always be 0, since the count is in the 0-7
range and 7>7 is false.  (Or put the count into the 1-8 range so we do 32B
of useful work in the aligned case, instead of zero, without branching.)

The constant could shrink to 64b {7,6,5,4,3,2,1,0}, with an extra
instruction:

    vmovd        %eax, %xmm0      # %eax = misalignment count
    vpbroadcastb %xmm0, %xmm0     # broadcast the low byte to 128b
    vpcmpgtb     .constant, %xmm0, %xmm1
        # make sure the 64b .constant is at least 128b from the end of a cache line
        # Our mask is in the low 8 bytes of %xmm1.  The upper 64b is garbage.
    vpmovsxbd    %xmm1, %ymm1     # all-ones sign-extends to all-ones

Or if reading "garbage" is too hacky, expand to 256b sooner:

    vmovd        %eax, %xmm0      # %eax = misalignment count
    vpbroadcastd %xmm0, %ymm0     # broadcast the low 32b
    vpmovzxbd    .constant, %ymm1 # 2 uops, unlike the other forms
    vpcmpgtd     %ymm1, %ymm0, %ymm2

Haswell, first version (latency / reciprocal throughput in cycles):
    vmovd:             1 uop (p5),  1/1
    vpbroadcastb x,x:  1 uop (p5),  1/1
    vpcmpgtb x,mem:    1 uop (p15), 1/0.5
    vpmovsxbd y,x:     1 uop (p5),  3/1
  total: 4 uops, 6c latency, 1 per 2c throughput (bottleneck on p5).

Second version:
    vmovd:               1 uop,            1/1
    vpbroadcastd y,x:    1 uop,            3/1
    vpcmpgtd y,y:        1 uop,            1/0.5
    vpmovzxbd ymm, m64:  2 uops (p5+p23),  ?/1
  total: 5 uops, 5c latency, 1 per 2c throughput (saturating p5).  The movzx
  is off the critical path, but can't micro-fuse.

Without AVX2:

    #### untested / possibly buggy
    imul        $0x1010101, %edx, %eax   # broadcast the misalignment count to the 4 bytes of %eax
    vmovd       %eax, %xmm0
    vpshufd     $0, %xmm0, %xmm0         # broadcast to all bytes of xmm0
    vpcmpgtb    .constant2, %xmm0, %xmm1 # constant2 = .byte 0,0, 1,1, ...
    vpmovsxwd   %xmm1, %xmm2             # [32b:count>0, 32b:count>1, ...]
    vpunpckhwd  %xmm1, %xmm1, %xmm3      # [32b:count>4, 32b:count>5, ...]
    vinsertf128 $1, %xmm3, %ymm2, %ymm2  # combine the masks: low half from xmm2, high half from xmm3

    # do a masked load, too, to avoid possible NaN slowdowns if there's garbage before the array
    vmaskmovps  (%rdi), %ymm2, %ymm3     # rdi = p & ~0x1F: rounded down to previous 32B boundary
    vaddps      %ymm3, %ymm3, %ymm3
    vmaskmovps  %ymm3, %ymm2, (%rdi)
mainloop:
    ...
    # epilogue: invert the mask
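To summarize the masked-prologue idea in one place, here is a rough
intrinsics rendering at dword granularity (my own sketch, untested; it
assumes AVX2 for the broadcast/compare and plain AVX for vmaskmovps, and
uses the 1-8 count range suggested above so the aligned case does a full
32B of useful work):

    #include <immintrin.h>
    #include <stdint.h>

    /* Sketch of the masked prologue (untested, names illustrative).  The
       top `count` lanes of the 32B chunk containing a[0] are the ones that
       belong to the array, so those are the lanes the compare enables. */
    static float *masked_prologue(float *a)
    {
        float *chunk = (float *)((uintptr_t)a & ~(uintptr_t)31); /* round down to 32B */
        int count = 8 - (int)(a - chunk);            /* 1..8 useful lanes (8 if aligned) */
        const __m256i thresh = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(count), thresh);
                                                     /* lane i enabled iff count > 7-i */
        __m256 v = _mm256_maskload_ps(chunk, mask);  /* masked load: don't touch lanes before the array */
        _mm256_maskstore_ps(chunk, mask, _mm256_add_ps(v, v));
        return chunk + 8;                            /* aligned start for the main loop */
    }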
Going back to the non-AVX2 sequence: an alternative is to skip the
vinsertf128 and do two separate vaddps xmm / vmaskmovps pairs, but that's
probably worse.  The addresses used with vmaskmovps will be aligned: it's
the masking that takes care of the unaligned accesses.

Also, we could broadcast the count just to words and use vpcmpgtw.  We're
using words rather than bytes because there's no single instruction to
pmovsxbd from the 2nd 32b chunk of a source register (to unpack the high
half of the mask); punpckh same,same can read from the high half of a
register and double the element size, though.  An extra pmovsxbw or
punpcklbw would let us use a 64b constant.

Anyway, IDK if this idea is generally useful for gcc to handle arrays that
aren't guaranteed to be aligned.  Things should probably be arranged so that
in the aligned case, either the mask generation and vmaskmovps are skipped
altogether, or the vmaskmovps does a full 256b load/store rather than a
fully-masked 0-byte load/store.  Besides the obvious reason of avoiding
wasted work, AMD Jaguar's VMASKMOVPS takes ~300 clocks for a load with
mask=0, vs. 2 clocks (15 cycle latency) in the normal case.  A VMASKMOVPS
256b store on Jaguar has one per 22c throughput and takes 36 m-ops, so it's
not worth using at all when targeting Jaguar, but avoiding Jaguar's
catastrophic mask=0 case is probably worth doing even when tuning for
something else.
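For completeness, a matching epilogue sketch (again my own, untested) that
enables only the low `left` lanes of the final partial 32B chunk, and
branches around the masked ops entirely when there is nothing left, so the
mask=0 case never reaches the hardware:

    #include <immintrin.h>

    /* Sketch of the corresponding epilogue (untested, names illustrative):
       the last partial 32B chunk keeps only its low `left` lanes.  Skipping
       left == 0 avoids both the wasted work and Jaguar's ~300-clock
       all-zero-mask load. */
    static void masked_epilogue(float *chunk, int left)  /* left = leftover floats, 0..7 */
    {
        if (left == 0)
            return;
        const __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
        __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(left), lane); /* lane i on iff i < left */
        __m256 v = _mm256_maskload_ps(chunk, mask);
        _mm256_maskstore_ps(chunk, mask, _mm256_add_ps(v, v));
    }

The early-out also covers the case where the trip count is a multiple of 8
and the epilogue would otherwise do zero useful work.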