https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82369

            Bug ID: 82369
           Summary: "optimizes" indexed addressing back into two pointer
                    increments
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

gcc defeats this attempt to get it to reduce the front-end bottleneck in this
loop (simplified from a version of the loop in pr82356).

Indexing src relative to dst (i.e. reaching src as dst + (src-dst)) is easy to do
in C when both pointers advance at the same rate, and works well.  But when one
pointer advances faster than the other, it's very clunky to express in C.
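The easy same-rate case is something like this minimal sketch (made-up function
name, same headers as the function below):

// Sketch of the same-rate case: walk only dst and reach src through a
// constant offset, so one register serves as the index for both.
// (Strictly, subtracting unrelated pointers is UB in C, which is why the
// real example below goes through uintptr_t instead.)
static void copy_bytes_same_rate(uint8_t *dst, const uint8_t *src, size_t n)
{
  ptrdiff_t off = src - dst;          // constant for the whole loop
  for (size_t i = 0; i < n; i++)
     dst[i] = dst[off + i];           // dst[off + i] is src[i]
}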

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// index src relative to dst, but use a pointer-increment for dst
// so the store still has a simple addressing mode (and can run on port7)
// gcc and clang "optimize" back to two separate pointers, but ICC13 leaves it alone.
// Saves one ADD instruction in the loop.
void pack_high8_indexed_src(uint8_t *restrict dst, const uint16_t *restrict src,
                            size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;

  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do{
     // dstu*2 + src_dst_offset == original srcu + 2*(bytes already stored),
     // i.e. the current src position, using dstu as the only loop counter.
     __m128i v0 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset));
     __m128i v1 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset)+1);
     __m128i res = _mm_packus_epi16(v1,v0);

     _mm_storeu_si128((__m128i*)dstu, res);
     dstu += 16;
     //src += 16;  // 32 bytes
  } while(dstu < end_dst);
}

https://godbolt.org/g/pycLQC
gcc -O3 -mtune=skylake  de-optimizes it to this:

pack_high8_indexed_src:       # gcc and clang do this:
        addq    %rdi, %rdx
.L2:
        movdqu  16(%rsi), %xmm0
        movdqu  (%rsi), %xmm1
        addq    $16, %rdi
        addq    $32, %rsi                # 2 separate pointer increments
        packuswb        %xmm1, %xmm0
        movups  %xmm0, -16(%rdi)
        cmpq    %rdi, %rdx
        ja      .L2
        ret

Intel SnB-family: 7 fused-domain uops.  (The store micro-fuses, and the cmp/ja
macro-fuses).  In theory, this bottlenecks on front-end throughput (4 uops per
clock), running at 1 iter per 1.75 cycles.  The store uses a simple addressing
mode, so its store-address uop can run on port7.  If not for the front-end
bottleneck, the back-end could run this at nearly 1 per clock.

ICC13/16/17 compiles it the way I was hoping to hand-hold gcc into doing, to 6
fused-domain uops, and should run 1 iter per 1.5 clocks on SnB/HSW/SKL.  This
might also be good on Silvermont, since it's fewer instructions.

Possibly a similar benefit on K10 / BD, although AMD would benefit more from
simple array indexing (one counter indexing both src and dst), since indexed
addressing modes for stores aren't any worse there AFAIK.  But -mtune=bdver2
doesn't do that.
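By "simple array indexing" I mean a source shape like the sketch below (a
hand-written variant for illustration, reusing the headers from the example
above):

// Hypothetical array-indexed variant: one element counter addresses both src
// and dst, so the loop overhead is a single increment, but every load and the
// store use an indexed addressing mode (e.g. (%rsi,%rax,2) / (%rdi,%rax)).
void pack_high8_array_index(uint8_t *restrict dst, const uint16_t *restrict src,
                            size_t bytes) {
  for (size_t i = 0; i < bytes; i += 16) {       // 16 dst bytes, 32 src bytes per iter
     __m128i v0 = _mm_loadu_si128((const __m128i*)(src + i));
     __m128i v1 = _mm_loadu_si128((const __m128i*)(src + i) + 1);
     _mm_storeu_si128((__m128i*)(dst + i), _mm_packus_epi16(v1, v0));
  }
}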

pack_high8_indexed_src:               # ICC17
        lea       (%rdi,%rdi), %rax
        negq      %rax
        addq      %rdi, %rdx
        addq      %rax, %rsi
..B1.2:
        movdqu    16(%rsi,%rdi,2), %xmm1           # src indexed via dst*2
        movdqu    (%rsi,%rdi,2), %xmm0
        packuswb  %xmm0, %xmm1
        movdqu    %xmm1, (%rdi)                    # dst with a simple addressing mode
        addq      $16, %rdi                        # 16B of dst, 32B of src
        cmpq      %rdx, %rdi
        jb        ..B1.2
        ret

A mov-load with a complex addressing mode is a single uop on all CPUs.  It
might have 1c higher latency than a simple addressing mode, but that doesn't
matter when the address math is off the critical path.

With unrolling, the actual work is only 4 fused-domain uops for 2x load + pack
+ store, so the front-end can just barely keep the back-end fed with infinite
unrolling.  For any sane unroll factor, saving 1 uop of loop overhead is a
slight win.

A store with an indexed addressing-mode can't run on port7 on Haswell/Skylake. 
With any unrolling, that would become a bottleneck.  On SnB/IvB, indexed stores
are un-laminated into 2 fused-domain uops, so simple array-indexing gets worse
with unrolling.


BTW, with an indexed store, we could count a negative index up towards zero. 
That would avoid the CMP, since the loop overhead could be just a single
macro-fused uop: add $16, %rdx / jnc.  (But only SnB-family macro-fuses
add/jcc; AMD and Core2/Nehalem only macro-fuse test/cmp with jcc.)  Even on a
CPU that doesn't macro-fuse at all (e.g. Silvermont / KNL), dropping the CMP
still saves an instruction, so it's a win there too.
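In source, that could look roughly like the sketch below (my naming; whether a
given gcc version actually turns the i < 0 test into add/jnc or add/js isn't
guaranteed):

// Sketch: count a negative offset up toward zero.  dst/src are advanced to
// their ends, and the loop exits when the offset reaches 0, so no separate
// CMP is needed -- just the ADD and a conditional branch on its flags.
void pack_high8_negative_index(uint8_t *restrict dst, const uint16_t *restrict src,
                               size_t bytes) {
  dst += bytes;                      // one past the end of dst
  src += bytes;                      // bytes uint16_t elements = one past the end of src
  ptrdiff_t i = -(ptrdiff_t)bytes;
  do {
     __m128i v0 = _mm_loadu_si128((const __m128i*)(src + i));
     __m128i v1 = _mm_loadu_si128((const __m128i*)(src + i) + 1);
     _mm_storeu_si128((__m128i*)(dst + i), _mm_packus_epi16(v1, v0));   // indexed store
     i += 16;                        // 16 dst bytes, 32 src bytes
  } while (i < 0);
}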

---

BTW, with AVX, micro-fused loads are un-laminated on Haswell/Skylake.  e.g.

        vmovdqu   16(%rsi,%rdi,2), %xmm0
        vpackuswb (%rsi,%rdi,2), %xmm0, %xmm1
        vmovdqu   %xmm1, (%rdi)

is 3 fused-domain uops in the decoders/uop cache, but it's 4 fused-domain uops
at the issue/rename stage and in the ROB.  The vpackuswb un-laminates.
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes#comment76198723_31027695

So if unrolling with AVX, it's better to do what gcc does and increment 2
separate pointers.  Then we can actually keep the back-end fed and bottleneck
on load throughput, store throughput, and shuffle throughput.  gcc can unroll
this loop (but clang can't, maybe because it's confused by the
integers-as-pointers trick).
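"Increment 2 separate pointers" here means roughly the obvious source below
(which is about what gcc rewrites the uintptr_t version into anyway); every
addressing mode is one register plus a displacement:

// Plain two-pointer version: one extra ADD of loop overhead, but simple
// addressing modes for the loads and the store.
void pack_high8_two_pointers(uint8_t *restrict dst, const uint16_t *restrict src,
                             size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do {
     __m128i v0 = _mm_loadu_si128((const __m128i*)src);
     __m128i v1 = _mm_loadu_si128((const __m128i*)src + 1);
     _mm_storeu_si128((__m128i*)dst, _mm_packus_epi16(v1, v0));
     dst += 16;
     src += 16;                      // 32 bytes
  } while (dst < end_dst);
}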

packuswb (%rsi,%rdi,2), %xmm0  could stay micro-fused, because it's a 2-operand
instruction with a read-modify destination (not write-only like pabsb).  But we
can't use it because it requires alignment.  (Of course, with load instead of
loadu, this indexing trick would still be a win.)
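For illustration, an aligned variant might look like the sketch below
(hypothetical; whether a compiler actually folds the aligned load into
packuswb's memory operand isn't guaranteed):

// Hypothetical aligned variant: the second pack operand is an aligned load,
// which the compiler *may* fold into packuswb (mem), xmm; that can stay
// micro-fused even with an indexed addressing mode.  Assumes src is
// 16B-aligned, with the same size preconditions as the original loop.
void pack_high8_indexed_src_aligned(uint8_t *restrict dst,
                                    const uint16_t *restrict src, size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;
  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do {
     __m128i v1 = _mm_load_si128((const __m128i*)(dstu*2+src_dst_offset)+1);
     __m128i res = _mm_packus_epi16(v1,
                     _mm_load_si128((const __m128i*)(dstu*2+src_dst_offset)));
     _mm_storeu_si128((__m128i*)dstu, res);
     dstu += 16;
  } while (dstu < end_dst);
}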
