https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82369
Bug ID: 82369
Summary: "optimizes" indexed addressing back into two pointer increments
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

gcc defeats this attempt to get it to reduce the front-end bottleneck in this
loop (simplified from a version of the loop in pr82356).

Indexing src relative to dst (computing the offset src - dst once, outside the
loop) is easy to do in C and works well when both pointers advance at the same
rate.  But when one pointer advances faster than the other, as here (src moves
2 bytes per 1 byte of dst), it's very clunky to express in C.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// index src relative to dst, but use a pointer-increment for dst
// so the store still has a simple addressing mode (and can run on port7).
// gcc and clang "optimize" back to two separate pointers, but ICC13 leaves it alone.
// Saves one ADD instruction in the loop.
void pack_high8_indexed_src(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;
  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do{
     __m128i v0 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset));
     __m128i v1 = _mm_loadu_si128((__m128i*)(dstu*2+src_dst_offset)+1);
     __m128i res = _mm_packus_epi16(v1,v0);
     _mm_storeu_si128((__m128i*)dstu, res);
     dstu += 16;     //src += 16;   // 32 bytes
  } while(dstu < end_dst);
}

https://godbolt.org/g/pycLQC

gcc -O3 -mtune=skylake de-optimizes it to this:

pack_high8_indexed_src:        # gcc and clang do this:
        addq    %rdi, %rdx
.L2:
        movdqu  16(%rsi), %xmm0
        movdqu  (%rsi), %xmm1
        addq    $16, %rdi
        addq    $32, %rsi        # 2 separate pointer increments
        packuswb        %xmm1, %xmm0
        movups  %xmm0, -16(%rdi)
        cmpq    %rdi, %rdx
        ja      .L2
        ret

Intel SnB-family: 7 fused-domain uops.  (The store micro-fuses, and the cmp/ja
macro-fuses.)  In theory this bottlenecks on front-end throughput (4 uops per
clock), running at 1 iteration per 1.75 cycles.  The store uses a simple
addressing mode, so its store-address uop can run on port7.  If not for the
front-end bottleneck, the back-end could run this at nearly 1 iteration per
clock.

ICC13/16/17 compile it the way I was hoping to hand-hold gcc into doing: 6
fused-domain uops, which should run at 1 iteration per 1.5 clocks on
SnB/HSW/SKL.  This might also be good on Silvermont, since it's fewer
instructions.  Possibly a similar benefit on K10 / BD, although AMD would
benefit just as much from plain array indexing, because indexed addressing
modes for stores aren't worse there AFAIK.  (But -mtune=bdver2 doesn't do that
either.)

pack_high8_indexed_src:                 # ICC17
        lea       (%rdi,%rdi), %rax
        negq      %rax
        addq      %rdi, %rdx
        addq      %rax, %rsi
..B1.2:
        movdqu    16(%rsi,%rdi,2), %xmm1    # src indexed via dst*2
        movdqu    (%rsi,%rdi,2), %xmm0
        packuswb  %xmm0, %xmm1
        movdqu    %xmm1, (%rdi)             # dst with a simple addressing mode
        addq      $16, %rdi                 # 16B of dst, 32B of src
        cmpq      %rdx, %rdi
        jb        ..B1.2
        ret

A mov-load with a complex addressing mode is a single uop on all CPUs.  It
might have 1c higher latency than a simple addressing mode, but that doesn't
matter when the address math is off the critical path.

With unrolling, the actual work is only 4 fused-domain uops for 2x load + pack
+ store, so the front-end can only just barely keep the back-end fed even with
infinite unrolling.  For any sane unroll factor, saving 1 uop of loop overhead
is a slight win.

A store with an indexed addressing mode can't run on port7 on Haswell/Skylake,
so with any unrolling that would become a bottleneck.
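For illustration only (not from the report): a minimal sketch of what an
unroll-by-2 version in the same style might look like, with a made-up function
name and assuming bytes is a multiple of 32.  The loads keep the
dstu*2 + offset indexing, and both stores still address dst with just a base
register plus a small displacement, so their store-address uops can still run
on port7.

// Hypothetical sketch (same headers as above); assumes bytes % 32 == 0.
void pack_high8_indexed_src_unroll2(uint8_t *restrict dst,
                                    const uint16_t *restrict src, size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;
  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do{
     const __m128i *s = (const __m128i*)(dstu*2 + src_dst_offset);
     __m128i res0 = _mm_packus_epi16(_mm_loadu_si128(s+1), _mm_loadu_si128(s));
     __m128i res1 = _mm_packus_epi16(_mm_loadu_si128(s+3), _mm_loadu_si128(s+2));
     _mm_storeu_si128((__m128i*)dstu,     res0);   // store addr = base only
     _mm_storeu_si128((__m128i*)dstu + 1, res1);   // store addr = base + disp8
     dstu += 32;      // 32 bytes of dst, 64 bytes of src per iteration
  } while(dstu < end_dst);
}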
On SnB/IvB, indexed stores are un-laminated into 2 fused-domain uops, so plain
array indexing gets worse with unrolling there, too.

BTW, with an indexed store we could count a negative index up towards zero.
That would avoid the CMP, since the loop overhead could be just a single
macro-fused uop: add $16, %rdx / jnc.  (But only SnB-family macro-fuses
add/jcc; AMD and Core2/Nehalem only macro-fuse test/cmp.)  It's also good on a
CPU that doesn't macro-fuse at all (e.g. Silvermont / KNL), since it still
saves an instruction (no separate cmp).

---

BTW, with AVX, micro-fused loads are un-laminated on Haswell/Skylake.  e.g.

        vmovdqu    16(%rsi,%rdi,2), %xmm0
        vpackuswb  (%rsi,%rdi,2), %xmm0, %xmm1
        vmovdqu    %xmm1, (%rdi)

is 3 fused-domain uops in the decoders/uop cache, but it's 4 fused-domain uops
for the issue/rename stage and in the ROB: the vpackuswb un-laminates.
https://stackoverflow.com/questions/26046634/micro-fusion-and-addressing-modes#comment76198723_31027695

So if unrolling with AVX, it's better to do what gcc does and increment 2
separate pointers.  Then we can actually keep the back-end fed and bottleneck
on load throughput, store throughput, and shuffle throughput.  gcc can unroll
this loop (but clang can't, maybe confused by the use of integers as
pointers).

Non-AVX packuswb (%rsi,%rdi,2), %xmm0 could stay micro-fused, because it's a
2-operand instruction with a read-modify destination (not write-only like
pabsb).  But we can't use it here because it requires alignment.  (Of course,
with an aligned load instead of loadu, this indexing trick would still be a
win.)
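To illustrate that last point, here is a sketch (not part of the report,
function name made up) of the same loop with aligned loads, assuming src is
16-byte aligned and bytes is a multiple of 16 so every load address stays
aligned.  With _mm_load_si128 a non-AVX compiler would be allowed to fold one
of the loads into a micro-fused packuswb (%rsi,%rdi,2), %xmm0 memory operand.

// Hypothetical sketch (same headers as above); requires 16-byte-aligned src
// and bytes % 16 == 0.  The store stays unaligned-safe and simple-addressed.
void pack_high8_indexed_src_aligned(uint8_t *restrict dst,
                                    const uint16_t *restrict src, size_t bytes) {
  uintptr_t end_dst = (uintptr_t)(dst + bytes);
  uintptr_t srcu = (uintptr_t)src, dstu = (uintptr_t)dst;
  ptrdiff_t src_dst_offset = srcu - 2*dstu;
  do{
     __m128i v0 = _mm_load_si128((const __m128i*)(dstu*2+src_dst_offset));
     __m128i v1 = _mm_load_si128((const __m128i*)(dstu*2+src_dst_offset)+1);
     __m128i res = _mm_packus_epi16(v1, v0);  // the v0 load is the one that
                                              // could fold into packuswb's
                                              // memory operand
     _mm_storeu_si128((__m128i*)dstu, res);
     dstu += 16;      // 16 bytes of dst, 32 bytes of src
  } while(dstu < end_dst);
}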