https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
            Bug ID: 82459
           Summary: AVX512F instruction costs: vmovdqu8 stores may be an
                    extra uop, and vpmovwb is 2 uops on Skylake and not
                    always worth using
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

gcc bottlenecks on shuffle uops when auto-vectorizing this for skylake-avx512.

* Perhaps the cost model is wrong for vpmovwb (it's 2 port5 uops), or gcc
  doesn't consider any cheaper alternatives.  My version with 2x shift +
  vpacksswb + vpermq has 3x the theoretical throughput (with hot caches).
  In general, AVX512BW lane-crossing shuffles of 8 or 16-bit elements are
  multi-uop on SKX, but in-lane byte/word shuffles are single-uop just like
  their AVX2 versions.

* Using vmovdqu8 as a store costs a port5 ALU uop even with no masking,
  according to Intel (not tested).  We should always use AVX512F vmovdqu32
  or 64 for unmasked loads/stores, not AVX512BW vmovdqu8 or 16.  Intel's
  docs indicate that current hardware doesn't handle unmasked vmovdqu8/16
  as efficiently as 32/64, and there's no downside.

* Using vinserti64x4 instead of 2 separate stores is worse because it makes
  the shuffle bottleneck worse, and 2 stores wouldn't bottleneck on
  load/store throughput.  (Avoiding vpmovwb makes this moot in this case,
  but presumably whatever decided to shuffle + store instead of
  store + store will make that mistake in other cases too.)

SKX shuts down port 1 (except for scalar integer) when there are 512b uops
in flight, so extra loads/stores are relatively cheaper than using more ALU
uops, compared to 256b or 128b vectors where the back-end can keep up even
when 3 of the 4 uops per clock are vector-ALU (if they go to different
ports).

#include <stdint.h>
#include <stddef.h>

void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src, size_t bytes)
{
  uint8_t *end_dst = dst + bytes;
  do{
     *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

// https://godbolt.org/g/kXjEp1   gcc8 -O3 -march=skylake-avx512

.L5:                                      # inner loop
        vmovdqa64       (%rsi,%rax,2), %zmm0
        vmovdqa64       64(%rsi,%rax,2), %zmm1
        vpsrlw  $8, %zmm0, %zmm0          # memory operand not folded: bug 82370
        vpsrlw  $8, %zmm1, %zmm1
        vpmovwb %zmm0, %ymm0              # 2 uops each
        vpmovwb %zmm1, %ymm1
        vinserti64x4    $0x1, %ymm1, %zmm0, %zmm0
        vmovdqu8        %zmm0, (%rcx,%rax)   # Intel says this is worse than vmovdqu64
        addq    $64, %rax
        cmpq    %rax, %rdi                # using an indexed addr mode, but still doing separate add/cmp
        jne     .L5

IACA says gcc's loop will run at one 64B store per 6 clocks, bottlenecked on
6 port5 uops (including the vmovdqu8; vmovdqu64 gives one store per 5
clocks, still bottlenecked on port5).  Using 2 stores instead of
vinserti64x4 gives us one store per 4 clocks.  (Still twice as slow as with
vpacksswb + vpermq, which produces one 512b vector per 2 shuffle uops
instead of one 256b vector per 2 shuffle uops.)

See https://stackoverflow.com/questions/26021337/what-is-iaca-and-how-do-i-use-it
for more about Intel's static analysis tool.

related: pr 82370 mentions vectorization strategies for this.

Fortunately gcc doesn't unroll the startup loop to reach an alignment
boundary.  (And BTW, aligned pointers are more important with AVX512 than
AVX2, in my testing with manual vectorization of other code on
Skylake-avx512.)  Of course, a potentially-overlapping unaligned first
vector would be much better than a scalar loop here.
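For reference, here is roughly what the 2x shift + vpackuswb + vpermq
strategy looks like at the source level with intrinsics.  This is only a
sketch I'm adding for illustration (the function name and loop structure are
mine, not something gcc generates); it assumes `bytes` is a nonzero multiple
of 64, compilation with -march=skylake-avx512 (or at least -mavx512bw), and
it skips the unaligned-startup / cleanup handling:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void pack_high8_avx512bw_intrin(uint8_t *__restrict__ dst,
                                const uint16_t *__restrict__ src,
                                size_t bytes)
{
    // qword indices that undo vpackuswb's in-lane interleaving of the two inputs
    const __m512i fixup = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);
    for (size_t i = 0; i < bytes; i += 64) {
        __m512i a = _mm512_srli_epi16(_mm512_loadu_si512(src + i), 8);       // words i..i+31
        __m512i b = _mm512_srli_epi16(_mm512_loadu_si512(src + i + 32), 8);  // words i+32..i+63
        __m512i packed = _mm512_packus_epi16(a, b);        // vpackuswb: 1 in-lane shuffle uop
        packed = _mm512_permutexvar_epi64(fixup, packed);  // vpermq: lane-crossing fixup
        _mm512_storeu_si512(dst + i, packed);              // 64 output bytes per iteration
    }
}

Whether gcc keeps the loads folded into vpsrlw and picks non-indexed
addressing modes is a separate issue (bug 82370); the point here is the
shuffle-uop count per 64 output bytes: 1x vpackuswb + 1x vpermq instead of
2x 2-uop vpmovwb + vinserti64x4.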
----

Anyway, does gcc know that vpmovwb %zmm, %ymm is 2 uops for port 5, while
vpackuswb zmm,zmm,zmm (an in-lane 2-input shuffle) is 1 uop (for port 5)?
The xmm-source version of vpmovwb is single-uop, because it's in-lane.
Source: Intel's IACA 2.3, not tested on real hardware.

SKX port-assignment spreadsheet:
https://github.com/InstLatx64/InstLatx64/blob/master/AVX512_SKX_PortAssign_v102_PUB.ods
It's based on IACA output for uops, but throughputs and latencies are from
real hardware (AIDA64 InstLatx64), with a 2nd column for Intel's published
tput/latency (which as usual doesn't always match).  vpmovwb real throughput
is one per 2 clocks, which is consistent with being 2 uops for p5.

It makes some sense from a HW-design perspective that all lane-crossing
shuffles with element size smaller than 32-bit are multi-uop.  It's cool
that in-lane AVX512 vpshufb zmm / vpacksswb zmm are single-uop, but it means
it's often better to use more instructions to do the same work in fewer
total shuffle uops.  (Any loop that involves any shuffling can *easily*
bottleneck on shuffle throughput.)

Related: AVX512 merge-masking can turn an instruction into a 2-input
shuffle, since masked-off elements keep the destination's old values.  But
vpsrlw $8, 64(%rsi), %zmm0{%k1} doesn't work to get all our data into one
vector, because it masks at word granularity, not byte.  vmovdqu8 with
masking needs an ALU uop (on SKX according to IACA).

-------------

Here's a hand-crafted efficient version of the inner loop.  It doesn't use
any weird tricks (I haven't thought of any that are actually a win on SKX),
so it should be possible to get gcc to emit something like this.

.Lloop:
    vpsrlw    $8,  0(%rsi), %zmm0
    vpsrlw    $8, 64(%rsi), %zmm1
    vpackuswb %zmm1, %zmm0, %zmm0    # 1 uop for a 2-input shuffle
    vpermq    %zmm0, %zmm7, %zmm0    # lane-crossing fixup for vpackuswb (%zmm7 = shuffle control)
    vmovdqu64 %zmm0, (%rdi, %rdx)
    add       $(2*64), %rsi
    add       $64, %rdx              # counts up towards zero
    jnc       .Lloop

Note that the folded loads use non-indexed addressing modes so they can stay
micro-fused.  The store will stay micro-fused even with an indexed
addressing mode, so we can count an index up towards zero (and index from
the end of the array), saving a CMP instruction in the loop (add/jnc will
macro-fuse on SKX).  A second pointer-increment + cmp/jcc would be ok.

IACA thinks that indexed stores don't stay micro-fused, but that's only true
for SnB/IvB.  Testing on Haswell/Skylake (desktop) shows they do stay fused,
and IACA is wrong about that.  I think it's a good guess that AVX512 indexed
stores will not un-laminate either.

IACA analysis says it will run at 1x 64B store per 2 clocks.  If the store
stays micro-fused, it's really only 7 fused-domain uops per iteration
(3.5 per clock), so we have front-end bandwidth to spare and the only
bottleneck is on the ALU ports (p0 and p5, saturated with shift and shuffle
uops respectively).  Because of the ALU bottleneck, I didn't need the store
to be able to run on p7; p2/p3 only need to handle 1.5 uops each per 2
clocks.  Using a non-indexed store addressing mode would let it use p7, but
that increases loop overhead slightly.
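(Not shown in the loop: %zmm7 has to be preloaded with the vpermq control,
i.e. the qword indices 0,2,4,6,1,3,5,7 that undo vpackuswb's in-lane
interleave.  One way to do that outside the loop, as a sketch; the label
name is made up:)

        .section .rodata
        .p2align 6
.Lpermq_fixup:                  # qword indices for the vpackuswb fixup
        .quad   0, 2, 4, 6, 1, 3, 5, 7
        .text
        vmovdqa64 .Lpermq_fixup(%rip), %zmm7   # one-time load, hoisted out of the loop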
$ iaca -arch SKX testloop.o
Intel(R) Architecture Code Analyzer Version - 2.3 build:246dfea (Thu, 6 Jul 2017 13:38:05 +0300)
Analyzed File - testloop.o
Binary Format - 64Bit
Architecture  - SKX
Analysis Type - Throughput

Throughput Analysis Report
--------------------------
Block Throughput: 2.00 Cycles       Throughput Bottleneck: FrontEnd

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2  -  D   |  3  -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 2.0     0.0 | 1.0  | 1.5    1.0 | 1.5    1.0 | 1.0  | 2.0  | 1.0  | 0.0  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened

| Num Of |              Ports pressure in cycles               |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   2^   | 1.0       |     | 1.0   1.0 |           |     |     |     |     | CP | vpsrlw zmm0, zmmword ptr [rsi], 0x8
|   2^   | 1.0       |     |           | 1.0   1.0 |     |     |     |     | CP | vpsrlw zmm1, zmmword ptr [rsi+0x40], 0x8
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpackuswb zmm0, zmm0, zmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpermq zmm0, zmm0, zmm7
|   2    |           |     | 0.5       | 0.5       | 1.0 |     |     |     |    | vmovdqu64 zmmword ptr [rdi+rdx*1], zmm0
|   1    |           | 1.0 |           |           |     |     |     |     |    | add rsi, 0x80
|   1    |           |     |           |           |     |     | 1.0 |     |    | add rdx, 0x40
|   0F   |           |     |           |           |     |     |     |     |    | jnb 0xffffffffffffffd3

Total Num Of Uops: 10   (The count is unfused-domain, which is pretty dumb to total up.)

Source for this:

#if 1
#define IACA_start  mov $111, %ebx; .byte 0x64, 0x67, 0x90
#define IACA_end    mov $222, %ebx; .byte 0x64, 0x67, 0x90
#else
#define IACA_start
#define IACA_end
#endif

.global pack_high8_avx512bw
pack_high8_avx512bw:              # (dst, src, size)
#define unroll 1                  // unroll factor
.altmacro
.macro packblock count
.if \count
    packblock %(count-1)
.endif
#   vmovdqu8  \count*2*64 +  1(%rsi), %ymm0 {%k1}{z}   # IACA says: x/y/zmm load uses an ALU port with masking, otherwise not.
#   vmovdqu8  \count*2*64 + 65(%rsi), %ymm1 {%k1}{z}
    vpsrlw    $8, \count*2*64 +  0(%rsi), %zmm0
    vpsrlw    $8, \count*2*64 + 64(%rsi), %zmm1
    vpackuswb %zmm1, %zmm0, %zmm0          # 1 uop for a 2-input shuffle
    vpermq    %zmm0, %zmm7, %zmm0          # lane-crossing fixup for vpackuswb (%zmm7 = shuffle control)
    vmovdqu64 %zmm0, 64*\count(%rdi, %rdx)
.endm

    # set up %zmm7
#   movabs  $0x5555555555555555, %rax
#   kmovq   %rax, %k1                # for offset vmovdqu8 instead of shift

#   mov     $1024, %rdx
    add     %rdx, %rdi
    neg     %rdx     # then use (%rdi, %rdx) for store addresses?  No port 7, but should micro-fuse.

    IACA_start
.Lloop:
    packblock (unroll-1)
    add     $(unroll*2*64), %rsi
#   add     $(unroll*1*64), %rdi
    add     $(unroll*1*64), %rdx
    jnc     .Lloop
    IACA_end

    # not shown: unaligned and/or cleanup loop
    ret
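For completeness, a quick correctness check of the intrinsics sketch from
earlier against the scalar baseline (a throwaway harness I'm adding, not
part of the original testcase; it assumes both functions above are compiled
and linked in):

#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <stddef.h>

extern void pack_high8_baseline(uint8_t *, const uint16_t *, size_t);
extern void pack_high8_avx512bw_intrin(uint8_t *, const uint16_t *, size_t);

int main(void)
{
    enum { N = 4096 };                 // multiple of 64, so no cleanup loop needed
    static uint16_t src[N];
    static uint8_t ref[N], out[N];
    for (int i = 0; i < N; i++)
        src[i] = (uint16_t)(i * 2654435761u);   // arbitrary test pattern
    pack_high8_baseline(ref, src, N);
    pack_high8_avx512bw_intrin(out, src, N);
    printf("%s\n", memcmp(ref, out, N) ? "MISMATCH" : "ok");
    return 0;
}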