https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570
Bug ID: 80570
Summary: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

When auto-vectorizing int->double conversion, gcc loads a full-width vector into a register and then unpacks the upper half to feed (v)cvtdq2pd. E.g. with AVX, we get a 256b load and then vextracti128. It's even worse with an unaligned source pointer and -mavx256-split-unaligned-load, where it does vinsertf128 -> vextractf128 without ever doing anything with the full 256b vector!

On Intel SnB-family CPUs, this bottlenecks the loop on port5 throughput, because VCVTDQ2PD reg -> reg needs a port5 uop as well as a port1 uop (and vextracti128 can only run on the shuffle unit on port5). VCVTDQ2PD with a memory source operand doesn't need the shuffle port at all on Intel Haswell and later, just the FP-add unit and a load, so it's a much better choice (throughput of one per clock on Sandybridge and Haswell, 2 per clock on Skylake). It's still 2 fused-domain uops, though (or 3 on SnB), so according to Agner Fog's testing I guess it can't micro-fuse the load.

I'm pretty sure using twice as many half-width memory operands is not worse on other AVX CPUs either (AMD BD-family or Zen, or KNL), vs. max-width loads and extracting the high half.

void cvti32f64_loop(double *dp, int *ip) {
    // ICC avoids the mistake when it doesn't emit a prologue to align the pointers
#ifdef __GNUC__
    dp = __builtin_assume_aligned(dp, 64);
    ip = __builtin_assume_aligned(ip, 64);
#endif
    for (int i = 0; i < 10000; i++) {
        double tmp = ip[i];
        dp[i] = tmp;
    }
}

https://godbolt.org/g/329C3P
gcc.godbolt.org's "gcc7" snapshot: g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental)

gcc -O3 -march=sandybridge:

cvti32f64_loop:
        xorl    %eax, %eax
.L2:
        vmovdqa (%rsi,%rax), %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rdi,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rdi,%rax,2)
        addq    $32, %rax
        cmpq    $40000, %rax
        jne     .L2
        vzeroupper
        ret

gcc does the same thing for -march=haswell, but uses vextracti128. This is obviously really silly.

For comparison, clang 4.0 -O3 -march=sandybridge -fno-unroll-loops emits:

        xorl    %eax, %eax
.LBB0_1:
        vcvtdq2pd       (%rsi,%rax,4), %ymm0
        vmovaps %ymm0, (%rdi,%rax,8)
        addq    $4, %rax
        cmpq    $10000, %rax            # imm = 0x2710
        jne     .LBB0_1
        vzeroupper
        retq

This should come close to one 256b store per clock (on Haswell), even with unrolling disabled.

----

With -march=nehalem, gcc gets away with it for this simple not-unrolled loop (without hurting throughput, I think), but only because this strategy effectively unrolls the loop (doing two stores per add + cmp/jne), and Nehalem can run shuffles on two execution ports (so the pshufd can run on port1 while the cvtdq2pd runs on ports 1+5). So it's 10 fused-domain uops per 2 stores instead of 5 per 1 store. Depending on how the loop buffer handles non-multiple-of-4 uop counts, this might be a wash. (Of course, with any other work in the loop, or with unrolling, the memory-operand strategy is much better.)

CVTDQ2PD's memory operand is only 64 bits, so even the non-AVX version doesn't fault if misaligned.
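For reference, a minimal intrinsics sketch of the codegen this report is asking for: each 128b chunk of ints feeds vcvtdq2pd directly (ideally folded into its memory operand), with no lane-extract shuffle. The function name and the n%4==0 / alignment assumptions are mine, not part of the testcase, and whether the load actually folds into the conversion is up to the compiler:

#include <immintrin.h>
#include <stddef.h>

// Sketch of the desired AVX pattern: a 128b load feeds
// vcvtdq2pd ymm, xmm/m128, so no vextractf128 / shuffle-port uop is needed.
// Assumes n % 4 == 0, 16B-aligned ip, 32B-aligned dp.  Compile with -mavx.
static void cvti32f64_halfwidth(double *dp, const int *ip, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_load_si128((const __m128i *)(ip + i)); // 128b load
        __m256d d = _mm256_cvtepi32_pd(v);                     // vcvtdq2pd ymm, xmm/m128
        _mm256_store_pd(dp + i, d);                            // 256b store
    }
}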
------

It's even more horrible without aligned pointers, when the sandybridge version (which splits unaligned 256b loads/stores) uses vinsertf128 to emulate a 256b load, and then does vextractf128 right away:

inner_loop:    # gcc8 -march=sandybridge without __builtin_assume_aligned
        vmovdqu (%r8,%rax), %xmm0
        vinsertf128     $0x1, 16(%r8,%rax), %ymm0, %ymm0
        vcvtdq2pd       %xmm0, %ymm1
        vextractf128    $0x1, %ymm0, %xmm0
        vmovapd %ymm1, (%rcx,%rax,2)
        vcvtdq2pd       %xmm0, %ymm0
        vmovapd %ymm0, 32(%rcx,%rax,2)

This is obviously really, really bad, and should probably be checked for and avoided in case there are things other than int->double auto-vectorization that could lead to doing this.

---

With -march=skylake-avx512, gcc does the AVX512 version of the same thing: a zmm load and then extracting the upper 256b.
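The same idea extends to the AVX512 case. A hedged sketch (again, the function name and alignment assumptions are mine) of the pattern where a 256b load feeds vcvtdq2pd zmm, ymm/m256 directly, instead of a zmm load followed by extracting the upper 256b:

#include <immintrin.h>
#include <stddef.h>

// Sketch of the analogous AVX512 pattern: a 256b load feeds
// vcvtdq2pd zmm, ymm/m256, avoiding a full 512b load plus a 256b extract.
// Assumes n % 8 == 0, 32B-aligned ip, 64B-aligned dp.  Compile with -mavx512f.
static void cvti32f64_avx512(double *dp, const int *ip, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_load_si256((const __m256i *)(ip + i)); // 256b load
        __m512d d = _mm512_cvtepi32_pd(v);                        // vcvtdq2pd zmm, ymm/m256
        _mm512_store_pd(dp + i, d);                               // 512b store
    }
}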