https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571
Bug ID: 80571
Summary: AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

See also more discussion on a clang bug about what's optimal for scalar int->fp conversion (https://bugs.llvm.org/show_bug.cgi?id=22024#c11). The important point is that using vcvtsi2sd/ss with a source vector reg different from the destination reg allows reusing the same "safe" source to avoid false dependencies, needing only one vxorps for multiple conversions (even hoisted out of loops).

// see https://godbolt.org/g/c9k4SH for gcc8, clang4, icc17, and MSVC CL
void cvti64f64_loop(double *dp, long long *ip) {
    for (int i=0; i<1000; i++) {
        dp[i] = ip[i];    // int64 can't vectorize without AVX512
    }
}

compiles with gcc6.3.1 and gcc8 20170429 -O3 -march=haswell to:

cvti64f64_loop:
        xorl    %eax, %eax
.L6:
        vxorpd  %xmm0, %xmm0, %xmm0
        vcvtsi2sdq      (%rsi,%rax), %xmm0, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret

But it could compile to:

        xorl    %eax, %eax
        vxorpd  %xmm1, %xmm1, %xmm1    #### Hoisted out of the loop
.L6:
        vcvtsi2sdq      (%rsi,%rax), %xmm1, %xmm0
        vmovsd  %xmm0, (%rdi,%rax)
        addq    $8, %rax
        cmpq    $8000, %rax
        jne     .L6
        ret

----
This trick requires AVX, of course. If the loop needs a vector constant, that register can be used instead of a specially-zeroed one. (The upper elements don't have to be 0.0 for ...ss instructions, and the x86-64 SysV ABI allows passing scalar float/double args with garbage (not zeros) in the high bytes. This already happens in some code, so library implementers already have to avoid assuming that high elements are always clear, even in hand-written asm.)
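The desired codegen can be expressed directly at the source level with SSE2 intrinsics: _mm_cvtsi64_sd takes an explicit merge-source operand, so one zeroed register can be hoisted and reused for every conversion. This is a sketch I'm adding for illustration (x86-64 only; the function name is hypothetical, not from the bug report):

```c
#include <emmintrin.h>   // SSE2 intrinsics; compiles to vcvtsi2sdq with -mavx

// Same conversion loop as above, but spelled so the "safe" merge source
// is a single hoisted zeroed register instead of a fresh vxorpd per iteration.
void cvti64f64_loop_intrin(double *dp, const long long *ip) {
    const __m128d safe = _mm_setzero_pd();   // one xor-zeroing, hoistable
    for (int i = 0; i < 1000; i++) {
        // Merge into 'safe': no false dependency on the previous iteration's result.
        dp[i] = _mm_cvtsd_f64(_mm_cvtsi64_sd(safe, ip[i]));
    }
}
```

Whether the compiler actually keeps `safe` live in one register across the loop is up to register allocation, but the semantics match the hand-written asm above.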
clang already hoists vxorps out of loops, but only by looking for "dead" registers, not by reusing a reg holding a constant. (e.g. enabling those multiplies in the godbolt link to use up all the xmm regs gets clang to put a vxorps in the loop instead of merging into a reg holding one of them.)

Using a constant has the downside that loading the constant might cache-miss, delaying a bunch of loop-setup work (and loop setup is where int->float is probably most common). So maybe it would be best to vxorps-zero a register and use it for int->FP conversion if several other instructions depend on that result before any depend on the constant. The zeroed register can later have a constant loaded into it, and that reg can then serve as the safe source register inside the loop.

Or if there's a constant that's about to be used with the result of the int->fp conversion, it's maybe not bad to load that constant and then use that xmm reg as the merge-target for a scalar conversion. If OOO execution can do the load early (and it doesn't miss in cache), there's no extra latency. And if it does miss, there's only an extra cycle or two of latency on that dependency chain vs. spending an instruction on vxorpd-zeroing a target to convert into, separate from loading the constant.
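The "reuse a constant as the merge target" idea can also be sketched with intrinsics. This is my own illustration (hypothetical function, not from the report), assuming a loop that scales converted values by 1.5: the register holding the constant doubles as the safe source for cvtsi2sd, so no separate xor-zeroing is needed:

```c
#include <emmintrin.h>   // SSE2 intrinsics (x86-64 for _mm_cvtsi64_sd)

// dp[i] = (double)ip[i] * 1.5, with the constant's register reused
// as the merge target for the int->fp conversion.
void scale_i64_loop(double *dp, const long long *ip, int n) {
    const __m128d k = _mm_set_sd(1.5);   // constant load, hoisted; any known value is "safe"
    for (int i = 0; i < n; i++) {
        __m128d v = _mm_cvtsi64_sd(k, ip[i]);     // merge into k's reg: no false dep
        dp[i] = _mm_cvtsd_f64(_mm_mul_sd(v, k));  // low element: (double)ip[i] * 1.5
    }
}
```

The conversion's dependency chain starts at the constant load rather than at a vxorpd, which is exactly the latency trade-off described above.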
----
This trick works for non-loops, of course:

void cvti32f32(float *A, float *B, int x, int y) {
    *B = y;
    *A = x;
}

currently compiles to (-O3 -march=haswell):

cvti32f32:
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtsi2ss       %ecx, %xmm0, %xmm0
        vmovss  %xmm0, (%rsi)
        vxorps  %xmm0, %xmm0, %xmm0
        vcvtsi2ss       %edx, %xmm0, %xmm0
        vmovss  %xmm0, (%rdi)

But it could compile (with no loss of safety against false deps) to:

cvti32f32:
        vxorps  %xmm1, %xmm1, %xmm1    # reused for both
        vcvtsi2ss       %ecx, %xmm1, %xmm0
        vmovss  %xmm0, (%rsi)
        vcvtsi2ss       %edx, %xmm1, %xmm0   # this convert can go into xmm1 if we want both in regs at once
        vmovss  %xmm0, (%rdi)
        ret

Or, with the same amount of safety and fewer fused-domain uops on Intel SnB-family CPUs, and not requiring AVX (but of course working fine with AVX):

cvti32f32:
        movd    %ecx, %xmm0         # breaks deps on xmm0
        cvtdq2ps        %xmm0, %xmm0    # 1 uop for port1
        movss   %xmm0, (%rsi)
        movd    %edx, %xmm0
        cvtdq2ps        %xmm0, %xmm0
        movss   %xmm0, (%rdi)
        ret

But this is only good for 32-bit integer -> float, not double. (Because cvtdq2pd also takes a port5 uop, unlike conversion to ps.) The code-size is also slightly larger, taking 2 instructions per conversion instead of 1 for AVX when a safe register to merge into is already available.

See full analysis of this (uop counts and stuff for Intel SnB-family) at the bottom of this comment: https://bugs.llvm.org/show_bug.cgi?id=22024#c11
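The non-AVX movd + cvtdq2ps sequence also maps directly onto SSE2 intrinsics. A sketch I'm adding for illustration (hypothetical names, not from the report): _mm_cvtsi32_si128 is movd, which writes the full register and so breaks any dependency, and _mm_cvtepi32_ps is cvtdq2ps:

```c
#include <emmintrin.h>   // SSE2 intrinsics

// int -> float without a merge operand: movd zero-extends into the
// whole xmm register (dep-breaking), then cvtdq2ps converts the low lane.
static float cvt_i32_f32(int x) {
    __m128i v = _mm_cvtsi32_si128(x);          // movd: no false dependency
    return _mm_cvtss_f32(_mm_cvtepi32_ps(v));  // cvtdq2ps, extract low float
}

void cvti32f32_intrin(float *A, float *B, int x, int y) {
    *B = cvt_i32_f32(y);
    *A = cvt_i32_f32(x);
}
```

As noted above, this is only attractive for conversion to float; the analogous cvtdq2pd route to double costs an extra port5 uop on SnB-family.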