https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
Bug ID: 89071
Summary: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double
Product: gcc
Version: 9.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---

float cvt(double unused, double xmm1) { return xmm1; }

g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental):

        vxorps    %xmm0, %xmm0, %xmm0
        vcvtsd2ss %xmm1, %xmm0, %xmm0    # merge into XMM0

clang 7.0:

        vcvtsd2ss %xmm1, %xmm1, %xmm0    # both sources are XMM1, no false dep

gcc already uses this trick for SQRTSS/SD, but not for float<->double conversion.  I haven't checked all the other scalar instructions, but ROUNDSS for floor() does neither (no dep-breaking xor, and not SRC1=SRC2), so it has a false dependency: it chooses the output register as the merge target instead of the actual input.

    return floorf(x);    ->    vroundss $9, %xmm1, %xmm0, %xmm0

Some testcases: https://godbolt.org/z/-rqUVZ

---

In SSE, one-input scalar instructions like CVT* and SQRTSS/SD have an output dependency because of Intel's short-sighted ISA design optimizing for Pentium III's 64-bit SIMD: zero-extending to fill the destination XMM register would have cost an extra uop to write the upper half of the destination.

For consistency(?), SSE2 scalar instructions (new with Pentium 4, which had 128-bit SIMD execution units / register file) have the same behaviour of merging into the low 64 bits of the destination -- even conversion between double and float between two XMM registers, which didn't exist before SSE2.  (Previously, conversion instructions only went between float in XMM and integers in scalar or MMX regs, or packed-integer <-> ps, which filled the whole XMM reg and thus avoided a false dependency.)

(Fortunately this isn't a problem for 2-input instructions like ADDSS: the operation already depends on both registers.)

---

The VEX encoding makes the merge target separate from the actual destination, so when the source is already in an XMM register we can finally avoid false dependencies without wasting an instruction breaking them.  For instructions where the source isn't an XMM register (e.g. memory, or an integer reg for int->FP conversions), one zeroed register can be used as a read-only merge target by any number of scalar AVX instructions, including in a loop.  That's bug 80571.

(It's unfortunate that Intel didn't take the opportunity to give the AVX versions subtly different semantics and zero-extend into the target register.  That would probably have enabled vcvtsd2ss to be single-uop instead of 2 on Sandybridge-family.  IDK if they didn't think of that, or if they wanted strict consistency with the semantics of the SSE version, or if they thought decoding / internals would be easier if they didn't have to omit the merge-into-destination part of the scalar operation.  At least they made the extra dependency an explicit input, so we can choose a register other than the destination, but it's so rarely useful to actually merge into the low 64 or 32 bits of another reg that it's just long-term harmful to gimp the ISA with an extra dependency for these instructions, especially integer->FP.)

(I suspect that most of the dep-breaking gcc does isn't gaining any speed, but the trick is figuring out when we can omit it while being sure that we don't couple things into one big loop-carried dependency chain, or serialize things that OoO exec could otherwise benefit from hiding.  Within one function with no calls, we might be able to prove that a false dep isn't serializing anything important (e.g. if there's already enough ILP and something else breaks a dep on that register between loop iterations), but in general it's hard unless we can pick a register that was already part of the dep chain that led to the input of this operation, and is thus harmless to introduce a dep on.)
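As a concrete example of the loop-carried-chain concern, here's a minimal sketch (mine, not one of the Godbolt testcases above):

    /* Each iteration's conversion is independent, so out-of-order exec
       should overlap them.  If every cvtsd2ss merges into the same
       destination register, iteration i+1's conversion can't start
       until iteration i's result is written -- unless gcc spends an
       extra vxorps per iteration to break the dep.  With AVX, a single
       zeroed register kept as a read-only SRC1 solves it for free.  */
    void cvt_array(float *dst, const double *src, long n)
    {
        for (long i = 0; i < n; i++)
            dst[i] = (float)src[i];    /* one cvtsd2ss per element */
    }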
----

Relevant instructions that can exist in scalar xmm,xmm form:

  VROUNDSS/SD  (gcc leaves a false dep, clang gets it right)
  VSQRTSS/SD   (gcc already gets this right)
  VRCPSS / VRSQRTSS  (haven't checked)

  [V]CVTSS2SD xmm,xmm  (Skylake: the SRC1/output dependency is on a
    separate 1c-latency 32-bit merge uop.)  The memory-source version
    is still 2 uops.

  [V]CVTSD2SS xmm,xmm  (Skylake: the SRC1/output dependency is on the
    main 4c conversion uop; the extra uop comes first, maybe extracting
    32 bits from the src?)

The memory-source version of [V]CVTSD2SS is only 1 uop!  So avoiding a false dep by loading with MOVSS/MOVSD and then using the reg-reg version is a bad idea for CVTSD2SS.  It's actually much better to PXOR and then CVTSD2SS (mem), %xmm, so clang's strategy of loading and then doing a reg-reg conversion is a missed optimization.  (I haven't checked micro-fusion of indexed addressing modes for either of those.)

It doesn't look like a scalar load and then a packed conversion would be good either: CVTPS2PD and CVTPD2PS xmm,xmm both need a port-5 uop as well as the FMA-port uop that actually does the conversion.  (Maybe to shuffle from/to 2 floats in the low 64 bits vs. 2 doubles filling the whole register?)

---

I've mostly only looked at Skylake numbers for this, not AMD or KNL, or earlier Intel.  Using the same input as both source operands when the source is an XMM register is pure win everywhere, though, when the VEX encoding is available.  Tricks for what to do without AVX, or with a memory source, might depend on -mtune, but this bug report is about making sure we use SRC1=SRC2 for one-input V...SS and V...SD instructions, choosing an input that's already needed as the merge target to grab the upper bits from.

----

Other conversions are only packed, or only between GP-reg/memory and XMM.

Note that Agner Fog's Skylake numbers for CVTSI2SD are wrong: it's 1c throughput for the xmm,r32/r64 form and 0.5c throughput for xmm,m32/m64.  He lists it as 2c throughput for both.

CVTDQ2PD with a memory source is 1 micro-fused uop, not 2.  The xmm,xmm version is 2 uops, including a port-5 shuffle(?).  I guess the memory-source version uses a broadcast load to get the data where the FMA/convert unit wants it.

Other than that, Agner Fog's numbers (https://agner.org/optimize) for Skylake match my testing with perf counters.
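To make the memory-source CVTSD2SS tradeoff above concrete, here's a hand-written sketch of the two strategies (not compiler output; it assumes the double is at (%rdi), and the uop counts are the Skylake numbers quoted above):

    # clang-style: scalar load, then reg-reg convert.
    # 1 load uop + 2 uops for the reg-reg vcvtsd2ss, even though
    # SRC1=SRC2 avoids the false dependency.
        vmovsd    (%rdi), %xmm1
        vcvtsd2ss %xmm1, %xmm1, %xmm0

    # better on SKL: break the dep once, then use the 1-uop
    # memory-source form.  In a loop, the vxorps (or a separate
    # zeroed register used as a read-only SRC1) can be hoisted out.
        vxorps    %xmm0, %xmm0, %xmm0
        vcvtsd2ss (%rdi), %xmm0, %xmm0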