https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

            Bug ID: 89071
           Summary: AVX vcvtsd2ss lets us avoid PXOR dependency breaking
                    for scalar float<->double
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

float cvt(double unused, double xmm1) { return xmm1; }

g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental):

        vxorps  %xmm0, %xmm0, %xmm0
        vcvtsd2ss       %xmm1, %xmm0, %xmm0    # merge into XMM0

clang7.0
        vcvtsd2ss       %xmm1, %xmm1, %xmm0    # both sources are from XMM1, no false dep

gcc already uses this trick for SQRTSS/SD, but not for float<->double
conversion.  I haven't checked all the other scalar instructions, but roundss
for floor() does neither (no xor-zeroing and no src1=src2), so it has a false
dependency: it chooses the output register as the merge target, not the actual
input.

 return floorf(x);  ->   vroundss        $9, %xmm1, %xmm0, %xmm0

Some testcases:

https://godbolt.org/z/-rqUVZ


---

In SSE, one-input scalar instructions like CVT* and SQRTSS/SD have an output
dependency because of Intel's short-sighted ISA design optimizing for
Pentium-III's 64-bit SIMD: zero-extending to fill the destination XMM register
would have cost an extra uop to write the upper half of the destination.

For consistency(?), SSE2 scalar instructions (new with Pentium 4, which had
128-bit SIMD execution units / register file) have the same behaviour of
merging into the low 64 bits of the destination, even for double<->float
conversion between two xmm registers, which didn't exist before SSE2.
(Previously, conversion instructions only went between float in XMM and
integers in scalar or MMX regs, or packed-integer <-> ps, which fills the
whole XMM reg and thus avoids a false dependency.)
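The merge semantics are visible from intrinsics; a minimal sketch
(cvtsd2ss_lane is an illustrative helper, not from this report):

```c
#include <immintrin.h>

/* Lane i of the cvtsd2ss result: lane 0 is the converted low double of b,
   lanes 1..3 merge through unchanged from a, so the output depends on a. */
static float cvtsd2ss_lane(int i) {
    __m128  a = _mm_setr_ps(1.0f, 2.0f, 3.0f, 4.0f);
    __m128d b = _mm_setr_pd(5.0, 6.0);
    float out[4];
    _mm_storeu_ps(out, _mm_cvtsd_ss(a, b));   /* cvtsd2ss %xmm_b, %xmm_a */
    return out[i];
}
```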

(Fortunately this isn't a problem for 2-input instructions like ADDSS: the
operation already depends on both registers.)

---

The VEX encoding makes the merge-target separate from the actual destination,
so we can finally avoid false dependencies without wasting an instruction on
breaking them, at least when the source is already in an XMM register.


For instructions where the source isn't an XMM register (e.g. memory or integer
reg for int->FP conversions), one zeroed register can be used as a read-only
merge target by any number of scalar AVX instructions, including in a loop. 
That's bug 80571.
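A sketch of that pattern for int->FP in a loop (register choice and addressing
mode are illustrative, not from the report):

        vpxor   %xmm15, %xmm15, %xmm15            # zeroed once, before the loop
.Lloop:
        vcvtsi2sdl (%rdi,%rcx,4), %xmm15, %xmm0   # xmm15 is a read-only merge
                                                  # target: no loop-carried false dep
        ...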


(It's unfortunate that Intel didn't take the opportunity to give the AVX
versions subtly different semantics, and zero-extend into the target register. 
That would probably have enabled vcvtsd2ss to be single-uop instead of 2 on
Sandybridge-family.  IDK if they didn't think of that, or if they wanted strict
consistency with the semantics of the SSE version, or if they thought decoding
/ internals would be easier if they didn't have to omit the
merge-into-destination part of the scalar operation.  At least they made the
extra dependency an explicit input, so we can choose a register other than the
destination, but it's so rarely useful to actually merge into the low 64 or 32
of another reg that it's just long-term harmful to gimp the ISA with an extra
dependency for these instructions, especially integer->FP.)



(I suspect that most of the dep-breaking gcc does isn't gaining any speed, but
the trick is figuring out when we can omit it while being sure that we don't
couple things into one big loop-carried chain, or serialize some things that
OoO exec could otherwise benefit from hiding.  Within one function with no
calls, we might be able to prove that a false dep isn't serializing anything
important (e.g. if there's already enough ILP and something else breaks a dep
on that register between loop iterations), but in general it's hard if we can't
pick a register that was already part of the dep chain that led to the input
for this operation, and thus is harmless to introduce a dep on.)

----

Relevant instructions that can exist in scalar xmm,xmm form:

VROUNDSS/SD  (gcc leaves a false dep, clang gets it right)

VSQRTSS/SD  (gcc already gets this right)
VRCPSS
VRSQRTSS  haven't checked

[V]CVTSS2SD xmm,xmm  (Skylake: SRC1/output dependency is a separate 1c latency
32-bit merge uop)
  The memory-source version is still 2 uops.

[V]CVTSD2SS xmm,xmm  (Skylake: SRC1/output dependency is the main 4c conversion
uop, the extra uop is first, maybe extracting 32 bits from the src?)
 The memory-source version of [V]CVTSD2SS is only 1 uop!

So avoiding a false dep by loading with MOVSS/MOVSD and then using the reg-reg
version is a bad idea for CVTSD2SS.  It's actually much better to PXOR and then
CVTSD2SS (mem), %xmm, so clang's strategy of loading and then reg-reg
conversion is a missed-optimization.
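For a memory-source double, the two strategies look like this (sketch; uop
counts from the Skylake numbers above):

        # load, then reg-reg convert (clang): 1 + 2 uops
        vmovsd    (%rdi), %xmm1
        vcvtsd2ss %xmm1, %xmm1, %xmm0

        # dep-break, then 1-uop memory-source convert: the vxorps is off
        # the critical path and can be hoisted or shared
        vxorps    %xmm0, %xmm0, %xmm0
        vcvtsd2ss (%rdi), %xmm0, %xmm0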

I haven't checked on micro-fusion of indexed addressing modes with either of
those.

A scalar load and then a packed conversion doesn't look like a good option
either.  CVTPS2PD and CVTPD2PS xmm,xmm both need a port 5 uop as well as the
FMA-port uop that actually does the conversion.  (Maybe to shuffle from/to 2
floats in the low 64 bits vs. 2 doubles filling a register?)
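i.e. that rejected alternative would look like (sketch):

        vmovsd    (%rdi), %xmm0     # scalar load, zero-extends the register
        vcvtpd2ps %xmm0, %xmm0      # 2 uops on SKL: FMA-port convert + port 5 shuffle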

---

I've mostly only looked at Skylake numbers for this, not AMD or KNL, or earlier
Intel.  Using the same input as both source operands when the source is XMM is
just pure win everywhere, though, when VEX encoding is available.

Tricks for what to do without AVX, or with a memory source, might depend on
-mtune, but this bug report is about making sure we use SRC1=SRC2 for one-input
V...SS and V...SD instructions, choosing the input that's already needed as the
merge target to grab the upper bits from.


----

Other conversions are only packed, or only between GP/mem and XMM.  Note that
Agner Fog's Skylake numbers for CVTSI2SD are wrong: it's 1c throughput for
xmm,r32/r64 and 0.5c throughput for xmm,m32/m64.  He lists it as 2c throughput
for both.

CVTDQ2PD with a memory source is 1 micro-fused uop, not 2.  The xmm,xmm version
is 2 uops, including a port 5 shuffle(?).  I guess the memory-source version
uses a broadcast load to get the data where the FMA/convert unit wants it.

Other than that, Agner Fog's numbers (https://agner.org/optimize) for Skylake
match my testing with perf counters.
