https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #15 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instructions, when they
> operate with memory input.

They *do* have an output dependency.  It might or might not actually be a
problem worth clogging the front-end with extra uops to avoid, depending on
the surrounding code. >.<

e.g. ROUNDSD:  DEST[127:64] remains unchanged
Thanks, Intel.  You'd think by SSE4.1 they would have learned that false
dependencies suck, and that it's extremely rare to actually take advantage of
this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS which write
the full destination register and have identical performance on all CPUs that
support them.  (Except Silvermont, where roundps/pd have 5c latency vs. 4c for
roundss/sd.  Goldmont makes them equal.)  KNL has faster (V)ROUNDPS/D than
ROUNDSS/SD, maybe only because of the SSE encoding?  Agner Fog isn't clear, and
doesn't have an entry that would match vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions, which can make
SSE code front-end bound and reduce the effective size of the out-of-order
window.
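
For illustration only (a hedged sketch with intrinsics; the helper names are
invented and this isn't what GCC emits today), the register-source
copy-and-round difference looks like this:

#include <immintrin.h>

/* Sketch: round a double that's already in an XMM register.
   The packed form writes all 128 bits of the destination, so there's no
   merge into the old value and no extra MOVAPS needed for copy-and-round. */
static inline double round_copy_packed(__m128d src)
{
    return _mm_cvtsd_f64(_mm_round_pd(src, _MM_FROUND_TO_NEAREST_INT));
    /* roundpd $0, %xmm_src, %xmm_dst: full-width write */
}

/* The scalar form merges into DEST[127:64], so the destination register is
   also an input: the compiler must reuse src as the destination or copy
   (or xor-zero) it first. */
static inline double round_copy_scalar(__m128d src)
{
    return _mm_cvtsd_f64(_mm_round_sd(src, src, _MM_FROUND_TO_NEAREST_INT));
}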

Preserving FP exception semantics when using packed instead of scalar
instructions with a register source:

* if the upper element(s) of the source are known to be 0, we can always do
this for sqrt, round, and the conversions: they won't produce any FP
exceptions, not even Inexact.  (But not rsqrt / rcpps, of course.)
  This will be the case after a scalar load, so if we need the original value
in memory *and* the result of one of these instructions, we're all set (see
the sketch below).

* with rounding, the immediate can control masking of the Precision exception,
but not Invalid, which an SNaN source always raises.  If we can rule out SNaN
in the upper elements of the input, we can use ROUNDPS / ROUNDPD.

roundps/d can't produce a denormal output.  I don't think denormal inputs slow
it down on any CPUs, but that's worth checking for cases where we don't care
about preserving exception semantics and want to use it with
potentially-arbitrary garbage in high elements.
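
As a minimal sketch of the "upper elements known 0" case above (intrinsics
only, hypothetical helper name):

#include <immintrin.h>

/* After a scalar load the upper 64 bits are zero, so sqrtpd on the whole
   register raises no extra FP exceptions (sqrt(+0.0) is exact), and the
   full-width write avoids the scalar instruction's output dependency. */
static inline double sqrt_from_mem(const double *p)
{
    __m128d v = _mm_load_sd(p);           /* movsd: zeroes bits 127:64 */
    return _mm_cvtsd_f64(_mm_sqrt_pd(v)); /* sqrtpd instead of sqrtsd  */
}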


rsqrtps can't produce a denormal output because sqrt makes the output closer to
1.0 (reducing the magnitude of the exponent).  (And thus neither can sqrtps.) 
SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that
crack 128-bit ops into 64-bit are slower: Pentium III, Pentium M, and Bobcat. 
And Jaguar for sqrt.  Also Silvermont is *MUCH* slower for SQRTPD/PS than
SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than
scalar.

But RCPPS can produce a denormal.  (double)1.0/FLT_MAX = 2.938736e-39, which is
smaller than FLT_MIN = 1.175494e-38.
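
A quick standalone check of that arithmetic (not GCC code, just verifying the
numbers):

#include <float.h>
#include <stdio.h>

int main(void)
{
    double recip = 1.0 / (double)FLT_MAX;   /* ~2.938736e-39 */
    printf("1/FLT_MAX = %g\n", recip);
    printf("FLT_MIN   = %g\n", (double)FLT_MIN);
    printf("below FLT_MIN (i.e. subnormal as float)? %d\n",
           (float)recip < FLT_MIN);
    return 0;
}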

----

So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that supports them.
* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and
Bobcat, Nano 3000, and P4 Prescott/Nocona.  By about a factor of 2, enough that
we should probably care about it for tune=generic.  For single precision only
(not double), K10 and Jaguar also have slower sqrtps than sqrtss.  Also in
32-bit mode, P4, Pentium M and earlier Intel, and Atom, are much slower for
packed than scalar sqrt.
  SQRTPD is *faster* than SQRTSD on KNL.  (But hopefully we're never tuning for
KNL without AVX available.)

* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop so
a big decode stall).  Somewhat slower on Goldmont Plus (1 uop but half
throughput).  Also slower on Nano3000, and slightly slower on Pentium 4 (before
and after Prescott/Nocona), and KNL.  (But hopefully KNL can always use
VRSQRT28PS/PD or scalar.)
  Pentium M and older again decode as at least 2 uops for packed, same as
Bobcat and K8.
  Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, ryzen, Core2
and later, and SnB-family.

* CVTSS2SD vs. PD, and SD2SS vs. PD2PS
  packed is slower on K8, bdver1-4 (scalar avoids the shuffle uop), Nano3000,
and KNL.  On Silvermont only by 1 cycle of latency (so even a MOVAPS on the
critical path would make them equal).  Similar on Atom.  Slower on CPUs that
split 128-bit vectors into two 64-bit uops, like Bobcat, and Pentium M / K8 and
older.

  packed is *faster* on K10, Goldmont / Goldmont Plus (same latency, 1c vs. 2c
throughput), Prescott, and P4.  Much faster on Jaguar (1c vs. 8c throughput,
and 1 uop vs. 2).

  same speed (but without the false dep) for SnB-family (mostly), Core 2,
Ryzen.

  Odd stuff: Agner reports:
    Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c.  (I guess just
zero-padding the significand, no rounding required).  pd2ps and sd2ss are equal
at 2 uops / 4c latency.
    SnB: cvtpd2ps is 1c higher latency than sd2ss.
    IvB: ps2pd is 1c vs. 2c for ss2sd
    On HSW and later things have settled down to exactly the same.  I didn't
check instlatx64 or https://uops.info/
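
For reference, a hedged sketch (intrinsics, invented helper names) of the
register-source packed vs. scalar conversion being compared here:

#include <immintrin.h>

/* cvtps2pd converts the low two floats and writes the whole destination,
   so there's no merge into the previous destination value.  cvtss2sd keeps
   DEST[127:64], giving the false output dependency.  (With the packed form,
   garbage / SNaN / denormals in element 1 could raise exceptions or slow
   things down, per the caveats earlier in this comment.) */
static inline double f2d_packed(__m128 src)
{
    return _mm_cvtsd_f64(_mm_cvtps_pd(src));                    /* cvtps2pd */
}

static inline double f2d_scalar(__m128 src)
{
    return _mm_cvtsd_f64(_mm_cvtss_sd(_mm_setzero_pd(), src));  /* cvtss2sd */
}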

Not sure what to say about float <-> double conversions for tune=generic.
Bulldozer-family is still fairly relevant, and there scalar is significantly
faster and takes fewer uops.  But otherwise most of the CPUs where packed is
slower are not relevant for tune=generic.

Again this was *just* for the xmm,xmm versions, not memory source.
