https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #15 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instrucitons, when they
> operate with memory input.

They *do* have an output dependency.  It might or might not actually be a
problem worth clogging the front-end with extra uops to avoid, depending on
surrounding code. >.<

e.g. ROUNDSD:
    DEST[127:64] remains unchanged

Thanks, Intel.  You'd think by SSE4.1 they would have learned that false
dependencies suck, and that it's extremely rare to actually take advantage of
this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS, which
write the full destination register and have identical performance on all CPUs
that support them.  (Except Silvermont, where roundps/pd have 5c latency vs.
4c for roundss/sd.  Goldmont makes them equal.)

KNL has faster (V)ROUNDPS/D than ROUNDSS/SD, maybe only because of the SSE
encoding?  Agner Fog isn't clear, and doesn't have an entry that would match
vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions, which can make
SSE code front-end bound and reduce the effective size of the out-of-order
window.  (See the intrinsics sketch at the end of this comment.)

Preserving FP exception semantics for packed instead of scalar with a
register source:

* if the upper element(s) of the source is/are known to be 0, we can always do
  this with sqrt, round, and convert: they won't produce any FP exceptions,
  not even inexact.  (But not rsqrt / rcpps, of course.)  This will be the
  case after a scalar load, so if we need the original value in memory *and*
  the result of one of these instructions, we're all set.

* with rounding, the immediate can control masking of precision exceptions,
  but not Invalid, which is always raised by an SNaN source.  If we can rule
  out SNaN in the upper elements of the input, we can use ROUNDPS / ROUNDPD.

roundps/d can't produce a denormal output.  I don't think denormal inputs slow
it down on any CPUs, but it's worth checking for cases where we don't care
about preserving exception semantics and want to use it with
potentially-arbitrary garbage in the high elements.

rsqrtps can't produce a denormal output because sqrt makes the output closer
to 1.0 (reducing the magnitude of the exponent).  (And thus neither can
sqrtps.)

SQRTPS/PD have the same performance as SQRTSS/SD on new CPUs, but old CPUs
that crack 128-bit ops into two 64-bit halves are slower: Pentium III,
Pentium M, and Bobcat (and Jaguar for sqrt).  Also, Silvermont is *MUCH*
slower for SQRTPD/PS than SD/SS, and even Goldmont Plus has slower packed
SQRT, RSQRT, and RCP than scalar.

But RCPPS can produce a denormal output: (double)1.0/FLT_MAX = 2.938736e-39,
which is smaller than FLT_MIN = 1.175494e-38.

----

So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that supports them.

* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and
  on Bobcat, Nano 3000, and P4 Prescott/Nocona.  By about a factor of 2,
  enough that we should probably care about it for tune=generic.  For ss/ps
  only (not double), K10 and Jaguar also have slower sqrtps than sqrtss.
  Also, in 32-bit mode, P4, Pentium M and earlier Intel, and Atom are much
  slower for packed than scalar sqrt.  SQRTPD is *faster* than SQRTSD on KNL.
  (But hopefully we're never tuning for KNL without AVX available.)
* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop,
  so a big decode stall).  Somewhat slower on Goldmont Plus (1 uop but half
  throughput).  Also slower on Nano 3000, and slightly slower on Pentium 4
  (before and after Prescott/Nocona) and on KNL.  (But hopefully KNL can
  always use VRSQRT28PS/PD or scalar.)  Pentium M and older again decode
  packed as at least 2 uops, same as Bobcat and K8.

  Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, Ryzen,
  Core 2 and later, and SnB-family.

* CVTSS2SD vs. CVTPS2PD, and CVTSD2SS vs. CVTPD2PS (see the copy-and-convert
  sketch at the end of this comment):

  packed is slower on K8 and bdver1-4 (scalar avoids the shuffle uop), and on
  Nano 3000 and KNL.  On Silvermont, only by 1 cycle of latency (so even a
  MOVAPS on the critical path would make it equal).  Similar on Atom.  Slower
  on CPUs that do 128-bit vectors as two 64-bit uops, like Bobcat, and
  Pentium M / K8 and older.

  packed is *faster* on K10, Goldmont / Goldmont Plus (same latency, 1c vs. 2c
  throughput), Prescott, and P4.  Much faster on Jaguar (1c vs. 8c throughput,
  and 1 uop vs. 2).

  same speed (but without the false dep) on SnB-family (mostly), Core 2, and
  Ryzen.

Odd stuff Agner reports:

* Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c.  (I guess just
  zero-padding the significand, no rounding required.)  pd2ps and sd2ss are
  equal at 2 uops / 4c latency.

* SnB: cvtpd2ps is 1c higher latency than sd2ss.

* IvB: ps2pd is 1c vs. 2c for ss2sd.

On HSW and later, things have settled down to exactly the same.  I didn't
check instlatx64 or https://uops.info/.

Not sure what to say for float <-> double conversions for tune=generic.
Bulldozer-family is still fairly relevant, and scalar is significantly faster
and fewer uops there.  But otherwise, most of the CPUs where packed is slower
are not relevant for tune=generic.

Again, this was *just* for the xmm,xmm versions, not memory source.
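As a rough illustration of the copy-and-round / copy-and-sqrt idea, here is a
minimal intrinsics sketch.  It is not what GCC currently emits and the
function names are made up for this example; an optimizing compiler may
rewrite these sequences, so treat the comments about asm as the intent, not a
guarantee.  Needs -msse4.1 (or an -march that includes it) for the ROUND
intrinsics.

/* Sketch only: shows why the packed forms avoid the scalar merge.
   ROUNDSD/SQRTSD write DEST[63:0] and keep the old DEST[127:64], so they
   depend on the previous value of the destination register; ROUNDPD/SQRTPD
   write all 128 bits and have no such output dependency. */
#include <immintrin.h>

/* Scalar: roundsd merges into the old upper half of the destination,
   creating a (usually false) dependency on whatever last wrote it. */
double round_scalar(double x)
{
    __m128d v = _mm_set_sd(x);          /* x in element 0, upper element 0.0 */
    v = _mm_round_sd(v, v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm_cvtsd_f64(v);
}

/* Packed copy-and-round: roundpd reads only the source and writes the whole
   destination, so no MOVAPS is needed to preserve the input and there is no
   output dependency.  With the upper element known to be 0.0, the extra lane
   can't raise any FP exception, preserving scalar exception semantics. */
double round_packed(double x)
{
    __m128d v = _mm_set_sd(x);
    v = _mm_round_pd(v, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
    return _mm_cvtsd_f64(v);
}

/* Same idea for sqrt: sqrt(0.0) in the upper lane is exact and raises
   nothing, but packed sqrt is only a win on CPUs where SQRTPD is not slower
   than SQRTSD (not Silvermont / Goldmont Plus etc., per the notes above). */
double sqrt_packed(double x)
{
    __m128d v = _mm_set_sd(x);
    return _mm_cvtsd_f64(_mm_sqrt_pd(v));
}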
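And a similar hypothetical sketch for the float <-> double conversions:
cvtps2pd / cvtpd2ps write the full destination instead of merging, at the cost
of also converting the (zeroed) upper lane.  Again, this only illustrates the
transformation being discussed, not current GCC output.

#include <immintrin.h>

/* Scalar widen: cvtss2sd keeps DEST[127:64], so it has the same output-merge
   dependency problem as ROUNDSD. */
double widen_scalar(float x)
{
    __m128  s = _mm_set_ss(x);                    /* upper elements 0.0f */
    __m128d d = _mm_cvtss_sd(_mm_setzero_pd(), s);
    return _mm_cvtsd_f64(d);
}

/* Packed copy-and-convert: cvtps2pd converts the low two floats and writes
   the whole destination.  The upper lane is 0.0f here, which converts
   exactly, so no spurious FP exceptions are possible. */
double widen_packed(float x)
{
    __m128 s = _mm_set_ss(x);
    return _mm_cvtsd_f64(_mm_cvtps_pd(s));
}

/* Narrowing works the same way with cvtpd2ps instead of cvtsd2ss. */
float narrow_packed(double x)
{
    __m128d d = _mm_set_sd(x);
    return _mm_cvtss_f32(_mm_cvtpd_ps(d));
}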