https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #16 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Peter Cordes from comment #15)
> (In reply to Uroš Bizjak from comment #13)
> > I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> > and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> > currently don't emit XOR clear in front of these instructions, when they
> > operate with memory input.
>
> They *do* have an output dependency.  It might or might not actually be a
> problem and be worth clogging the front-end with extra uops to avoid,
> depending on surrounding code. >.<

OK, I'll proceed with the patch from Comment #14 then.

> * CVTSS2SD vs. PD, and SD2SS vs. PD2PS
>   packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop),
> Nano3000, KNL.  On Silvermont by just 1 cycle latency (so even a MOVAPS on
> the critical path would make it equal.)  Similar on Atom.  Slower on CPUs
> that do 128-bit vectors as two 64-bit uops, like Bobcat, and Pentium M / K8
> and older.
>
>   packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
> throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput, and
> 1 uop vs. 2).

We do have infrastructure to convert scalar conversions to packed:

/* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
   from FP to FP.  This form of instructions avoids partial write to the
   destination.  */
DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
          m_AMDFAM10)

/* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
   from integer to FP.  */
DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts",
          m_AMDFAM10)

As can be seen from the tunes above, they are currently enabled only for
AMDFAM10; it is just a matter of selecting the relevant tune for the
selected target.
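
For illustration only, the difference between the scalar and packed FP-to-FP conversions discussed above can be sketched with SSE2 intrinsics. This is not code from the GCC sources or from the patch in comment #14; the function names cvt_scalar and cvt_packed are hypothetical, and the sketch assumes an x86 target with SSE2. The intrinsics only hint at instruction selection, so the compiler may still emit different instructions than the comments suggest.

```c
#include <emmintrin.h>  /* SSE2 intrinsics (x86 only) */

/* Scalar form: CVTSS2SD writes only the low 64 bits of its destination
   and merges the upper half from the old value, hence the output
   dependency.  Supplying a zeroed vector as the merge operand is the
   intrinsic-level analogue of the XOR clear emitted in front of the
   instruction.  */
double cvt_scalar(float x)
{
    __m128d dst = _mm_setzero_pd();   /* dependency-breaking zero */
    __m128  src = _mm_set_ss(x);      /* x in lane 0, other lanes zero */
    return _mm_cvtsd_f64(_mm_cvtss_sd(dst, src));
}

/* Packed form: CVTPS2PD converts the two low floats and writes the whole
   destination, so there is no partial write to break.  Only lane 0 of
   the result is used here.  */
double cvt_packed(float x)
{
    __m128 src = _mm_set_ss(x);       /* upper lanes are zero, so no
                                         garbage is converted */
    return _mm_cvtsd_f64(_mm_cvtps_pd(src));
}
```

Both functions return the same value for a given input; which one is cheaper depends on the microarchitecture, as the per-CPU list quoted above indicates.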