https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071

--- Comment #16 from Uroš Bizjak <ubizjak at gmail dot com> ---
(In reply to Peter Cordes from comment #15)
> (In reply to Uroš Bizjak from comment #13)
> > I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> > and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> > currently don't emit XOR clear in front of these instructions, when they
> > operate with memory input.
> 
> They *do* have an output dependency.  It might or might not actually be a
> problem and be worth clogging the front-end with extra uops to avoid,
> depending on surrounding code. >.<

OK, I'll proceed with the patch from Comment #14 then.
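
For readers following along, the output dependency mentioned in the quoted comment can be shown at the semantic level with SSE2 intrinsics. This is only a sketch: the helper names are made up for illustration, and the compiler remains free to pick whatever instruction sequence it likes for these intrinsics, so it shows what the scalar and packed forms compute, not what GCC emits.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; x86 targets only */

/* Scalar form (CVTSS2SD semantics): only the low lane is written.  The
   upper lane is copied from the first operand, so the old destination
   value is an input to the operation.  Helper name is hypothetical.  */
static double scalar_convert_upper (double old_upper, float x)
{
  __m128d dst = _mm_set_pd (old_upper, 0.0);
  __m128d r = _mm_cvtss_sd (dst, _mm_set_ss (x));
  return _mm_cvtsd_f64 (_mm_unpackhi_pd (r, r));  /* old upper lane */
}

/* Scalar form with a zeroed destination, which models the XOR clear
   placed in front of the conversion.  */
static double scalar_convert_low (float x)
{
  __m128d r = _mm_cvtss_sd (_mm_setzero_pd (), _mm_set_ss (x));
  return _mm_cvtsd_f64 (r);
}

/* Packed form (CVTPS2PD semantics): both result lanes are written, so
   nothing from a previous register value is merged in.  */
static double packed_convert_low (float x)
{
  __m128d r = _mm_cvtps_pd (_mm_set_ss (x));
  return _mm_cvtsd_f64 (r);
}
```

Both forms give the same low-lane result; the difference is only that the scalar form carries the old upper lane through, which is where the dependency on the previous register contents comes from.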

> * CVTSS2SD vs. PD, and SD2SS vs. PD2PS
>   packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop),
> Nano3000, KNL.  On Silvermont by just 1 cycle latency (so even a MOVAPS on
> the critical path would make it equal.)  Similar on Atom.  Slower on CPUs
> that do 128-bit vectors as two 64-bit uops, like Bobcat, and Pentium M / K8
> and older.
> 
>   packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c
> throughput), Prescott, P4.  Much faster on Jaguar (1c vs. 8c throughput, and
> 1 uop vs. 2).

We do have infrastructure to convert scalar conversions to packed:

/* X86_TUNE_USE_VECTOR_FP_CONVERTS: Prefer vector packed SSE conversion
   from FP to FP.  This form of instructions avoids partial write to the
   destination.  */
DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, "use_vector_fp_converts",
          m_AMDFAM10)

/* X86_TUNE_USE_VECTOR_CONVERTS: Prefer vector packed SSE conversion
   from integer to FP. */
DEF_TUNE (X86_TUNE_USE_VECTOR_CONVERTS, "use_vector_converts", m_AMDFAM10)

As the tunes above show, they are currently enabled for AMDFAM10, so it is
just a matter of selecting the relevant tune for the selected target.
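
To illustrate what "selecting the relevant tune" amounts to, here is a simplified, standalone model of a processor-mask selector like the one in the DEF_TUNE entries above. The cut-down enum, the `PROC_*` names, the mask values and the `tune_enabled` helper are assumptions made for this sketch and are not GCC's actual definitions; only the idea of OR-ing processor masks into the selector is taken from the entries quoted above.

```c
#include <stdbool.h>

/* Hypothetical, cut-down processor list; the real one is much longer.  */
enum processor
{
  PROC_GENERIC,
  PROC_K8,
  PROC_AMDFAM10,
  PROC_BTVER2,
  PROC_GOLDMONT
};

/* One bit per processor, in the style of the m_* masks.  */
#define m_K8        (1u << PROC_K8)
#define m_AMDFAM10  (1u << PROC_AMDFAM10)
#define m_BTVER2    (1u << PROC_BTVER2)
#define m_GOLDMONT  (1u << PROC_GOLDMONT)

/* Models the selector of
   DEF_TUNE (X86_TUNE_USE_VECTOR_FP_CONVERTS, ..., m_AMDFAM10).  */
static const unsigned use_vector_fp_converts_mask = m_AMDFAM10;

/* A tune is active when the bit of the processor being tuned for is set
   in the selector.  */
static bool tune_enabled (unsigned selector, enum processor tune)
{
  return (selector & (1u << tune)) != 0;
}
```

In this model, enabling the tune for another target is a matter of OR-ing its mask into the selector, e.g. `m_AMDFAM10 | m_BTVER2`. Which targets should actually be added is a separate question that the timings quoted above would have to answer.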
