https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127

--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Michael_S from comment #8)
> > > What are values of gcc "loop" cost of the relevant instructions now?
> > > 1. AVX256 Load
> > > 2. FMA3 ymm,ymm,ymm
> > > 3. AVX256 Regmove
> > > 4. FMA3 mem,ymm,ymm
> > 
> > For skylake, outside of register allocation.
> > 
> > they are
> > 1. AVX256 Load  ---- 10
> > 2. FMA3 ymm,ymm,ymm --- 16
> > 3. AVX256 Regmove  --- 2
> > 4. FMA3 mem,ymm,ymm --- 32
> > 
> > In RA, no direct cost for fma instructions, but we can disparage memory
> > alternative in FMA instructions, but again, it may hurt performance in some
> > cases.
> > 
> > 1. AVX256 Load  ---- 10
> > 3. AVX256 Regmove  --- 2
> > 
> > BTW: we have done a lot of experiments with different cost models and no
> > significant performance impact on SPEC2017.
> 
> Thank you.
> With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> in conditions of heavy registers pressure. So, why it generates it in my
> loop, where registers pressure in the innermost loop is light and even in
> the next outer level the pressure isn't heavy?
> What am I missing?

The actual transformation gcc did is:

vmovuxx (mem1), %ymmA     pass_combine     
vmovuxx (mem), %ymmD         ---->     vmovuxx   (mem1), %ymmA
vfmadd213 %ymmD,%ymmC,%ymmA            vfmadd213 (mem),%ymmC,%ymmA

Then RA works like this:
                            RA
vmovuxx (mem1), %ymmA      ---->  vmovaps %ymmB, %ymmA
vfmadd213 (mem),%ymmC,%ymmA       vfmadd213 (mem),%ymmC,%ymmA

The result "looks like" the following transformation, but it is actually not this one:

 vmovuxx      (mem), %ymmA
 vfnmadd231xx %ymmB, %ymmC, %ymmA
transformed to
 vmovaxx      %ymmB, %ymmA
 vfnmadd213xx (mem), %ymmC, %ymmA

ymmB is allocated for (mem1), not for (mem).
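
To make the operand roles in the "look-alike" pair concrete, here is a rough scalar sketch of one lane of each sequence. It is only an illustration, not anything taken from GCC: the helper names are hypothetical, plain float arithmetic stands in for the fused operation, and operands are listed in the same AT&T order as the listings above (src3, src2, dst). Both sequences compute mem - ymmB*ymmC, which is why the generated code can be mistaken for "load replaced by regmove plus folded memory operand".

```c
#include <assert.h>

/* One-lane scalar stand-ins for the FMA3 forms (hypothetical helpers).
   AT&T operand order: src3, src2, dst. */

/* vfnmadd231 src3, src2, dst : dst = -(src2 * src3) + dst */
static float fnmadd231(float src3, float src2, float dst)
{
    return -(src2 * src3) + dst;
}

/* vfnmadd213 src3, src2, dst : dst = -(src2 * dst) + src3 */
static float fnmadd213(float src3, float src2, float dst)
{
    return -(src2 * dst) + src3;
}

/* vmovuxx (mem), %ymmA ; vfnmadd231xx %ymmB, %ymmC, %ymmA */
float load_then_fnmadd231(float mem, float b, float c)
{
    float a = mem;              /* load from memory into ymmA */
    a = fnmadd231(b, c, a);     /* a = mem - c*b */
    return a;
}

/* vmovaxx %ymmB, %ymmA ; vfnmadd213xx (mem), %ymmC, %ymmA */
float regmove_then_fnmadd213(float mem, float b, float c)
{
    float a = b;                /* register move ymmB -> ymmA */
    a = fnmadd213(mem, c, a);   /* a = mem - c*b, memory operand folded */
    return a;
}

/* Small self-check with exactly representable values. */
void self_check(void)
{
    assert(load_then_fnmadd231(10.0f, 2.0f, 3.0f) == 4.0f);
    assert(regmove_then_fnmadd213(10.0f, 2.0f, 3.0f) == 4.0f);
}
```

In the real case the register copied by the move holds the value of (mem1), and the folded memory operand comes from pass_combine, so the sketch only models the pattern it resembles.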
