https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97127
--- Comment #11 from Hongtao.liu <crazylht at gmail dot com> ---
(In reply to Michael_S from comment #10)
> (In reply to Hongtao.liu from comment #9)
> > (In reply to Michael_S from comment #8)
> > > What are values of gcc "loop" cost of the relevant instructions now?
> > > 1. AVX256 Load
> > > 2. FMA3 ymm,ymm,ymm
> > > 3. AVX256 Regmove
> > > 4. FMA3 mem,ymm,ymm
> >
> > For skylake, outside of register allocation.
> >
> > they are
> > 1. AVX256 Load ---- 10
> > 2. FMA3 ymm,ymm,ymm --- 16
> > 3. AVX256 Regmove --- 2
> > 4. FMA3 mem,ymm,ymm --- 32
> >
> > In RA, no direct cost for fma instructions, but we can disparage memory
> > alternative in FMA instructions, but again, it may hurt performance in some
> > cases.
> >
> > 1. AVX256 Load ---- 10
> > 3. AVX256 Regmove --- 2
> >
> > BTW: we have done a lot of experiments with different cost models and no
> > significant performance impact on SPEC2017.
>
> Thank you.
> With relative costs like these gcc should generate 'FMA3 mem,ymm,ymm' only
> in conditions of heavy registers pressure. So, why it generates it in my
> loop, where registers pressure in the innermost loop is light and even in
> the next outer level the pressure isn't heavy?
> What am I missing?

The actual transformation gcc did is:

vmovuxx (mem1), %ymmA            pass_combine
vmovuxx (mem), %ymmD               ---->      vmovuxx (mem1), %ymmA
vfmadd213 %ymmD,%ymmC,%ymmA                   vfmadd213 (mem),%ymmC,%ymmA

Then RA works like:

                                    RA
vmovuxx (mem1), %ymmA              ---->      vmovaps %ymmB, %ymmA
vfmadd213 (mem),%ymmC,%ymmA                   vfmadd213 (mem),%ymmC,%ymmA

It looks like, but is actually not, this one:

vmovuxx (mem), %ymmA
vfnmadd231xx %ymmB, %ymmC, %ymmA

transformed to

vmovaxx %ymmB, %ymmA
vfnmadd213xx (mem), %ymmC, %ymmA

ymmB is allocated for (mem1), not (mem).
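
For readers unfamiliar with the 213/231 operand orders, here is a minimal scalar sketch in C of the last pair of sequences above (the one that the real transformation only resembles). It models a single vector lane and relies only on the documented FMA3 semantics: the 231 form computes dest = -(src2 * src3) + dest, and the 213 form computes dest = -(src2 * dest) + src3. The function names are hypothetical, arguments are in Intel operand order (the AT&T listings above are reversed), and this is an illustration of why the two sequences compute the same value, not gcc's code.

```c
/* Scalar model of one vector lane. All names here are hypothetical. */

/* vfnmadd231: dest = -(src2 * src3) + dest */
static double fnmadd231(double dest, double src2, double src3)
{
    return -(src2 * src3) + dest;
}

/* vfnmadd213: dest = -(src2 * dest) + src3 */
static double fnmadd213(double dest, double src2, double src3)
{
    return -(src2 * dest) + src3;
}

/* vmovuxx (mem), %ymmA ; vfnmadd231xx %ymmB, %ymmC, %ymmA */
double load_then_231(double mem, double b, double c)
{
    double a = mem;              /* load the memory operand into A */
    return fnmadd231(a, c, b);   /* A = -(C * B) + A */
}

/* vmovaxx %ymmB, %ymmA ; vfnmadd213xx (mem), %ymmC, %ymmA */
double move_then_213(double mem, double b, double c)
{
    double a = b;                /* register-to-register copy of B into A */
    return fnmadd213(a, c, mem); /* A = -(C * A) + mem */
}
```

Both functions reduce to mem - b * c, so this pair differs only in trading a load for a register move plus a memory-operand FMA. In the case gcc actually handled, ymmB holds (mem1) rather than (mem), which is why the resemblance is only superficial.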