On Thu, Jul 25, 2019 at 2:21 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Thu, 25 Jul 2019, Martin Jambor wrote:
>
> > Hello,
> >
> > On Tue, Jul 23 2019, Richard Biener wrote:
> > > The following fixes the runtime regression of 456.hmmer caused
> > > by matching ICC in code generation and using cmov more aggressively
> > > (through GIMPLE level MAX_EXPR usage). Apparently (discovered
> > > by manual assembler editing) using the SSE unit for performing
> > > SImode loads, adds and then two signed max operations plus stores
> > > is quite a bit faster than cmovs - even faster than the original
> > > single cmov plus branchy second max. Even more so for AMD CPUs
> > > than Intel CPUs.
> > >
> > > Instead of hacking up some pattern recognition pass to transform
> > > integer mode memory-to-memory computation chains involving
> > > conditional moves to "vector" code (similar to what STV does
> > > for TImode ops on x86_64), the following simply allows SImode
> > > into SSE registers (support for this is already there in some
> > > important places like move patterns!). For the particular
> > > case of 456.hmmer the required support is loads/stores
> > > (already implemented), SImode adds and SImode smax.
> > >
> > > So the patch adds a smax pattern for SImode (we don't have any
> > > for scalar modes but currently expand via a conditional move sequence),
> > > emitting an SSE vector max or cmp/cmov depending on the alternative.
> > >
> > > And it amends the *add<mode>_1 pattern with SSE alternatives
> > > (which have to come before the memory alternative as IRA otherwise
> > > doesn't consider reloading a memory operand to a register).
> > >
> > > With this in place the runtime of 456.hmmer improves by 10%
> > > on Haswell, which is back to the pre-regression speed but not
> > > to the same level as seen when manually editing just the single
> > > important loop.
> > >
> > > I'm currently benchmarking all of SPEC CPU 2006 on Haswell. More
> > > interesting is probably Zen, where moves crossing the
> > > integer - vector domain are excessively expensive (they get
> > > done via the stack).
> >
> > There was a znver2 CPU machine not doing anything useful overnight here,
> > so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top
> > of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning
> > patches, I am afraid). All benchmarks were run only once with options
> > -Ofast -march=native -mtune=native.
> >
> > By far the biggest change was indeed 456.hmmer, which improved by an
> > incredible 35%. There was no other change bigger than +-1.5% in SPEC
> > 2006, so the SPECint score grew by almost 3.4%.
> >
> > I understand this patch fixes a regression in that benchmark, but even
> > so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9,
> > and with the patch it is 20% faster than gcc 9.
> >
> > In SPEC 2017, there were two changes worth mentioning, although they
> > probably need to be confirmed and re-measured on top of the new tuning
> > changes. 525.x264_r regressed by 3.37% and 511.povray_r improved by
> > 3.04%.
>
> Thanks for checking. Meanwhile I figured out how to restore the
> effects of the patch without disabling the GPR alternative in
> smaxsi3. The additional trick I need is to avoid register-class
> preferencing from moves, so *movsi_internal gets a few more *s
> (in the end we'd need to split the r = g alternative because
> for r = C we _do_ want to prefer general regs - unless it's
> a special constant that can be loaded into an SSE reg?
> Or maybe that's not needed and reload costs will take care of that).
>
> I've also needed to pessimize the GPR alternative in smaxsi3,
> because that instruction is supposed to drive the whole decision,
> as it is cheaper when done on SSE regs.
>
> Tunings still wreck things: using -march=bdver2 will
> give you one vpmaxsd and the rest in integer regs, including
> an inter-unit move via the stack.
>
> Still, this is the best I can get to with my limited .md / LRA
> skills.
>
> Is avoiding register-class preferencing from moves good? I think
> it makes sense at least.
>
> How would one write smaxsi3 as a splitter to be split after
> reload in the case LRA assigned the GPR alternative? Is it
> even worth doing? Even the SSE reg alternative can be split
> to remove the unneeded CC clobber.
>
> Finally, I'm unsure about the add, where I needed to place
> the SSE alternative before the 2nd-operand memory one, since it
> otherwise gets the same cost and wins.
>
> So - how to go forward with this?

Sorry to come a bit late to the discussion. We have been aware of the
CMOV issue for quite some time, but the issue is not yet understood in
detail (I was hoping for Intel people to look at this). However, you
demonstrated that using PMAX and PMIN instead of scalar CMOV can bring
us big gains, and this thread now deals with how to best implement
PMAX/PMIN for scalar code.
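
To make that concrete, the inner loop in question looks roughly like
the sketch below (simplified, with made-up array names - not the
actual 456.hmmer source): SImode loads and adds feed two signed max
operations and a store, which GIMPLE now represents as MAX_EXPRs and
the backend currently expands via cmov.

void
sketch (int n, int *mc, const int *mpp, const int *tpmm,
        const int *ip, const int *tpim, const int *ms)
{
  for (int k = 1; k < n; k++)
    {
      /* Two candidate scores, each an SImode load plus add.  */
      int sc1 = mpp[k - 1] + tpmm[k - 1];
      int sc2 = ip[k - 1] + tpim[k - 1];
      /* Two signed max operations; today they expand to compare-and-cmov,
         with the proposed patch they can stay in the SSE unit as pmaxsd
         when the values end up in vector registers.  */
      int sc = sc1 > sc2 ? sc1 : sc2;
      if (sc < ms[k])
        sc = ms[k];
      mc[k] = sc;
    }
}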
I think that the way to go forward is with the STV infrastructure.
Currently, the implementation only deals with DImode on SSE2 32-bit
targets, but I see no issue with using the STV pass also for SImode
(on 32-bit and 64-bit targets). There are actually two STV passes:
the first one (currently run on 64-bit targets) runs before cse2, and
the second (which currently runs on 32-bit SSE2 only) runs after
combine and before the split1 pass. The second pass is the
interesting one for us.

The basic idea of the second STV pass (for 32-bit targets!) is that we
introduce DImode _doubleword instructions that otherwise do not exist
with integer registers. The passes up to and including combine can
then use these instructions to simplify and optimize the insn flow.
Later, based on cost analysis, the STV pass either converts the
_doubleword instructions to real vector ones (e.g. V2DImode patterns)
or leaves them intact, and a follow-up split pass splits them into
pairs of scalar SImode instructions. The STV pass also takes care of
moving and preloading values from their scalar form to a vector
representation (using SUBREGs). Please note that all of this happens
on pseudos; the register allocator will later simply use scalar
(integer) registers in scalar patterns and vector registers in vector
insn patterns.

Your approach of amending existing scalar SImode patterns with vector
registers will introduce no end of problems. The register allocator
will do funny things under register pressure, where values will take a
trip through a vector register before being stored to memory (and vice
versa - you already found some of them). The current RA simply can't
distinguish clearly between the two register sets.

So, my advice would be to use the STV pass also for SImode values, on
64-bit and 32-bit targets. On both targets we will be able to use
instructions that operate on the vector register set, and on 32-bit
targets (and to some extent on 64-bit targets) we would perhaps be
able to relax register pressure in a controlled way.

To demonstrate the benefits of the existing STV pass, it should be
relatively easy to introduce a 64-bit max/min pattern on 32-bit
targets to handle 64-bit values. For 32-bit values, the pass should
be re-run to convert SImode scalar operations to vector operations in
a controlled way, based on various cost functions.
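
As a concrete starting point, such a DImode pattern would let the
32-bit STV pass pick up scalar code like the following (a hypothetical
example, not an existing testcase):

/* On a 32-bit target a 64-bit signed max is currently done with a
   compare and conditional-move/branch sequence over the two 32-bit
   register halves; with a DImode smax pattern kept up to the STV
   pass, the pass could convert it to a vector operation when its
   cost analysis says so.  */
long long
smax64 (long long a, long long b)
{
  return a > b ? a : b;
}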
Uros.

> Thanks,
> Richard.
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md	(revision 273792)
> +++ gcc/config/i386/i386.md	(working copy)
> @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>"
>    ""
>    "ix86_expand_move (<MODE>mode, operands); DONE;")
>
> +(define_insn "smaxsi3"
> +  [(set (match_operand:SI 0 "register_operand" "=?r,v,x")
> +        (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
> +                 (match_operand:SI 2 "register_operand" "r,v,x")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  ""
> +{
> +  switch (get_attr_type (insn))
> +    {
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
> +      else
> +        return "pmaxsd\t{%2, %0|%0, %2}";
> +    case TYPE_ICMOV:
> +      /* ??? Instead split this after reload?  */
> +      return "cmpl\t{%2, %0|%0, %2}\n"
> +             "\tcmovl\t{%2, %0|%0, %2}";
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +  [(set_attr "isa" "*,avx,sse4_noavx")
> +   (set_attr "prefix" "orig,vex,orig")
> +   (set_attr "memory" "none")
> +   (set_attr "type" "icmov,sseadd,sseadd")])
> +
>  (define_insn "*mov<mode>_xor"
>    [(set (match_operand:SWI48 0 "register_operand" "=r")
>          (match_operand:SWI48 1 "const0_operand"))
> @@ -2342,9 +2369,9 @@ (define_peephole2
>
>  (define_insn "*movsi_internal"
>    [(set (match_operand:SI 0 "nonimmediate_operand"
> -    "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
> +    "=*r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
>          (match_operand:SI 1 "general_operand"
> -    "g ,re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))]
> +    "g ,*re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))]
>    "!(MEM_P (operands[0]) && MEM_P (operands[1]))"
>  {
>    switch (get_attr_type (insn))
> @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl
>  })
>
>  (define_insn "*add<mode>_1"
> -  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
> +  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
>          (plus:SWI48
> -          (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
> -          (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
> +          (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
> +          (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
>     (clobber (reg:CC FLAGS_REG))]
>    "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
>  {
> @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1"
>            return "dec{<imodesuffix>}\t%0";
>          }
>
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        {
> +          if (<MODE>mode == SImode)
> +            return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
> +          else
> +            return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
> +        }
> +      else if (<MODE>mode == SImode)
> +        return "paddd\t{%2, %0|%0, %2}";
> +      else
> +        return "paddq\t{%2, %0|%0, %2}";
> +
>      default:
>        /* For most processors, ADD is faster than LEA.  This alternative
>           was added to use ADD as much as possible.  */
> -      if (which_alternative == 2)
> +      if (which_alternative == 4)
>          std::swap (operands[1], operands[2]);
>
>        gcc_assert (rtx_equal_p (operands[0], operands[1]));
> @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1"
>        return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
>      }
>  }
> -  [(set (attr "type")
> -     (cond [(eq_attr "alternative" "3")
> +  [(set_attr "isa" "*,avx,sse2,*,*,*")
> +   (set (attr "type")
> +     (cond [(eq_attr "alternative" "5")
>               (const_string "lea")
> +            (eq_attr "alternative" "1")
> +              (const_string "sseadd")
> +            (eq_attr "alternative" "2")
> +              (const_string "sseadd")
>             (match_operand:SWI48 2 "incdec_operand")
>               (const_string "incdec")
>            ]
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 273792)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
>    unsigned int regno1 = true_regnum (operands[1]);
>    unsigned int regno2 = true_regnum (operands[2]);
>
> +  if (SSE_REGNO_P (regno1))
> +    return false;
> +
>    /* If a = b + c, (a!=b && a!=c), must use lea form.  */
>    if (regno0 != regno1 && regno0 != regno2)
>      return true;
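
For reference, the minimal scalar source that the smaxsi3 pattern
above matches is simply the following (a made-up example, not part of
the patch or the testsuite):

/* The GIMPLE passes turn the conditional into a MAX_EXPR; without the
   patch it is expanded through a compare-and-cmov sequence, with the
   patch the register allocator can instead pick the SSE alternative
   and emit pmaxsd.  */
int
smax32 (int a, int b)
{
  return a > b ? a : b;
}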