On Thu, Jul 25, 2019 at 2:21 PM Richard Biener <rguent...@suse.de> wrote:
>
> On Thu, 25 Jul 2019, Martin Jambor wrote:
>
> > Hello,
> >
> > On Tue, Jul 23 2019, Richard Biener wrote:
> > > The following fixes the runtime regression of 456.hmmer caused
> > > by matching ICC in code generation and using cmov more aggressively
> > > (through GIMPLE level MAX_EXPR usage). Apparently (discovered
> > > by manual assembler editing) using the SSE unit for performing
> > > SImode loads, adds and then two signed max operations plus stores
> > > is quite a bit faster than cmovs - even faster than the original
> > > single cmov plus branchy second max. Even more so for AMD CPUs
> > > than Intel CPUs.
> > >
> > > Instead of hacking up some pattern recognition pass to transform
> > > integer mode memory-to-memory computation chains involving
> > > conditional moves to "vector" code (similar to what STV does
> > > for TImode ops on x86_64), the following simply allows SImode
> > > into SSE registers (support for this is already there in some
> > > important places like move patterns!). For the particular
> > > case of 456.hmmer the required support is loads/stores
> > > (already implemented), SImode adds and SImode smax.
> > >
> > > So the patch adds a smax pattern for SImode (we don't have any
> > > for scalar modes but currently expand via a conditional move sequence),
> > > emitting an SSE vector max or cmp/cmov depending on the alternative.
> > >
> > > And it amends the *add<mode>_1 pattern with SSE alternatives
> > > (which have to come before the memory alternative as IRA otherwise
> > > doesn't consider reloading a memory operand to a register).
> > >
> > > With this in place the runtime of 456.hmmer improves by 10%
> > > on Haswell, which is back to the pre-regression speed but not
> > > to the same level as seen when manually editing just the single
> > > important loop.
> > >
> > > I'm currently benchmarking all of SPEC CPU 2006 on Haswell. More
> > > interesting is probably Zen, where moves crossing the
> > > integer - vector domain are excessively expensive (they get
> > > done via the stack).
> >
> > There was a znver2 CPU machine not doing anything useful overnight here,
> > so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top
> > of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning
> > patches, I am afraid). All benchmarks were run only once with options
> > -Ofast -march=native -mtune=native.
> >
> > By far the biggest change was indeed 456.hmmer, which improved by an
> > incredible 35%. There was no other change bigger than +-1.5% in SPEC
> > 2006, so the SPECint score grew by almost 3.4%.
> >
> > I understand this patch fixes a regression in that benchmark, but even
> > so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9,
> > and with the patch it is 20% faster than gcc 9.
> >
> > In SPEC 2017, there were two changes worth mentioning, although they
> > probably need to be confirmed and re-measured on top of the new tuning
> > changes. 525.x264_r regressed by 3.37% and 511.povray_r improved by
> > 3.04%.
>
> Thanks for checking. Meanwhile I figured out how to restore the
> effects of the patch without disabling the GPR alternative in
> smaxsi3. The additional trick I need is to avoid register-class
> preferencing from moves, so *movsi_internal gets a few more *s
> (in the end we'd need to split the r = g alternative because
> for r = C we _do_ want to prefer general regs - unless it's
> a special constant that can be loaded into an SSE reg?
> Or maybe that's not needed and reload costs will take care of that).
>
> I've also needed to pessimize the GPR alternative in smaxsi3,
> because that instruction is supposed to drive the whole decision,
> as it is cheaper when done on SSE regs.
>
> Tunings still wreck things: using -march=bdver2 will
> give you one vpmaxsd and the rest in integer regs, including
> an inter-unit move via the stack.
>
> Still, this is the best I can get to with my limited .md / LRA
> skills.
>
> Is avoiding register-class preferencing from moves good? I think
> it makes sense at least.
>
> How would one write smaxsi3 as a splitter to be split after
> reload in the case LRA assigned the GPR alternative? Is it
> even worth doing? Even the SSE reg alternative can be split
> to remove the unneeded CC clobber.
>
> Finally, I'm unsure about the add, where I needed to place
> the SSE alternative before the 2nd-operand memory one, since it
> otherwise gets the same cost and wins.
>
> So - how to go forward with this?

Sorry to come a bit late to the discussion. We have been aware of the
CMOV issue for quite some time, but the issue is not yet understood in
detail (I was hoping for Intel people to look at this). However, you
demonstrated that using PMAX and PMIN instead of scalar CMOV can bring
us big gains, and this thread now deals with how to best implement
PMAX/PMIN for scalar code.
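
To make that concrete, the inner loop in question looks roughly like
the sketch below (simplified, with made-up array names - not the
actual 456.hmmer source): SImode loads and adds feed two signed max
operations and a store, which GIMPLE now represents as MAX_EXPRs and
the backend currently expands via cmov.

void
sketch (int n, int *mc, const int *mpp, const int *tpmm,
        const int *ip, const int *tpim, const int *ms)
{
  for (int k = 1; k < n; k++)
    {
      /* Two candidate scores, each an SImode load plus add.  */
      int sc1 = mpp[k - 1] + tpmm[k - 1];
      int sc2 = ip[k - 1] + tpim[k - 1];
      /* Two signed max operations; today they expand to compare-and-cmov,
         with the proposed patch they can stay in the SSE unit as pmaxsd
         when the values end up in vector registers.  */
      int sc = sc1 > sc2 ? sc1 : sc2;
      if (sc < ms[k])
        sc = ms[k];
      mc[k] = sc;
    }
}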
I think that the way to go forward is with the STV infrastructure.
Currently, the implementation only deals with DImode on SSE2 32-bit
targets, but I see no issue with using the STV pass also for SImode
(on 32-bit and 64-bit targets). There are actually two STV passes:
the first one (currently run on 64-bit targets) runs before cse2, and
the second (which currently runs on 32-bit SSE2 only) runs after
combine and before the split1 pass. The second pass is the
interesting one for us.

The basic idea of the second STV pass (for 32-bit targets!) is that we
introduce DImode _doubleword instructions that otherwise do not exist
with integer registers. The passes up to and including combine can
then use these instructions to simplify and optimize the insn flow.
Later, based on cost analysis, the STV pass either converts the
_doubleword instructions to real vector ones (e.g. V2DImode patterns)
or leaves them intact, and a follow-up split pass splits them into
pairs of scalar SImode instructions. The STV pass also takes care of
moving and preloading values from their scalar form to a vector
representation (using SUBREGs). Please note that all of this happens
on pseudos; the register allocator will later simply use scalar
(integer) registers in scalar patterns and vector registers in vector
insn patterns.

Your approach of amending existing scalar SImode patterns with vector
registers will introduce no end of problems. The register allocator
will do funny things under register pressure, where values will take a
trip through a vector register before being stored to memory (and vice
versa - you already found some of them). The current RA simply can't
distinguish clearly between the two register sets.

So, my advice would be to use the STV pass also for SImode values, on
64-bit and 32-bit targets. On both targets we will be able to use
instructions that operate on the vector register set, and on 32-bit
targets (and to some extent on 64-bit targets) we would perhaps be
able to relax register pressure in a controlled way.

To demonstrate the benefits of the existing STV pass, it should be
relatively easy to introduce a 64-bit max/min pattern on 32-bit
targets to handle 64-bit values. For 32-bit values, the pass should
be re-run to convert SImode scalar operations to vector operations in
a controlled way, based on various cost functions.
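
As a concrete starting point, such a DImode pattern would let the
32-bit STV pass pick up scalar code like the following (a hypothetical
example, not an existing testcase):

/* On a 32-bit target a 64-bit signed max is currently done with a
   compare and conditional-move/branch sequence over the two 32-bit
   register halves; with a DImode smax pattern kept up to the STV
   pass, the pass could convert it to a vector operation when its
   cost analysis says so.  */
long long
smax64 (long long a, long long b)
{
  return a > b ? a : b;
}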
Uros.

> Thanks,
> Richard.
>
> Index: gcc/config/i386/i386.md
> ===================================================================
> --- gcc/config/i386/i386.md	(revision 273792)
> +++ gcc/config/i386/i386.md	(working copy)
> @@ -1881,6 +1881,33 @@ (define_expand "mov<mode>"
>    ""
>    "ix86_expand_move (<MODE>mode, operands); DONE;")
>
> +(define_insn "smaxsi3"
> +  [(set (match_operand:SI 0 "register_operand" "=?r,v,x")
> +        (smax:SI (match_operand:SI 1 "register_operand" "%0,v,0")
> +                 (match_operand:SI 2 "register_operand" "r,v,x")))
> +   (clobber (reg:CC FLAGS_REG))]
> +  ""
> +{
> +  switch (get_attr_type (insn))
> +    {
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        return "vpmaxsd\t{%2, %1, %0|%0, %1, %2}";
> +      else
> +        return "pmaxsd\t{%2, %0|%0, %2}";
> +    case TYPE_ICMOV:
> +      /* ??? Instead split this after reload?  */
> +      return "cmpl\t{%2, %0|%0, %2}\n"
> +             "\tcmovl\t{%2, %0|%0, %2}";
> +    default:
> +      gcc_unreachable ();
> +    }
> +}
> +  [(set_attr "isa" "*,avx,sse4_noavx")
> +   (set_attr "prefix" "orig,vex,orig")
> +   (set_attr "memory" "none")
> +   (set_attr "type" "icmov,sseadd,sseadd")])
> +
>  (define_insn "*mov<mode>_xor"
>    [(set (match_operand:SWI48 0 "register_operand" "=r")
>          (match_operand:SWI48 1 "const0_operand"))
> @@ -2342,9 +2369,9 @@ (define_peephole2
>
>  (define_insn "*movsi_internal"
>    [(set (match_operand:SI 0 "nonimmediate_operand"
> -    "=r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
> +    "=*r,m ,*y,*y,?*y,?m,?r,?*y,*v,*v,*v,m ,?r,?*v,*k,*k ,*rm,*k")
>          (match_operand:SI 1 "general_operand"
> -    "g ,re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))]
> +    "g ,*re,C ,*y,m ,*y,*y,r ,C ,*v,m ,*v,*v,r ,*r,*km,*k ,CBC"))]
>    "!(MEM_P (operands[0]) && MEM_P (operands[1]))"
>  {
>    switch (get_attr_type (insn))
> @@ -5368,10 +5395,10 @@ (define_insn_and_split "*add<dwi>3_doubl
>  })
>
>  (define_insn "*add<mode>_1"
> -  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,r,r,r")
> +  [(set (match_operand:SWI48 0 "nonimmediate_operand" "=rm,v,x,r,r,r")
>          (plus:SWI48
> -          (match_operand:SWI48 1 "nonimmediate_operand" "%0,0,r,r")
> -          (match_operand:SWI48 2 "x86_64_general_operand" "re,m,0,le")))
> +          (match_operand:SWI48 1 "nonimmediate_operand" "%0,v,0,0,r,r")
> +          (match_operand:SWI48 2 "x86_64_general_operand" "re,v,x,*m,0,le")))
>     (clobber (reg:CC FLAGS_REG))]
>    "ix86_binary_operator_ok (PLUS, <MODE>mode, operands)"
>  {
> @@ -5390,10 +5417,23 @@ (define_insn "*add<mode>_1"
>            return "dec{<imodesuffix>}\t%0";
>          }
>
> +    case TYPE_SSEADD:
> +      if (which_alternative == 1)
> +        {
> +          if (<MODE>mode == SImode)
> +            return "%vpaddd\t{%2, %1, %0|%0, %1, %2}";
> +          else
> +            return "%vpaddq\t{%2, %1, %0|%0, %1, %2}";
> +        }
> +      else if (<MODE>mode == SImode)
> +        return "paddd\t{%2, %0|%0, %2}";
> +      else
> +        return "paddq\t{%2, %0|%0, %2}";
> +
>      default:
>        /* For most processors, ADD is faster than LEA.  This alternative
>           was added to use ADD as much as possible.  */
> -      if (which_alternative == 2)
> +      if (which_alternative == 4)
>          std::swap (operands[1], operands[2]);
>
>        gcc_assert (rtx_equal_p (operands[0], operands[1]));
> @@ -5403,9 +5443,14 @@ (define_insn "*add<mode>_1"
>        return "add{<imodesuffix>}\t{%2, %0|%0, %2}";
>      }
>  }
> -  [(set (attr "type")
> -     (cond [(eq_attr "alternative" "3")
> +  [(set_attr "isa" "*,avx,sse2,*,*,*")
> +   (set (attr "type")
> +     (cond [(eq_attr "alternative" "5")
>               (const_string "lea")
> +            (eq_attr "alternative" "1")
> +              (const_string "sseadd")
> +            (eq_attr "alternative" "2")
> +              (const_string "sseadd")
>             (match_operand:SWI48 2 "incdec_operand")
>               (const_string "incdec")
>            ]
> Index: gcc/config/i386/i386.c
> ===================================================================
> --- gcc/config/i386/i386.c	(revision 273792)
> +++ gcc/config/i386/i386.c	(working copy)
> @@ -14616,6 +14616,9 @@ ix86_lea_for_add_ok (rtx_insn *insn, rtx
>    unsigned int regno1 = true_regnum (operands[1]);
>    unsigned int regno2 = true_regnum (operands[2]);
>
> +  if (SSE_REGNO_P (regno1))
> +    return false;
> +
>    /* If a = b + c, (a!=b && a!=c), must use lea form.  */
>    if (regno0 != regno1 && regno0 != regno2)
>      return true;
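
For reference, the minimal scalar source that the smaxsi3 pattern
above matches is simply the following (a made-up example, not part of
the patch or the testsuite):

/* The GIMPLE passes turn the conditional into a MAX_EXPR; without the
   patch it is expanded through a compare-and-cmov sequence, with the
   patch the register allocator can instead pick the SSE alternative
   and emit pmaxsd.  */
int
smax32 (int a, int b)
{
  return a > b ? a : b;
}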