Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Martin Jambor Thu, 25 Jul 2019 02:14:08 -0700

Hello,

On Tue, Jul 23 2019, Richard Biener wrote:
> The following fixes the runtime regression of 456.hmmer caused
> by matching ICC in code generation and using cmov more aggressively
> (through GIMPLE level MAX_EXPR usage).  Appearantly (discovered
> by manual assembler editing) using the SSE unit for performing
> SImode loads, adds and then two singed max operations plus stores
> is quite a bit faster than cmovs - even faster than the original
> single cmov plus branchy second max.  Even more so for AMD CPUs
> than Intel CPUs.
>
> Instead of hacking up some pattern recognition pass to transform
> integer mode memory-to-memory computation chains involving
> conditional moves to "vector" code (similar to what STV does
> for TImode ops on x86_64) the following simply allows SImode
> into SSE registers (support for this is already there in some
> important places like move patterns!).  For the particular
> case of 456.hmmer the required support is loads/stores
> (already implemented), SImode adds and SImode smax.
>
> So the patch adds a smax pattern for SImode (we don't have any
> for scalar modes but currently expand via a conditional move sequence)
> emitting as SSE vector max or cmp/cmov depending on the alternative.
>
> And it amends the *add<mode>_1 pattern with SSE alternatives
> (which have to come before the memory alternative as IRA otherwise
> doesn't consider reloading a memory operand to a register).
>
> With this in place the runtime of 456.hmmer improves by 10%
> on Haswell which is back to before regression speed but not
> to same levels as seen with manually editing just the single
> important loop.
>
> I'm currently benchmarking all SPEC CPU 2006 on Haswell.  More
> interesting is probably Zen where moves crossing the
> integer - vector domain are excessively expensive (they get
> done via the stack).


There was a znver2 CPU machine not doing anything useful overnight here
so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top
of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning
patches, I am afraid).  All benchmarks were run only once with options
-Ofast -march=native -mtune=native.

By far the biggest change was indeed 456.hmmer which improved by
incredible 35%.  There was no other change bigger than +- 1.5% in SPEC
2006 so the SPECint score grew by almost 3.4%.

I understand this patch fixes a regression in that benchmark but even
so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9
and with the patch is 20% faster than gcc 9.

In SPEC 2017, there were two changes worth mentioning although they
probably need to be confirmed and re-measured on top of the new tuning
changes.  525.x264_r regressed by 3.37% and 511.povray_r improved by
3.04%.

Martin


>
> Clearly this approach will run into register allocation issues
> but it looks cleaner than writing yet another STV-like pass
> (STV itself is quite awkwardly structured so I refrain from
> touching it...).
>
> Anyway - comments?  It seems to me that MMX-in-SSE does
> something very similar.
>
> Bootstrapped on x86_64-unknown-linux-gnu, previous testing
> revealed some issue.  Forgot that *add<mode>_1 also handles
> DImode..., fixed below, re-testing in progress.
>
> Thanks,
> Richard.
>
> 2019-07-23  Richard Biener  <rguent...@suse.de>
>
>       PR target/91154
>       * config/i386/i386.md (smaxsi3): New.
>       (*add<mode>_1): Add SSE and AVX variants.
>       * config/i386/i386.c (ix86_lea_for_add_ok): Do not allow
>       SSE registers.
>

Re: [PATCH][RFC][x86] Fix PR91154, add SImode smax, allow SImode add in SSE regs

Reply via email to