Hello, On Tue, Jul 23 2019, Richard Biener wrote: > The following fixes the runtime regression of 456.hmmer caused > by matching ICC in code generation and using cmov more aggressively > (through GIMPLE level MAX_EXPR usage). Appearantly (discovered > by manual assembler editing) using the SSE unit for performing > SImode loads, adds and then two singed max operations plus stores > is quite a bit faster than cmovs - even faster than the original > single cmov plus branchy second max. Even more so for AMD CPUs > than Intel CPUs. > > Instead of hacking up some pattern recognition pass to transform > integer mode memory-to-memory computation chains involving > conditional moves to "vector" code (similar to what STV does > for TImode ops on x86_64) the following simply allows SImode > into SSE registers (support for this is already there in some > important places like move patterns!). For the particular > case of 456.hmmer the required support is loads/stores > (already implemented), SImode adds and SImode smax. > > So the patch adds a smax pattern for SImode (we don't have any > for scalar modes but currently expand via a conditional move sequence) > emitting as SSE vector max or cmp/cmov depending on the alternative. > > And it amends the *add<mode>_1 pattern with SSE alternatives > (which have to come before the memory alternative as IRA otherwise > doesn't consider reloading a memory operand to a register). > > With this in place the runtime of 456.hmmer improves by 10% > on Haswell which is back to before regression speed but not > to same levels as seen with manually editing just the single > important loop. > > I'm currently benchmarking all SPEC CPU 2006 on Haswell. More > interesting is probably Zen where moves crossing the > integer - vector domain are excessively expensive (they get > done via the stack).
There was a znver2 CPU machine not doing anything useful overnight here so I benchmarked your patch using SPEC 2006 and SPEC CPUrate 2017 on top of trunk r273663 (I forgot to pull, so before Honza's znver2 tuning patches, I am afraid). All benchmarks were run only once with options -Ofast -march=native -mtune=native. By far the biggest change was indeed 456.hmmer which improved by incredible 35%. There was no other change bigger than +- 1.5% in SPEC 2006 so the SPECint score grew by almost 3.4%. I understand this patch fixes a regression in that benchmark but even so, 456.hmmer built with the Monday trunk was 23% slower than with gcc 9 and with the patch is 20% faster than gcc 9. In SPEC 2017, there were two changes worth mentioning although they probably need to be confirmed and re-measured on top of the new tuning changes. 525.x264_r regressed by 3.37% and 511.povray_r improved by 3.04%. Martin > > Clearly this approach will run into register allocation issues > but it looks cleaner than writing yet another STV-like pass > (STV itself is quite awkwardly structured so I refrain from > touching it...). > > Anyway - comments? It seems to me that MMX-in-SSE does > something very similar. > > Bootstrapped on x86_64-unknown-linux-gnu, previous testing > revealed some issue. Forgot that *add<mode>_1 also handles > DImode..., fixed below, re-testing in progress. > > Thanks, > Richard. > > 2019-07-23 Richard Biener <rguent...@suse.de> > > PR target/91154 > * config/i386/i386.md (smaxsi3): New. > (*add<mode>_1): Add SSE and AVX variants. > * config/i386/i386.c (ix86_lea_for_add_ok): Do not allow > SSE registers. >