> The following adds an x86 tuning to enable the use of AVX512 masked
> epilogues in cases where we heuristically determine that masking is
> very likely not detrimental.  Basically problematic cases are when there are
> data streams that are both stored and loaded from and an outer loop
> could end up executing only the inner loop masked epilogue and with
> unlucky data stream advancement from the outer loop end up needing
> to forward from masked stores to masked loads.  This isn't very
> well handled, esp. for the case where unmasked operations would
> not need to forward at all - that is, when forwarding completely
> from the masked out portion of the store (like the AVX upper half
> to the AVX lower half of a load).  There's also the case where
> the number of iterations is known at compile time: only by
> comparing costs would we consider a non-masked epilogue, and as we
> are not doing that we have to add heuristics to avoid masking when
> a single vector epilogue iteration would cover all remaining scalar
> iterations (this is exercised by gcc.target/i386/pr110310.c).
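
For illustration, the problematic loop shape described above can be
sketched in plain C.  This is a scalar emulation under stated
assumptions, not code from the patch: VF, add_scalar,
add_masked_epilogue and sweep are hypothetical names, and the lane mask
is modelled with an integer instead of an AVX512 mask register.

```c
#include <stddef.h>

#define VF 8  /* hypothetical vectorization factor (lanes per vector) */

/* Reference: in-place update a[i] += b[i] as a plain scalar loop. */
void add_scalar(float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] += b[i];
}

/* Same update, shaped like a vectorized loop with a masked epilogue:
   full VF-wide chunks first, then one partial chunk in which only
   the low `rem` lanes are active. */
void add_masked_epilogue(float *a, const float *b, size_t n)
{
    size_t i = 0;

    /* Main "vector" loop: every lane of each chunk is active. */
    for (; i + VF <= n; i += VF)
        for (size_t lane = 0; lane < VF; lane++)
            a[i + lane] += b[i + lane];

    /* Masked epilogue: rem < VF here, so the shift is well defined. */
    size_t rem = n - i;
    unsigned mask = (1u << rem) - 1u;
    for (size_t lane = 0; lane < VF; lane++)
        if (mask & (1u << lane))        /* inactive lanes touch nothing */
            a[i + lane] += b[i + lane];
}

/* Outer loop whose inner trip count is below VF: each outer iteration
   runs only the masked epilogue, and the stream `a` advances by
   `stride` elements per iteration. */
void sweep(float *a, const float *b, size_t rows,
           size_t inner_n, size_t stride)
{
    for (size_t r = 0; r < rows; r++)
        add_masked_epilogue(a + r * stride, b, inner_n);
}
```

With inner_n = 3 and stride = 3, iteration r stores lanes 0..2 of an
8-lane window at a + 3r, and iteration r + 1 loads an 8-lane window at
a + 3r + 3.  That load window overlaps only lanes 3..7 of the previous
store window, all of which were masked out, so no data actually flows
between the two.  As I read the description, this is the situation
where unmasked operations would not need to forward at all while the
masked ones may still be treated as a forwarding case.
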
> 
> SPEC CPU 2017 shows 3% text size savings over not using masked
> epilogues with performance impact in the noise.  Masking all vector
> epilogues gets that to 4% text size savings but with some major
> runtime regressions in 503.bwaves_r and 527.cam4_r
> (measured on a Zen4 system).  With the implemented heuristic we
> are leaving a 5% improvement for 549.fotonik3d_r unrealized.
> 
> With the heuristics we turn 22513 vector epilogues + up to 12305 scalar
> epilogues into 12305 masked vector epilogues of which 574 are for
> AVX vector sizes, 79 for SSE vector sizes and the rest for AVX512.
> When masking all epilogues we get 14567 of them from
> 29467 vector + up to 14567 scalar epilogues, so the heuristics disable
> an additional 20% of masked epilogues.
> 
> Bootstrapped and tested on x86_64-unknown-linux-gnu.
> 
> OK?
> 
> Thanks,
> Richard.
> 
>       * config/i386/x86-tune.def (X86_TUNE_AVX512_MASKED_EPILOGUES):
>       New tunable, default on for m_ZNVER4 and m_ZNVER5.
>       * config/i386/i386.cc (ix86_vector_costs::finish_cost): With
>       X86_TUNE_AVX512_MASKED_EPILOGUES and when the main loop
>       had a vectorization factor > 2 use a masked epilogue when
>       possible and when not obviously problematic.
> 
>       * gcc.target/i386/vect-mask-epilogue-1.c: New testcase.
>       * gcc.target/i386/vect-mask-epilogue-2.c: Likewise.
>       * gcc.target/i386/vect-epilogues-3.c: Adjust.
OK,
thanks!
Honza
