On Mon, 7 Jul 2025, Hongtao Liu wrote:

> On Mon, Jul 7, 2025 at 3:18 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Fri, Jul 4, 2025 at 5:45 PM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > The following adds an x86 tuning to enable the use of AVX512 masked
> > > epilogues in cases we heuristically determine to be not detrimental
> > > with high probability.  The problematic cases are when there are
> > > data streams that are both stored to and loaded from and an outer
> > > loop could end up executing only the inner loop's masked epilogue;
> > > with unlucky data stream advancement from the outer loop we then
> > > end up needing to forward from masked stores to masked loads.  This
> > > isn't handled very well, especially for the case where unmasked
> > > operations would not need to forward at all - that is, when
> > > forwarding completely from the masked-out portion of the store
> > > (like from the AVX upper half of a store to the AVX lower half of a
> > > load).  There's also the case where the number of iterations is
> > > known at compile time; only with cost comparison would we consider
> > > a non-masked epilogue - as we are not doing that, we have to add
> > > heuristics to avoid masking when a single vector epilogue iteration
> > > would cover all scalar iterations left (this is exercised by
> > > gcc.target/i386/pr110310.c).
> > >
> > > SPEC CPU 2017 shows 3% text size savings over not using masked
> > > epilogues, with performance impact in the noise.  Masking all
> > > vector epilogues gets that to 4% text size savings, but with major
> > > runtime regressions in 503.bwaves_r and 527.cam4_r (measured on a
> > > Zen4 system); we're leaving a 5% improvement for 549.fotonik3d_r
> > > unrealized with the implemented heuristic.
> >
> > It looks interesting.
> > I'll try with avx256_masked_epilogues to see if there's something
> > unusual.
>
> Oh, no need for a new tune, avx512_masked_epilogues can directly be
> applied to those avx256_optimal avx512 processors, great!!!

Yes, it might be misnamed - it refers to the architectural masking
feature of AVX512, but via AVX512VL it extends to SSE and AVX2 vector
widths as well.  There's the possibility to add additional heuristics
of course.
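
For illustration only (a hand-written sketch, not part of the patch;
the function and names are made up), an AVX512VL masked epilogue at
AVX2 width would look like this in intrinsics form - the __mmask8
comes from AVX512, the 32-byte vector from AVX2:

  #include <immintrin.h>

  /* Handle the rem < 4 remaining doubles of a loop tail with a
     masked 32-byte load/store; requires -mavx512f -mavx512vl.  */
  static void
  tail (double *a, double b, int rem)
  {
    __mmask8 k = (__mmask8) ((1u << rem) - 1);  /* low 'rem' lanes */
    __m256d v = _mm256_maskz_loadu_pd (k, a);   /* masked 32-byte load */
    v = _mm256_mul_pd (v, _mm256_set1_pd (b));
    _mm256_mask_storeu_pd (a, k, v);            /* masked 32-byte store */
  }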

Richard.

> > > With the heuristics we turn 22513 vector epilogues + up to 12305
> > > scalar epilogues into 12305 masked vector epilogues, of which 574
> > > are for AVX vector sizes, 79 for SSE vector sizes and the rest for
> > > AVX512.  When masking all epilogues we get 14567 of them from
> > > 29467 vector + up to 14567 scalar epilogues, so the heuristics
> > > disable an additional 20% of masked epilogues.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > OK?
> > >
> > > Thanks,
> > > Richard.
> > >
> > >         * config/i386/x86-tune.def (X86_TUNE_AVX512_MASKED_EPILOGUES):
> > >         New tunable, default on for m_ZNVER4 and m_ZNVER5.
> > >         * config/i386/i386.cc (ix86_vector_costs::finish_cost): With
> > >         X86_TUNE_AVX512_MASKED_EPILOGUES and when the main loop
> > >         had a vectorization factor > 2 use a masked epilogue when
> > >         possible and when not obviously problematic.
> > >
> > >         * gcc.target/i386/vect-mask-epilogue-1.c: New testcase.
> > >         * gcc.target/i386/vect-mask-epilogue-2.c: Likewise.
> > >         * gcc.target/i386/vect-epilogues-3.c: Adjust.
> > > ---
> > >  gcc/config/i386/i386.cc                       | 59 +++++++++++++++++++
> > >  gcc/config/i386/x86-tune.def                  |  5 ++
> > >  .../gcc.target/i386/vect-epilogues-3.c        |  2 +-
> > >  .../gcc.target/i386/vect-mask-epilogue-1.c    | 11 ++++
> > >  .../gcc.target/i386/vect-mask-epilogue-2.c    | 14 +++++
> > >  5 files changed, 90 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index b64175d6c93..8e796ea4033 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -26295,6 +26295,65 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > >        && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
> > >      m_suggested_epilogue_mode = V8QImode;
> > >
> > > +  /* When X86_TUNE_AVX512_MASKED_EPILOGUES is enabled try to use
> > > +     a masked epilogue if that doesn't seem detrimental.  */
> > > +  if (loop_vinfo
> > > +      && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > > +      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () > 2
> > > +      && ix86_tune_features[X86_TUNE_AVX512_MASKED_EPILOGUES]
> > > +      && !OPTION_SET_P (param_vect_partial_vector_usage))
> > > +    {
> > > +      bool avoid = false;
> > > +      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > +          && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > +        {
> > > +          unsigned int peel_niter
> > > +            = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> > > +          if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > +            peel_niter += 1;
> > > +          /* When we know the number of scalar iterations of the epilogue,
> > > +             avoid masking when a single vector epilogue iteration handles
> > > +             it in full.  */
> > > +          if (pow2p_hwi ((LOOP_VINFO_INT_NITERS (loop_vinfo) - peel_niter)
> > > +                         % LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()))
> > > +            avoid = true;
> > > +        }
> > > +      if (!avoid && loop_outer (loop_outer (LOOP_VINFO_LOOP (loop_vinfo))))
> > > +        for (auto ddr : LOOP_VINFO_DDRS (loop_vinfo))
> > > +          {
> > > +            if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> > > +              ;
> > > +            else if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
> > > +              ;
> > > +            else
> > > +              {
> > > +                int loop_depth
> > > +                  = index_in_loop_nest (LOOP_VINFO_LOOP (loop_vinfo)->num,
> > > +                                        DDR_LOOP_NEST (ddr));
> > > +                if (DDR_NUM_DIST_VECTS (ddr) == 1
> > > +                    && DDR_DIST_VECTS (ddr)[0][loop_depth] == 0)
> > > +                  {
> > > +                    /* Avoid the case when there's an outer loop that might
> > > +                       traverse a multi-dimensional array with the inner
> > > +                       loop just executing the masked epilogue with a
> > > +                       read-write where the next outer iteration might
> > > +                       read from the masked part of the previous write,
> > > +                       'n' filling half a vector.
> > > +                         for (j = 0; j < m; ++j)
> > > +                           for (i = 0; i < n; ++i)
> > > +                             a[j][i] = c * a[j][i];  */
> > > +                    avoid = true;
> > > +                    break;
> > > +                  }
> > > +              }
> > > +          }
> > > +      if (!avoid)
> > > +        {
> > > +          m_suggested_epilogue_mode = loop_vinfo->vector_mode;
> > > +          m_masked_epilogue = 1;
> > > +        }
> > > +    }
> > > +
> > >    vector_costs::finish_cost (scalar_costs);
> > >  }
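
To make the pow2p_hwi check above concrete (a made-up example, not
from the patch, assuming a 512-bit main loop): with a known trip count
of 20 and a main-loop VF of 8 the epilogue has 20 % 8 == 4 scalar
iterations left; 4 is a power of two, so a single unmasked 4-lane
(ymm) epilogue iteration covers the tail exactly and we avoid masking:

  /* The VF 8 main loop (zmm doubles) leaves a tail of 4, which one
     unmasked ymm epilogue iteration handles in full.  */
  void
  f (double *restrict a, double *restrict b)
  {
    for (int i = 0; i < 20; ++i)
      a[i] = a[i] + b[i];
  }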

> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 91cdca7fbfc..4773e5dd5ad 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -639,6 +639,11 @@ DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> > >  DEF_TUNE (X86_TUNE_AVX512_TWO_EPILOGUES, "avx512_two_epilogues",
> > >            m_ZNVER4 | m_ZNVER5)
> > >
> > > +/* X86_TUNE_AVX512_MASKED_EPILOGUES: Use masked vector epilogues
> > > +   where they fit.  */
> > > +DEF_TUNE (X86_TUNE_AVX512_MASKED_EPILOGUES, "avx512_masked_epilogues",
> > > +          m_ZNVER4 | m_ZNVER5)
> > > +
> > >  /*****************************************************************************/
> > >  /*****************************************************************************/
> > >  /* Historical relics: tuning flags that helps a specific old CPU designs     */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c b/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c
> > > index 0ee610f5e3e..e88ab30c770 100644
> > > --- a/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c
> > > @@ -1,5 +1,5 @@
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O3 -mavx512bw -mtune=znver4 -fdump-tree-vect-optimized" } */
> > > +/* { dg-options "-O3 -mavx512bw -mtune=znver4 --param vect-partial-vector-usage=0 -fdump-tree-vect-optimized" } */
> > >
> > >  int test (signed char *data, int n)
> > >  {
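
Note the existing test now pins --param vect-partial-vector-usage=0 so
it keeps exercising the unmasked two-epilogue path.  The new tuning
only applies when that param is not set explicitly, so e.g.

  gcc -O3 -march=znver5 --param vect-partial-vector-usage=0 ...

restores unmasked epilogues for a whole translation unit.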

> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c
> > > new file mode 100644
> > > index 00000000000..55519aa87fd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c
> > > @@ -0,0 +1,11 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O3 -march=znver5 -fdump-tree-vect-optimized" } */
> > > +
> > > +void bar (double *a, double *b, double c, int n, int m)
> > > +{
> > > +  for (int j = 0; j < m; ++j)
> > > +    for (int i = 0; i < n; ++i)
> > > +      a[j*n + i] = b[j*n + i] + c;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "epilogue loop vectorized using masked 64 byte vectors" "vect" } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c
> > > new file mode 100644
> > > index 00000000000..3dc28b39b62
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c
> > > @@ -0,0 +1,14 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O3 -march=znver5 -fdump-tree-vect-optimized" } */
> > > +
> > > +void foo (double *a, double b, double c, int n, int m)
> > > +{
> > > +  for (int j = 0; j < m; ++j)
> > > +    for (int i = 0; i < n; ++i)
> > > +      a[j*n + i] = a[j*n + i] * b + c;
> > > +}
> > > +
> > > +/* We do not want to use a masked epilogue for the inner loop as the next
> > > +   outer iteration will possibly immediately read from elements masked off
> > > +   by the previous inner loop epilogue, and that never forwards.  */
> > > +/* { dg-final { scan-tree-dump "epilogue loop vectorized using 32 byte vectors" "vect" } } */
> > > --
> > > 2.43.0
> >
> > --
> > BR,
> > Hongtao

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)