On Mon, 7 Jul 2025, Hongtao Liu wrote:

> On Mon, Jul 7, 2025 at 3:18 PM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Fri, Jul 4, 2025 at 5:45 PM Richard Biener <rguent...@suse.de> wrote:
> > >
> > > The following adds an x86 tuning to enable the use of AVX512 masked
> > > epilogues in cases we heuristically determine to be not detrimental
> > > with high probability.  The problematic cases are when there are
> > > data streams that are both stored to and loaded from and an outer
> > > loop could end up executing only the inner loop's masked epilogue;
> > > with unlucky data stream advancement from the outer loop we then
> > > end up needing to forward from masked stores to masked loads.  This
> > > isn't handled very well, especially for the case where unmasked
> > > operations would not need to forward at all - that is, when
> > > forwarding completely from the masked-out portion of the store
> > > (like from the AVX upper half of a store to the AVX lower half of a
> > > load).  There's also the case where the number of iterations is
> > > known at compile time; only with cost comparison would we consider
> > > a non-masked epilogue - as we are not doing that, we have to add
> > > heuristics to avoid masking when a single vector epilogue iteration
> > > would cover all scalar iterations left (this is exercised by
> > > gcc.target/i386/pr110310.c).
> > >
> > > SPEC CPU 2017 shows 3% text size savings over not using masked
> > > epilogues, with performance impact in the noise.  Masking all
> > > vector epilogues gets that to 4% text size savings, but with major
> > > runtime regressions in 503.bwaves_r and 527.cam4_r (measured on a
> > > Zen4 system); we're leaving a 5% improvement for 549.fotonik3d_r
> > > unrealized with the implemented heuristic.
> >
> > It looks interesting.
> > I'll try with avx256_masked_epilogues to see if there's something
> > unusual.
>
> Oh, no need for a new tune, avx512_masked_epilogues can directly be
> applied to those avx256_optimal avx512 processors, great!!!

Yes, it might be misnamed - it refers to the architectural masking
feature of AVX512, but via AVX512VL it extends to SSE and AVX2 vector
widths as well.  There's the possibility to add additional heuristics
of course.
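
For illustration only (a hand-written sketch, not part of the patch;
the function and names are made up), an AVX512VL masked epilogue at
AVX2 width would look like this in intrinsics form - the __mmask8
comes from AVX512, the 32-byte vector from AVX2:

  #include <immintrin.h>

  /* Handle the rem < 4 remaining doubles of a loop tail with a
     masked 32-byte load/store; requires -mavx512f -mavx512vl.  */
  static void
  tail (double *a, double b, int rem)
  {
    __mmask8 k = (__mmask8) ((1u << rem) - 1);  /* low 'rem' lanes */
    __m256d v = _mm256_maskz_loadu_pd (k, a);   /* masked 32-byte load */
    v = _mm256_mul_pd (v, _mm256_set1_pd (b));
    _mm256_mask_storeu_pd (a, k, v);            /* masked 32-byte store */
  }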

Richard.

> > > With the heuristics we turn 22513 vector epilogues + up to 12305
> > > scalar epilogues into 12305 masked vector epilogues, of which 574
> > > are for AVX vector sizes, 79 for SSE vector sizes and the rest for
> > > AVX512.  When masking all epilogues we get 14567 of them from
> > > 29467 vector + up to 14567 scalar epilogues, so the heuristics
> > > disable an additional 20% of masked epilogues.
> > >
> > > Bootstrapped and tested on x86_64-unknown-linux-gnu.
> > >
> > > OK?
> > >
> > > Thanks,
> > > Richard.
> > >
> > >         * config/i386/x86-tune.def (X86_TUNE_AVX512_MASKED_EPILOGUES):
> > >         New tunable, default on for m_ZNVER4 and m_ZNVER5.
> > >         * config/i386/i386.cc (ix86_vector_costs::finish_cost): With
> > >         X86_TUNE_AVX512_MASKED_EPILOGUES and when the main loop
> > >         had a vectorization factor > 2 use a masked epilogue when
> > >         possible and when not obviously problematic.
> > >
> > >         * gcc.target/i386/vect-mask-epilogue-1.c: New testcase.
> > >         * gcc.target/i386/vect-mask-epilogue-2.c: Likewise.
> > >         * gcc.target/i386/vect-epilogues-3.c: Adjust.
> > > ---
> > >  gcc/config/i386/i386.cc                       | 59 +++++++++++++++++++
> > >  gcc/config/i386/x86-tune.def                  |  5 ++
> > >  .../gcc.target/i386/vect-epilogues-3.c        |  2 +-
> > >  .../gcc.target/i386/vect-mask-epilogue-1.c    | 11 ++++
> > >  .../gcc.target/i386/vect-mask-epilogue-2.c    | 14 +++++
> > >  5 files changed, 90 insertions(+), 1 deletion(-)
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c
> > >  create mode 100644 gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c
> > >
> > > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > > index b64175d6c93..8e796ea4033 100644
> > > --- a/gcc/config/i386/i386.cc
> > > +++ b/gcc/config/i386/i386.cc
> > > @@ -26295,6 +26295,65 @@ ix86_vector_costs::finish_cost (const vector_costs *scalar_costs)
> > >        && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () >= 16)
> > >      m_suggested_epilogue_mode = V8QImode;
> > >
> > > +  /* When X86_TUNE_AVX512_MASKED_EPILOGUES is enabled try to use
> > > +     a masked epilogue if that doesn't seem detrimental.  */
> > > +  if (loop_vinfo
> > > +      && !LOOP_VINFO_EPILOGUE_P (loop_vinfo)
> > > +      && LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant () > 2
> > > +      && ix86_tune_features[X86_TUNE_AVX512_MASKED_EPILOGUES]
> > > +      && !OPTION_SET_P (param_vect_partial_vector_usage))
> > > +    {
> > > +      bool avoid = false;
> > > +      if (LOOP_VINFO_NITERS_KNOWN_P (loop_vinfo)
> > > +          && LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo) >= 0)
> > > +        {
> > > +          unsigned int peel_niter
> > > +            = LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo);
> > > +          if (LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo))
> > > +            peel_niter += 1;
> > > +          /* When we know the number of scalar iterations of the epilogue,
> > > +             avoid masking when a single vector epilogue iteration handles
> > > +             it in full.  */
> > > +          if (pow2p_hwi ((LOOP_VINFO_INT_NITERS (loop_vinfo) - peel_niter)
> > > +                         % LOOP_VINFO_VECT_FACTOR (loop_vinfo).to_constant ()))
> > > +            avoid = true;
> > > +        }
> > > +      if (!avoid && loop_outer (loop_outer (LOOP_VINFO_LOOP (loop_vinfo))))
> > > +        for (auto ddr : LOOP_VINFO_DDRS (loop_vinfo))
> > > +          {
> > > +            if (DDR_ARE_DEPENDENT (ddr) == chrec_known)
> > > +              ;
> > > +            else if (DDR_ARE_DEPENDENT (ddr) == chrec_dont_know)
> > > +              ;
> > > +            else
> > > +              {
> > > +                int loop_depth
> > > +                  = index_in_loop_nest (LOOP_VINFO_LOOP (loop_vinfo)->num,
> > > +                                        DDR_LOOP_NEST (ddr));
> > > +                if (DDR_NUM_DIST_VECTS (ddr) == 1
> > > +                    && DDR_DIST_VECTS (ddr)[0][loop_depth] == 0)
> > > +                  {
> > > +                    /* Avoid the case when there's an outer loop that might
> > > +                       traverse a multi-dimensional array with the inner
> > > +                       loop just executing the masked epilogue with a
> > > +                       read-write where the next outer iteration might
> > > +                       read from the masked part of the previous write,
> > > +                       'n' filling half a vector.
> > > +                         for (j = 0; j < m; ++j)
> > > +                           for (i = 0; i < n; ++i)
> > > +                             a[j][i] = c * a[j][i];  */
> > > +                    avoid = true;
> > > +                    break;
> > > +                  }
> > > +              }
> > > +          }
> > > +      if (!avoid)
> > > +        {
> > > +          m_suggested_epilogue_mode = loop_vinfo->vector_mode;
> > > +          m_masked_epilogue = 1;
> > > +        }
> > > +    }
> > > +
> > >    vector_costs::finish_cost (scalar_costs);
> > >  }
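
To make the pow2p_hwi check above concrete (a made-up example, not
from the patch, assuming a 512-bit main loop): with a known trip count
of 20 and a main-loop VF of 8 the epilogue has 20 % 8 == 4 scalar
iterations left; 4 is a power of two, so a single unmasked 4-lane
(ymm) epilogue iteration covers the tail exactly and we avoid masking:

  /* The VF 8 main loop (zmm doubles) leaves a tail of 4, which one
     unmasked ymm epilogue iteration handles in full.  */
  void
  f (double *restrict a, double *restrict b)
  {
    for (int i = 0; i < 20; ++i)
      a[i] = a[i] + b[i];
  }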

> > > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > > index 91cdca7fbfc..4773e5dd5ad 100644
> > > --- a/gcc/config/i386/x86-tune.def
> > > +++ b/gcc/config/i386/x86-tune.def
> > > @@ -639,6 +639,11 @@ DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, "avx512_store_by_pieces",
> > >  DEF_TUNE (X86_TUNE_AVX512_TWO_EPILOGUES, "avx512_two_epilogues",
> > >            m_ZNVER4 | m_ZNVER5)
> > >
> > > +/* X86_TUNE_AVX512_MASKED_EPILOGUES: Use masked vector epilogues
> > > +   where they fit.  */
> > > +DEF_TUNE (X86_TUNE_AVX512_MASKED_EPILOGUES, "avx512_masked_epilogues",
> > > +          m_ZNVER4 | m_ZNVER5)
> > > +
> > >  /*****************************************************************************/
> > >  /*****************************************************************************/
> > >  /* Historical relics: tuning flags that helps a specific old CPU designs     */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c b/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c
> > > index 0ee610f5e3e..e88ab30c770 100644
> > > --- a/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-epilogues-3.c
> > > @@ -1,5 +1,5 @@
> > >  /* { dg-do compile } */
> > > -/* { dg-options "-O3 -mavx512bw -mtune=znver4 -fdump-tree-vect-optimized" } */
> > > +/* { dg-options "-O3 -mavx512bw -mtune=znver4 --param vect-partial-vector-usage=0 -fdump-tree-vect-optimized" } */
> > >
> > >  int test (signed char *data, int n)
> > >  {
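
Note the existing test now pins --param vect-partial-vector-usage=0 so
it keeps exercising the unmasked two-epilogue path.  The new tuning
only applies when that param is not set explicitly, so e.g.

  gcc -O3 -march=znver5 --param vect-partial-vector-usage=0 ...

restores unmasked epilogues for a whole translation unit.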

> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c
> > > new file mode 100644
> > > index 00000000000..55519aa87fd
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-1.c
> > > @@ -0,0 +1,11 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O3 -march=znver5 -fdump-tree-vect-optimized" } */
> > > +
> > > +void bar (double *a, double *b, double c, int n, int m)
> > > +{
> > > +  for (int j = 0; j < m; ++j)
> > > +    for (int i = 0; i < n; ++i)
> > > +      a[j*n + i] = b[j*n + i] + c;
> > > +}
> > > +
> > > +/* { dg-final { scan-tree-dump "epilogue loop vectorized using masked 64 byte vectors" "vect" } } */
> > >
> > > diff --git a/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c
> > > new file mode 100644
> > > index 00000000000..3dc28b39b62
> > > --- /dev/null
> > > +++ b/gcc/testsuite/gcc.target/i386/vect-mask-epilogue-2.c
> > > @@ -0,0 +1,14 @@
> > > +/* { dg-do compile } */
> > > +/* { dg-options "-O3 -march=znver5 -fdump-tree-vect-optimized" } */
> > > +
> > > +void foo (double *a, double b, double c, int n, int m)
> > > +{
> > > +  for (int j = 0; j < m; ++j)
> > > +    for (int i = 0; i < n; ++i)
> > > +      a[j*n + i] = a[j*n + i] * b + c;
> > > +}
> > > +
> > > +/* We do not want to use a masked epilogue for the inner loop as the next
> > > +   outer iteration will possibly immediately read from elements masked off
> > > +   by the previous inner loop epilogue, and that never forwards.  */
> > > +/* { dg-final { scan-tree-dump "epilogue loop vectorized using 32 byte vectors" "vect" } } */
> > > --
> > > 2.43.0
> >
> > --
> > BR,
> > Hongtao

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)