DMR from avx512_{move, store}_by pieces tune.

Hongtao Liu Tue, 28 Oct 2025 22:02:01 -0700

On Tue, Oct 28, 2025 at 9:34 PM Kumar, Venkataramanan
<[email protected]> wrote:
>
> [Public]
>
> Hi Liu,
>
> > -----Original Message-----
> > From: Richard Biener <[email protected]>
> > Sent: Tuesday, September 16, 2025 2:29 PM
> > To: Liu, Hongtao <[email protected]>
> > Cc: [email protected]; [email protected]
> > Subject: Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move, store}_by
> > pieces tune.
> >
> > Caution: This message originated from an External Source. Use proper caution
> > when opening attachments, clicking links, or responding.
> >
> >
> > On Tue, Sep 16, 2025 at 9:53 AM Liu, Hongtao <[email protected]> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Richard Biener <[email protected]>
> > > > Sent: Tuesday, September 16, 2025 3:03 PM
> > > > To: Liu, Hongtao <[email protected]>
> > > > Cc: [email protected]; [email protected]
> > > > Subject: Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move,store}_by
> > > > pieces tune.
> > > >
> > > > On Tue, Sep 16, 2025 at 7:53 AM liuhongt <[email protected]> wrote:
> > > > >
> > > > > From: "hongtao.liu" <[email protected]>
> > > > >
> > > > > Align move_max with prefer_vector_width for SPR/GNR/DMR to avoid
> > > > > STLF
> > > > issue.
> > > > > It's similar as previous commit.
> > > > >
> > > > > commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123
> > > > > Author: liuhongt <[email protected]>
> > > > > Date:   Thu Aug 15 12:54:07 2024 +0800
> > > > >
> > > > >     Align ix86_{move_max,store_max} with vectorizer.
> > > > >
> > > > >     When none of mprefer-vector-width, avx256_optimal/avx128_optimal,
> > > > >     avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC 
> > > > > will
> > > > >     set ix86_{move_max,store_max} as max available vector length 
> > > > > except
> > > > >     for AVX part.
> > > > >
> > > > >                   if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
> > > > >                       && TARGET_EVEX512_P (opts->x_ix86_isa_flags2))
> > > > >                     opts->x_ix86_move_max = PVW_AVX512;
> > > > >                   else
> > > > >                     opts->x_ix86_move_max = PVW_AVX128;
> > > > >
> > > > >     So for -mavx2, vectorizer will choose 256-bit for vectorization, 
> > > > > but
> > > > >     128-bit is used for struct copy, there could be a potential STLF 
> > > > > issue
> > > > >     due to this "misalign".
> > > > >
> > > > >
> > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > > Ready push to trunk.
> > > >
> > > > Not an objection to this patch, but I wonder if we have too many
> > > > tunables where these kind of interactions cause more problems than
> > > > positives by having them?
> > > My experience is we should align MOVE_MAX with prefer_vector_width,
> > vectorization and memcpy are 2 main parts which create vector load/store,
> > they'd better align with each other.
> > > >
> > > > As general comment I'd say store-by-pieces with larger modes is
> > > > always good, but the read part for move-by-pieces can cause STLF
> > > > fails, not only from
> > > Yes, avx512_store_by_piece is still good, I'll adjust my patch.
>
> Sorry for asking this late.
>
> The "X86_TUNE_AVX512_MOVE_BY_PIECES" is deprecated in staging.
> My understanding is that X86_TUNE_AVX512_MOVE_BY_PIECES is now used to tell 
> the compiler to use 512 vectors for "memset" as well as "memcpy". Also, this 
> tuning (move_by_pieces) can also lead to STLF stalls,  when there are narrow 
> store and then this tuning subsequently creates wider loads.
>
> How do I avoid these stalls but still ask GCC to use 512 bit "memset" for my 
> targets.
You can use -mprefer-vector-width=512 to enable zmm for both
vectorizer and memcpy/memset.
512-bit vector performance is good insce SPR.(We are planning to
enable it by default.)
>
> Regards,
> Venkat.
> > >
> > > > vectorization but of course also from (non-contigous) scalar stores
> > > > performed by user code.
> > > >
> > > > I'm always advising CPU implementors to reduce STLF fail penalties
> > > > (or allow more forwarding cases).  On the compilers side we'd have
> > > > to see to do less "local" (per stmt) decisions on what modes to use
> > > > but take into account context - the post-fact fixup by the
> > > > avoid-store-forwarding pass is likely not very effective?
> > > I think the pass can't handle inter-procedure pointer analysis, and 
> > > there're
> > memcpy and vectorization in caller/callee with different size, worst case is
> > small store in the caller followed by big load in the callee.
> >
> > True, IIRC this happened in 538.imagick_r at some point.
> >
> > There's also __attribute__((target)), so caller/callee could disagree about 
> > vector
> > ISA (and/or size preference).  But I guess that's not happening too often 
> > to be of
> > concern.
> >
> > Richard.
> >
> > > >
> > > > > gcc/ChangeLog:
> > > > >
> > > > >         * config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES):
> > > > >         (X86_TUNE_AVX512_STORE_BY_PIECES):
> > > > >
> > > > > gcc/testsuite/ChangeLog:
> > > > >
> > > > >         * gcc.target/i386/pieces-memcpy-18.c: Use -mtune=znver5
> > > > >         instead of -mtune=sapphirerapids.
> > > > >         * gcc.target/i386/pieces-memcpy-21.c: Ditto.
> > > > >         * gcc.target/i386/pieces-memset-46.c: Ditto.
> > > > >         * gcc.target/i386/pieces-memset-49.c: Ditto.
> > > > > ---
> > > > >  gcc/config/i386/x86-tune.def                     | 6 ++----
> > > > >  gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c | 2 +-
> > > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c | 2 +-
> > > > > gcc/testsuite/gcc.target/i386/pieces-memset-46.c | 2 +-
> > > > > gcc/testsuite/gcc.target/i386/pieces-memset-49.c | 2 +-
> > > > >  5 files changed, 6 insertions(+), 8 deletions(-)
> > > > >
> > > > > diff --git a/gcc/config/i386/x86-tune.def
> > > > > b/gcc/config/i386/x86-tune.def index a86cbad281c..8e642aa92a2
> > > > > 100644
> > > > > --- a/gcc/config/i386/x86-tune.def
> > > > > +++ b/gcc/config/i386/x86-tune.def
> > > > > @@ -625,14 +625,12 @@ DEF_TUNE
> > > > (X86_TUNE_AVX256_STORE_BY_PIECES,
> > > > > "avx256_store_by_pieces",
> > > > >  /* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces
> > with
> > > > 512-bit
> > > > >     AVX instructions.  */
> > > > >  DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES,
> > > > "avx512_move_by_pieces",
> > > > > -         m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D
> > > > > -         | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5)
> > > > > +          m_ZNVER4 | m_ZNVER5)
> > > > >
> > > > >  /* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with
> > > > 512-bit
> > > > >     AVX instructions.  */
> > > > >  DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES,
> > > > "avx512_store_by_pieces",
> > > > > -         m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D
> > > > > -         | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5)
> > > > > +          m_ZNVER4 | m_ZNVER5)
> > > > >
> > > > >  /* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for
> > > > 512-bit
> > > > >     vectorized loops.  */
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > index b15a0db9ff0..b4995ac0598 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > @@ -1,5 +1,5 @@
> > > > >  /* { dg-do compile } */
> > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */
> > > > > +/* { dg-options "-O2 -march=znver5" } */
> > > > >
> > > > >  extern char *dst, *src;
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > index ef439f20f74..804a2989d64 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > @@ -1,5 +1,5 @@
> > > > >  /* { dg-do compile } */
> > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2"
> > > > > } */
> > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */
> > > > >
> > > > >  extern char *dst, *src;
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > index be1b054eed2..43d636ee3ff 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > @@ -1,5 +1,5 @@
> > > > >  /* { dg-do compile } */
> > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */
> > > > > +/* { dg-options "-O2 -march=znver5" } */
> > > > >
> > > > >  extern char *dst;
> > > > >
> > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > index ad43f89a9bd..ca4933ac1d8 100644
> > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > @@ -1,5 +1,5 @@
> > > > >  /* { dg-do compile } */
> > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2"
> > > > > } */
> > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */
> > > > >
> > > > >  extern char *dst;
> > > > >
> > > > > --
> > > > > 2.34.1
> > > > >




-- 
BR,
Hongtao

Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move, store}_by pieces tune.

Reply via email to