> -----Original Message-----
> From: Richard Biener <[email protected]>
> Sent: Wednesday, September 17, 2025 2:55 PM
> To: Hongtao Liu <[email protected]>
> Cc: Liu, Hongtao <[email protected]>; [email protected];
> [email protected]
> Subject: Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move,store}_by
> pieces tune.
>
> On Wed, Sep 17, 2025 at 7:59 AM Hongtao Liu <[email protected]> wrote:
> >
> > On Wed, Sep 17, 2025 at 2:12 PM Hongtao Liu <[email protected]>
> wrote:
> > >
> > > On Wed, Sep 17, 2025 at 2:08 PM Hongtao Liu <[email protected]>
> wrote:
> > > >
> > > > On Tue, Sep 16, 2025 at 3:53 PM Liu, Hongtao <[email protected]>
> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Richard Biener <[email protected]>
> > > > > > Sent: Tuesday, September 16, 2025 3:03 PM
> > > > > > To: Liu, Hongtao <[email protected]>
> > > > > > Cc: [email protected]; [email protected]
> > > > > > Subject: Re: [PATCH] Remove SPR/GNR/DMR from
> > > > > > avx512_{move,store}_by pieces tune.
> > > > > >
> > > > > > On Tue, Sep 16, 2025 at 7:53 AM liuhongt <[email protected]>
> wrote:
> > > > > > >
> > > > > > > From: "hongtao.liu" <[email protected]>
> > > > > > >
> > > > > > > Align move_max with prefer_vector_width for SPR/GNR/DMR to
> > > > > > > avoid STLF
> > > > > > issue.
> > > > > > > It's similar as previous commit.
> > > > > > >
> > > > > > > commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123
> > > > > > > Author: liuhongt <[email protected]>
> > > > > > > Date: Thu Aug 15 12:54:07 2024 +0800
> > > > > > >
> > > > > > > Align ix86_{move_max,store_max} with vectorizer.
> > > > > > >
> > > > > > > When none of mprefer-vector-width,
> avx256_optimal/avx128_optimal,
> > > > > > > avx256_store_by_pieces/avx512_store_by_pieces is specified,
> GCC will
> > > > > > > set ix86_{move_max,store_max} as max available vector length
> except
> > > > > > > for AVX part.
> > > > > > >
> > > > > > > if (TARGET_AVX512F_P (opts->x_ix86_isa_flags)
> > > > > > > && TARGET_EVEX512_P
> > > > > > > (opts->x_ix86_isa_flags2))
> > > > > > > opts->x_ix86_move_max = PVW_AVX512;
> > > > > > > else
> > > > > > > opts->x_ix86_move_max = PVW_AVX128;
> > > > > > >
> > > > > > > So for -mavx2, vectorizer will choose 256-bit for
> > > > > > > vectorization, but
> > > > > > > 128-bit is used for struct copy, there could be a potential
> > > > > > > STLF
> issue
> > > > > > > due to this "misalign".
> > > > > > >
> > > > > > >
> > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.
> > > > > > > Ready push to trunk.
> > > > > >
> > > > > > Not an objection to this patch, but I wonder if we have too
> > > > > > many tunables where these kind of interactions cause more
> > > > > > problems than positives by having them?
> > > > > My experience is we should align MOVE_MAX with
> prefer_vector_width, vectorization and memcpy are 2 main parts which create
> vector load/store, they'd better align with each other.
> > > > > >
> > > > > > As general comment I'd say store-by-pieces with larger modes
> > > > > > is always good, but the read part for move-by-pieces can cause
> > > > > > STLF fails, not only from
> > > > > Yes, avx512_store_by_piece is still good, I'll adjust my patch.
> > > >
> > > > #define MOVE_MAX \
> > > > ((TARGET_AVX512F \
> > > > && (ix86_move_max == PVW_AVX512 \
> > > > || ix86_store_max == PVW_AVX512)) \
> > > > ? 64 \
> > > >
> > > > Since MOVE_MAX in x86 is also related to ix86_store_max, I'll
> > > > still remove SPR/GNR/DMR from avx512_store_by_pieces.
> > > > And why do we have ix86_store_max in MOVE_MAX?
> > > > I guess, it's because:
> > > > In the middle-end, the real size used for memset is decided by MIN
> > > > (MOVE_MAX_PIECES(alignment request), STORE_MAX_PIECES)
> > > >
> > > > Maybe we should remove the option mstore-max= and
> > > > {avx512,avx256}_store_by_pieces.
> > > > Since they eventually have the same impact as just setting
> ix86_move_max.
> > > Or maybe we should even remove -mmove-max and just set MOVE_MAX
> as
> > > prefer_vector_width due to STLF issue.
> > But for the double-pumping processor, store 256 followed by 512 load
> > probably ok, no STLF issue.
>
> Zen4 at least cannot forward this. Also even Zen5 has a 256bit wide store
> buffer, so stores are split. Curiously 512bit load/store can forward just
> fine
> (but not 2x256bit store -> 512bit load). In the past it was said the x86
> memory model makes any forwarding from multiple pieces "difficult" at least
> (though I don't remember any details).
>
> Can the Intel small cores forward 2xSSE store to 1xAVX load?
Yes for Sierraforest.(Guess it's because 1xAVX load is decoupled into 2xSSE
load)
>
> > So mmove-max could still be useful for that scenario.
> >
> > >
> > > >
> > > > >
> > > > > > vectorization but of course also from (non-contigous) scalar
> > > > > > stores performed by user code.
> > > > > >
> > > > > > I'm always advising CPU implementors to reduce STLF fail
> > > > > > penalties (or allow more forwarding cases). On the compilers
> > > > > > side we'd have to see to do less "local" (per stmt) decisions
> > > > > > on what modes to use but take into account context - the
> > > > > > post-fact fixup by the avoid-store-forwarding pass is likely not
> > > > > > very
> effective?
> > > > > I think the pass can't handle inter-procedure pointer analysis, and
> there're memcpy and vectorization in caller/callee with different size, worst
> case is small store in the caller followed by big load in the callee.
> > > > > >
> > > > > > > gcc/ChangeLog:
> > > > > > >
> > > > > > > * config/i386/x86-tune.def
> (X86_TUNE_AVX512_MOVE_BY_PIECES):
> > > > > > > (X86_TUNE_AVX512_STORE_BY_PIECES):
> > > > > > >
> > > > > > > gcc/testsuite/ChangeLog:
> > > > > > >
> > > > > > > * gcc.target/i386/pieces-memcpy-18.c: Use -mtune=znver5
> > > > > > > instead of -mtune=sapphirerapids.
> > > > > > > * gcc.target/i386/pieces-memcpy-21.c: Ditto.
> > > > > > > * gcc.target/i386/pieces-memset-46.c: Ditto.
> > > > > > > * gcc.target/i386/pieces-memset-49.c: Ditto.
> > > > > > > ---
> > > > > > > gcc/config/i386/x86-tune.def | 6 ++----
> > > > > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c | 2 +-
> > > > > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c | 2 +-
> > > > > > > gcc/testsuite/gcc.target/i386/pieces-memset-46.c | 2 +-
> > > > > > > gcc/testsuite/gcc.target/i386/pieces-memset-49.c | 2 +-
> > > > > > > 5 files changed, 6 insertions(+), 8 deletions(-)
> > > > > > >
> > > > > > > diff --git a/gcc/config/i386/x86-tune.def
> > > > > > > b/gcc/config/i386/x86-tune.def index
> > > > > > > a86cbad281c..8e642aa92a2 100644
> > > > > > > --- a/gcc/config/i386/x86-tune.def
> > > > > > > +++ b/gcc/config/i386/x86-tune.def
> > > > > > > @@ -625,14 +625,12 @@ DEF_TUNE
> > > > > > (X86_TUNE_AVX256_STORE_BY_PIECES,
> > > > > > > "avx256_store_by_pieces",
> > > > > > > /* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize
> move_by_pieces
> > > > > > > with
> > > > > > 512-bit
> > > > > > > AVX instructions. */
> > > > > > > DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES,
> > > > > > "avx512_move_by_pieces",
> > > > > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS |
> m_GRANITERAPIDS_D
> > > > > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5)
> > > > > > > + m_ZNVER4 | m_ZNVER5)
> > > > > > >
> > > > > > > /* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize
> > > > > > > store_by_pieces with
> > > > > > 512-bit
> > > > > > > AVX instructions. */
> > > > > > > DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES,
> > > > > > "avx512_store_by_pieces",
> > > > > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS |
> m_GRANITERAPIDS_D
> > > > > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5)
> > > > > > > + m_ZNVER4 | m_ZNVER5)
> > > > > > >
> > > > > > > /* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues
> > > > > > > for
> > > > > > 512-bit
> > > > > > > vectorized loops. */
> > > > > > > diff --git
> > > > > > > a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > > > index b15a0db9ff0..b4995ac0598 100644
> > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c
> > > > > > > @@ -1,5 +1,5 @@
> > > > > > > /* { dg-do compile } */
> > > > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */
> > > > > > > +/* { dg-options "-O2 -march=znver5" } */
> > > > > > >
> > > > > > > extern char *dst, *src;
> > > > > > >
> > > > > > > diff --git
> > > > > > > a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > > > index ef439f20f74..804a2989d64 100644
> > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c
> > > > > > > @@ -1,5 +1,5 @@
> > > > > > > /* { dg-do compile } */
> > > > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64
> > > > > > > -mavx2" } */
> > > > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" }
> > > > > > > +*/
> > > > > > >
> > > > > > > extern char *dst, *src;
> > > > > > >
> > > > > > > diff --git
> > > > > > > a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > > > index be1b054eed2..43d636ee3ff 100644
> > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c
> > > > > > > @@ -1,5 +1,5 @@
> > > > > > > /* { dg-do compile } */
> > > > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */
> > > > > > > +/* { dg-options "-O2 -march=znver5" } */
> > > > > > >
> > > > > > > extern char *dst;
> > > > > > >
> > > > > > > diff --git
> > > > > > > a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > > > index ad43f89a9bd..ca4933ac1d8 100644
> > > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c
> > > > > > > @@ -1,5 +1,5 @@
> > > > > > > /* { dg-do compile } */
> > > > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64
> > > > > > > -mavx2" } */
> > > > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" }
> > > > > > > +*/
> > > > > > >
> > > > > > > extern char *dst;
> > > > > > >
> > > > > > > --
> > > > > > > 2.34.1
> > > > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > BR,
> > > > Hongtao
> > >
> > >
> > >
> > > --
> > > BR,
> > > Hongtao
> >
> >
> >
> > --
> > BR,
> > Hongtao