On Tue, Oct 28, 2025 at 9:34 PM Kumar, Venkataramanan <[email protected]> wrote: > > [Public] > > Hi Liu, > > > -----Original Message----- > > From: Richard Biener <[email protected]> > > Sent: Tuesday, September 16, 2025 2:29 PM > > To: Liu, Hongtao <[email protected]> > > Cc: [email protected]; [email protected] > > Subject: Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move, store}_by > > pieces tune. > > > > Caution: This message originated from an External Source. Use proper caution > > when opening attachments, clicking links, or responding. > > > > > > On Tue, Sep 16, 2025 at 9:53 AM Liu, Hongtao <[email protected]> wrote: > > > > > > > > > > > > > -----Original Message----- > > > > From: Richard Biener <[email protected]> > > > > Sent: Tuesday, September 16, 2025 3:03 PM > > > > To: Liu, Hongtao <[email protected]> > > > > Cc: [email protected]; [email protected] > > > > Subject: Re: [PATCH] Remove SPR/GNR/DMR from avx512_{move,store}_by > > > > pieces tune. > > > > > > > > On Tue, Sep 16, 2025 at 7:53 AM liuhongt <[email protected]> wrote: > > > > > > > > > > From: "hongtao.liu" <[email protected]> > > > > > > > > > > Align move_max with prefer_vector_width for SPR/GNR/DMR to avoid > > > > > STLF > > > > issue. > > > > > It's similar as previous commit. > > > > > > > > > > commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123 > > > > > Author: liuhongt <[email protected]> > > > > > Date: Thu Aug 15 12:54:07 2024 +0800 > > > > > > > > > > Align ix86_{move_max,store_max} with vectorizer. > > > > > > > > > > When none of mprefer-vector-width, avx256_optimal/avx128_optimal, > > > > > avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC > > > > > will > > > > > set ix86_{move_max,store_max} as max available vector length > > > > > except > > > > > for AVX part. > > > > > > > > > > if (TARGET_AVX512F_P (opts->x_ix86_isa_flags) > > > > > && TARGET_EVEX512_P (opts->x_ix86_isa_flags2)) > > > > > opts->x_ix86_move_max = PVW_AVX512; > > > > > else > > > > > opts->x_ix86_move_max = PVW_AVX128; > > > > > > > > > > So for -mavx2, vectorizer will choose 256-bit for vectorization, > > > > > but > > > > > 128-bit is used for struct copy, there could be a potential STLF > > > > > issue > > > > > due to this "misalign". > > > > > > > > > > > > > > > Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. > > > > > Ready push to trunk. > > > > > > > > Not an objection to this patch, but I wonder if we have too many > > > > tunables where these kind of interactions cause more problems than > > > > positives by having them? > > > My experience is we should align MOVE_MAX with prefer_vector_width, > > vectorization and memcpy are 2 main parts which create vector load/store, > > they'd better align with each other. > > > > > > > > As general comment I'd say store-by-pieces with larger modes is > > > > always good, but the read part for move-by-pieces can cause STLF > > > > fails, not only from > > > Yes, avx512_store_by_piece is still good, I'll adjust my patch. > > Sorry for asking this late. > > The "X86_TUNE_AVX512_MOVE_BY_PIECES" is deprecated in staging. > My understanding is that X86_TUNE_AVX512_MOVE_BY_PIECES is now used to tell > the compiler to use 512 vectors for "memset" as well as "memcpy". Also, this > tuning (move_by_pieces) can also lead to STLF stalls, when there are narrow > store and then this tuning subsequently creates wider loads. > > How do I avoid these stalls but still ask GCC to use 512 bit "memset" for my > targets. You can use -mprefer-vector-width=512 to enable zmm for both vectorizer and memcpy/memset. 512-bit vector performance is good insce SPR.(We are planning to enable it by default.) > > Regards, > Venkat. > > > > > > > vectorization but of course also from (non-contigous) scalar stores > > > > performed by user code. > > > > > > > > I'm always advising CPU implementors to reduce STLF fail penalties > > > > (or allow more forwarding cases). On the compilers side we'd have > > > > to see to do less "local" (per stmt) decisions on what modes to use > > > > but take into account context - the post-fact fixup by the > > > > avoid-store-forwarding pass is likely not very effective? > > > I think the pass can't handle inter-procedure pointer analysis, and > > > there're > > memcpy and vectorization in caller/callee with different size, worst case is > > small store in the caller followed by big load in the callee. > > > > True, IIRC this happened in 538.imagick_r at some point. > > > > There's also __attribute__((target)), so caller/callee could disagree about > > vector > > ISA (and/or size preference). But I guess that's not happening too often > > to be of > > concern. > > > > Richard. > > > > > > > > > > > gcc/ChangeLog: > > > > > > > > > > * config/i386/x86-tune.def (X86_TUNE_AVX512_MOVE_BY_PIECES): > > > > > (X86_TUNE_AVX512_STORE_BY_PIECES): > > > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > > > * gcc.target/i386/pieces-memcpy-18.c: Use -mtune=znver5 > > > > > instead of -mtune=sapphirerapids. > > > > > * gcc.target/i386/pieces-memcpy-21.c: Ditto. > > > > > * gcc.target/i386/pieces-memset-46.c: Ditto. > > > > > * gcc.target/i386/pieces-memset-49.c: Ditto. > > > > > --- > > > > > gcc/config/i386/x86-tune.def | 6 ++---- > > > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c | 2 +- > > > > > gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c | 2 +- > > > > > gcc/testsuite/gcc.target/i386/pieces-memset-46.c | 2 +- > > > > > gcc/testsuite/gcc.target/i386/pieces-memset-49.c | 2 +- > > > > > 5 files changed, 6 insertions(+), 8 deletions(-) > > > > > > > > > > diff --git a/gcc/config/i386/x86-tune.def > > > > > b/gcc/config/i386/x86-tune.def index a86cbad281c..8e642aa92a2 > > > > > 100644 > > > > > --- a/gcc/config/i386/x86-tune.def > > > > > +++ b/gcc/config/i386/x86-tune.def > > > > > @@ -625,14 +625,12 @@ DEF_TUNE > > > > (X86_TUNE_AVX256_STORE_BY_PIECES, > > > > > "avx256_store_by_pieces", > > > > > /* X86_TUNE_AVX512_MOVE_BY_PIECES: Optimize move_by_pieces > > with > > > > 512-bit > > > > > AVX instructions. */ > > > > > DEF_TUNE (X86_TUNE_AVX512_MOVE_BY_PIECES, > > > > "avx512_move_by_pieces", > > > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D > > > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5) > > > > > + m_ZNVER4 | m_ZNVER5) > > > > > > > > > > /* X86_TUNE_AVX512_STORE_BY_PIECES: Optimize store_by_pieces with > > > > 512-bit > > > > > AVX instructions. */ > > > > > DEF_TUNE (X86_TUNE_AVX512_STORE_BY_PIECES, > > > > "avx512_store_by_pieces", > > > > > - m_SAPPHIRERAPIDS | m_GRANITERAPIDS | m_GRANITERAPIDS_D > > > > > - | m_DIAMONDRAPIDS | m_ZNVER4 | m_ZNVER5) > > > > > + m_ZNVER4 | m_ZNVER5) > > > > > > > > > > /* X86_TUNE_AVX512_TWO_EPILOGUES: Use two vector epilogues for > > > > 512-bit > > > > > vectorized loops. */ > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > index b15a0db9ff0..b4995ac0598 100644 > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-18.c > > > > > @@ -1,5 +1,5 @@ > > > > > /* { dg-do compile } */ > > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */ > > > > > +/* { dg-options "-O2 -march=znver5" } */ > > > > > > > > > > extern char *dst, *src; > > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > index ef439f20f74..804a2989d64 100644 > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memcpy-21.c > > > > > @@ -1,5 +1,5 @@ > > > > > /* { dg-do compile } */ > > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" > > > > > } */ > > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */ > > > > > > > > > > extern char *dst, *src; > > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > index be1b054eed2..43d636ee3ff 100644 > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-46.c > > > > > @@ -1,5 +1,5 @@ > > > > > /* { dg-do compile } */ > > > > > -/* { dg-options "-O2 -march=sapphirerapids" } */ > > > > > +/* { dg-options "-O2 -march=znver5" } */ > > > > > > > > > > extern char *dst; > > > > > > > > > > diff --git a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > index ad43f89a9bd..ca4933ac1d8 100644 > > > > > --- a/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > +++ b/gcc/testsuite/gcc.target/i386/pieces-memset-49.c > > > > > @@ -1,5 +1,5 @@ > > > > > /* { dg-do compile } */ > > > > > -/* { dg-options "-O2 -mtune=sapphirerapids -march=x86-64 -mavx2" > > > > > } */ > > > > > +/* { dg-options "-O2 -mtune=znver5 -march=x86-64 -mavx2" } */ > > > > > > > > > > extern char *dst; > > > > > > > > > > -- > > > > > 2.34.1 > > > > >
-- BR, Hongtao
