> -----Original Message-----
> From: H.J. Lu <hjl.to...@gmail.com>
> Sent: Monday, June 16, 2025 10:08 PM
> To: Jan Hubicka <hubi...@ucw.cz>
> Cc: Uros Bizjak <ubiz...@gmail.com>; Cui, Lili <lili....@intel.com>; gcc-patc...@gcc.gnu.org; Liu, Hongtao <hongtao....@intel.com>; mjgu...@gmail.com
> Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for -mtune=generic
> 
> On Mon, Jun 16, 2025 at 12:19 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > >
> > > Perhaps someone is interested in the following thread from LKML:
> > >
> > > "[PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined
> ops"
> > >
> > > https://lore.kernel.org/lkml/20250605164733.737543-1-mjguzik@gmail.com/
> > >
> > > There are several PRs regarding memcpy/memset linked from the above message.
> > >
> > > Please also note a message from Linus from the above thread:
> > >
> > > https://lore.kernel.org/lkml/CAHk-=wg1qQLWKPyvxxZnXwboT48--LKJuCJjF8phdhrxv0u...@mail.gmail.com/
> >
> > This is my understanding of the situation.
> > Please correct me where I am wrong.
> >
> > According to Linus, calls in the kernel are more expensive than
> > elsewhere due to mitigations.  I wonder if -minline-all-stringops
> > would make sense here.
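> >
> > A minimal illustration (a sketch only; -minline-all-stringops is an
> > existing option that forces GCC to expand string operations inline
> > even where it would otherwise emit a library call):
> >
> >   /* Build sketch: gcc -O2 -minline-all-stringops t.c
> >      The copy below is expanded inline by the active stringop
> >      strategy (e.g. as rep movs plus tail handling) instead of
> >      going through a mitigation-laden call/return path.  */
> >   #include <string.h>
> >
> >   void f(char *d, const char *s, unsigned long n)
> >   {
> >       memcpy(d, s, n);   /* inline expansion even for unknown n */
> >   }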
> >
> > Linus writes about an alternate entry point for memcpy with a
> > non-standard calling convention, which we have also discussed a few
> > times in the past.  I think having a calling convention for
> > memset/memcpy that only clobbers SI/DI/CX and nothing else
> > (especially no SSE regs) makes sense.
> >
> > This should make offlined memcpy noticeably cheaper, especially when
> > called from loops that need SSE.  The implementation can avoid
> > clobbering extra registers for small blocks, while for large ones it
> > will have enough time to spill.
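> >
> > A hedged sketch of how a caller could use such an entry point today,
> > via extended asm with a reduced clobber list; __memcpy_sidicx is an
> > assumed symbol, not an existing libc or kernel entry point:
> >
> >   /* Hypothetical: call an alternate memcpy entry point that is
> >      guaranteed to clobber only RSI/RDI/RCX, so SSE registers live
> >      across the call need not be spilled.  NB: a call from inline
> >      asm is only safe without the red zone (the kernel builds with
> >      -mno-red-zone).  */
> >   static inline void *memcpy_sidicx(void *dst, const void *src,
> >                                     unsigned long n)
> >   {
> >       void *ret = dst;
> >       asm ("call __memcpy_sidicx"
> >            : "+D" (dst), "+S" (src), "+c" (n)
> >            :
> >            : "memory", "cc");
> >       return ret;
> >   }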
> >
> > The other patch does
> >
> >   KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> >   KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> >
> > for non-native CPUs (so something we should fix for generic tuning).
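> >
> > For readers unfamiliar with the syntax: each option takes a
> > comma-separated list of alg:max_size:dest_align triplets, where the
> > last triplet must have a max_size of -1 (no upper bound).  So the
> > flags above mean "use an unrolled loop up to 256 bytes, otherwise
> > call the library function, and never emit alignment prologues".  A
> > quick way to see the effect:
> >
> >   /* Assumed compile line:
> >      gcc -O2 -S -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign t.c  */
> >   #include <string.h>
> >
> >   void clear256(char *p) { memset(p, 0, 256); }  /* unrolled loop  */
> >   void clear4k(char *p)  { memset(p, 0, 4096); } /* call to memset */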
> >
> > This is about our current default of rep stosq, which does not work
> > well on Intel hardware.  We use a loop for blocks up to 32 bytes and
> > rep stosq up to 8k.
> >
> > We now have X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for Intel cores,
> > but no changes for generic yet (it is on my TODO to do some more
> > testing on Zen).
> >
> > So I think we can do the following:
> >   1) decide whether to go with X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB
> >      or replace rep_prefix_8_byte with unrolled_loop
> >   2) fix the issue with repeated constants.  I.e. instead of
> >
> >        movq $0, ....
> >        movq $0, ....
> >        ....
> >        movq $0, ....
> >
> >      which we currently generate for memset fitting in CLEAR_RATIO, emit
> >
> >        mov $0, tmpreg
> >        movq tmpreg, ....
> >        movq tmpreg, ....
> >        ....
> >        movq tmpreg, ....
> >
> >      which will make memset sequences smaller.  I agree with Richi that
> >      HJ's patch that adds a new clear block expander is probably not the
> >      right place for solving the problem.
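> >
> >      A minimal reproducer (a sketch; it assumes an -mno-sse style
> >      build, as in the kernel, so scalar stores are used):
> >
> >        /* gcc -O2 -mno-sse: currently expands to four
> >           movq $0, N(%rdi) stores, each carrying a 4-byte zero
> >           immediate that the tmpreg form would avoid.  */
> >        struct s { long a, b, c, d; };
> >
> >        void zero(struct s *p)
> >        {
> >            __builtin_memset(p, 0, sizeof *p);
> >        }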
> >
> >      Ideally we should catch repeated constants more generally, since
> >      this appears elsewhere too.  I am not quite sure where it fits
> >      best.  We already have a machine specific pass that loads 0 into
> >      an SSE register, which is kind of similar to this as well.
> >   3) figure out reasonable MOVE_RATIO/CLEAR_RATIO defaults
> >   4) Possibly go with the entry point idea?
> > Honza
> 
> Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can you
> measure its performance impact on Intel and AMD CPUs?
> 

Options: -march=x86-64-v3 -mtune=generic -O2
For spec-femflow (git clone https://github.com/kronbichler/spec-femflow.git):
speed improved by 38% on ZNVER5 and 26% on ADL (P-core); code size
increased by 0.29%.
For CPU2017 on ZNVER5 and ICELAKE: almost no impact on performance and
code size; only the code size of 520.omnetpp increased by 1.8%.
For the latest Linux kernel: code size increased by 0.32%.

Lili.

> The updated generic tuning has the following changes:
> 
> Update memcpy and memset inline strategies for -mtune=generic:
> 
> 1. Don't align memory.
> 2. For known sizes, unroll loop with 4 moves or stores per iteration
>    without aligning the loop, up to 256 bytes.
> 3. For unknown sizes, use memcpy/memset.
> 4. Since each loop iteration has 4 stores, and zeroing with the
>    unrolled loop may need 8 stores, change CLEAR_RATIO to 10 so that
>    zeroing of up to 72 bytes (9 stores x 8 bytes) is fully unrolled
>    with 9 stores without SSE.
> 
> Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
> with a fixed epilogue size to enable overlapping moves and stores.
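>
> A hand-written illustration of the overlap trick that by-pieces
> epilogues rely on (illustration only, not the patch's code):
>
>   /* A fixed 24-byte tail can be cleared with two 16-byte stores
>      whose ranges overlap, instead of a branchy 16+8 (or smaller)
>      sequence.  Each memset below compiles to a single 16-byte
>      store at -O2.  */
>   #include <string.h>
>
>   void set_tail24(char *p)
>   {
>       memset(p, 0, 16);     /* bytes [0,16)                  */
>       memset(p + 8, 0, 16); /* bytes [8,24), overlaps [8,16) */
>   }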
> 
> gcc/
> 
> PR target/102294
> PR target/119596
> PR target/119703
> PR target/119704
> * builtins.cc (builtin_memset_gen_str): Make it global.
> * builtins.h (builtin_memset_gen_str): New.
> * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> move_by_pieces.
> (expand_setmem_epilogue): Use store_by_pieces.
> (ix86_expand_set_or_cpymem): Pass val_exp, instead of vec_promoted_val,
> to expand_setmem_epilogue.
> * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> (generic_memset): Likewise.
> (generic_cost): Change CLEAR_RATIO to 10.
> 
> gcc/testsuite/
> 
> PR target/102294
> PR target/119596
> PR target/119703
> PR target/119704
> * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> * gcc.target/i386/auto-init-padding-9.c: Expect loop.
> * gcc.target/i386/memcpy-strategy-12.c: New test.
> * gcc.target/i386/memcpy-strategy-13.c: Likewise.
> * gcc.target/i386/memset-strategy-25.c: Likewise.
> * gcc.target/i386/memset-strategy-26.c: Likewise.
> * gcc.target/i386/memset-strategy-27.c: Likewise.
> * gcc.target/i386/memset-strategy-28.c: Likewise.
> * gcc.target/i386/memset-strategy-29.c: Likewise.
> * gcc.target/i386/memset-strategy-30.c: Likewise.
> * gcc.target/i386/memset-strategy-31.c: Likewise.
> * gcc.target/i386/mvc17.c: Fail with "rep mov".
> * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail with "rep mov".
> * gcc.target/i386/shrink_wrap_1.c: Also pass -mmemset-strategy=rep_8byte:-1:align.
> * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> 
> 
> --
> H.J.
