On Tue, Jun 17, 2025 at 8:54 PM Cui, Lili <lili....@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: H.J. Lu <hjl.to...@gmail.com>
> > Sent: Monday, June 16, 2025 10:08 PM
> > To: Jan Hubicka <hubi...@ucw.cz>
> > Cc: Uros Bizjak <ubiz...@gmail.com>; Cui, Lili <lili....@intel.com>;
> > gcc-patc...@gcc.gnu.org; Liu, Hongtao <hongtao....@intel.com>;
> > mjgu...@gmail.com
> > Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for
> > -mtune=generic
> >
> > On Mon, Jun 16, 2025 at 12:19 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> > >
> > > >
> > > > Perhaps someone is interested in the following thread from LKML:
> > > >
> > > > "[PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined
> > ops"
> > > >
> > > > https://lore.kernel.org/lkml/20250605164733.737543-1-mjguzik@gmail.com/
> > > >
> > > > There are several PRs regarding memcpy/memset linked from the above
> > > > message.
> > > >
> > > > Please also note a message from Linus from the above thread:
> > > >
> > > > https://lore.kernel.org/lkml/CAHk-=wg1qQLWKPyvxxZnXwboT48--LKJuCJjF8phdhrxv0u...@mail.gmail.com/
> > >
> > > This is my understanding of the situation.
> > > Please correct me where I am wrong.
> > >
> > > According to Linus, the calls in the kernel are more expensive than
> > > elsewhere due to mitigations.  I wonder if -minline-all-stringops
> > > would make sense here.
> > >
> > > Linus writes about the alternate entry point for memcpy with a
> > > non-standard calling convention, which we also discussed a few times
> > > in the past.  I think having a calling convention for memset/memcpy
> > > that only clobbers SI/DI/CX and nothing else (especially no SSE regs)
> > > makes sense.
> > >
> > > This should make offlined memcpy noticeably cheaper, especially when
> > > called from loops that need SSE, and the implementation can be done w/o
> > > clobbering extra registers for small blocks, while it will have enough
> > > time to spill for large ones.
> > >
> > > The other patch does
> > > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > for non-native CPUs (so something we should fix for generic tuning).
> > >
> > > This is about our current default of rep stosq, which does not work
> > > well on Intel hardware.  We use a loop for blocks up to 32 bytes and
> > > rep stosq up to 8k.
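
For reference, each comma-separated component of those -mmemcpy-strategy /
-mmemset-strategy values is an alg:max_size:align triple, with -1 meaning no
upper size bound.  A minimal sketch of the intended effect, assuming the
documented option syntax (the function names and sizes here are made up for
illustration):

    /* Compile with, e.g.:
       gcc -O2 -mtune=generic \
           -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
           -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign  */
    #include <string.h>

    void
    copy_small (char *dst, const char *src)
    {
      /* Known size <= 256: expected to be expanded inline as an unrolled
         loop without aligning the destination.  */
      memcpy (dst, src, 128);
    }

    void
    copy_any (char *dst, const char *src, size_t n)
    {
      /* Unknown size falls into the libcall:-1:noalign bucket, so this is
         expected to stay a call to the out-of-line memcpy.  */
      memcpy (dst, src, n);
    }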
> > >
> > > We now have X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for Intel cores,
> > > but no changes for generic yet (it is on my TODO to do some more
> > > testing on Zen).
> > >
> > > So I think we can do following:
> > >   1) decide whether to go with X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB
> > >      or replace rep_prefix_8_byte with unrolled_loop
> > >   2) fix the issue with repeated constants, i.e. instead of
> > >
> > >        movq $0, ....
> > >        movq $0, ....
> > >        ....
> > >        movq $0, ....
> > >       which we currently generate for memset fitting in CLEAR_RATIO, emit
> > >        mov $0, tmpreg
> > >        movq tmpreg, ....
> > >        movq tmpreg, ....
> > >        ....
> > >        movq tmpreg, ....
> > >       which will make memset sequences smaller.  I agree with Richi
> > >       that HJ's patch that adds a new clear block expander is probably
> > >       not the right place for solving the problem.
> > >
> > >       Ideally we should catch repeated constants more generally since
> > >       this appears elsewhere too.
> > >       I am not quite sure where to fit it best.  We already have a
> > >       machine specific task that loads 0 into an SSE register, which is
> > >       kind of similar to this as well.
> > >   3) Figure out what are reasonable MOVE_RATIO/CLEAR_RATIO defaults
> > >   4) Possibly go with the entry point idea?
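
To make point 2 above concrete, a minimal C example of the kind of code that
currently expands into repeated "movq $0, ..." stores when the clear fits in
CLEAR_RATIO (the struct is made up for illustration):

    struct small { long a, b, c, d; };   /* 32 bytes */

    void
    clear (struct small *s)
    {
      /* When this is expanded with scalar stores, it currently becomes four
         "movq $0, ..." instructions, each carrying a 32-bit immediate;
         loading zero into a temporary register once and storing that
         register four times gives shorter encodings and thus a smaller
         memset sequence.  */
      __builtin_memset (s, 0, sizeof *s);
    }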
Considering the test results on microbenchmarks and actual workloads,
the increase in code size is small (so there should be little impact on
the icache and the processor front end), and in practice both scalar
moves and SSE moves are better than rep stos for sizes below a certain
threshold.  So maybe we will just adopt H.J.'s new patch.
Any thoughts?

> > > Honza
> >
> > Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can you
> > measure its performance impact on Intel and AMD CPUs?
> >
>
> Option: -march=x86-64-v3 -mtune=generic -O2
> For spec-femflow (git clone https://github.com/kronbichler/spec-femflow.git):
> the speed improved by 38% on ZNVER5 and 26% on ADL (P-core), and the code
> size increased by 0.29%.
> For CPU2017: for ZNVER5 and ICELAKE there is almost no impact on performance
> and code size; only the code size of 520.omnetpp increased by 1.8%.
> For the latest Linux kernel: the code size increased by 0.32%.
>
> Lili.
>
> > The updated generic has
> >
> > Update memcpy and memset inline strategies for -mtune=generic:
> >
> > 1. Don't align memory.
> > 2. For known sizes, unroll loop with 4 moves or stores per iteration
> >    without aligning the loop, up to 256 bytes.
> > 3. For unknown sizes, use memcpy/memset.
> > 4. Since each loop iteration has 4 stores, and zeroing with the unrolled
> >    loop may need 8 stores, change CLEAR_RATIO to 10 so that zeroing of up
> >    to 72 bytes is fully unrolled with 9 stores without SSE.
> >
> > Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
> > with the fixed epilogue size to enable overlapping moves and stores.
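
A minimal sketch of what the updated generic strategy is expected to do for
fixed-size calls (the functions are illustrative; exact instruction selection
may of course differ by ISA level):

    #include <string.h>

    void
    zero72 (char *p)
    {
      /* 72 bytes is within the new CLEAR_RATIO of 10: expected to be fully
         unrolled into 9 scalar 8-byte stores, with no SSE and no rep stosq.  */
      memset (p, 0, 72);
    }

    void
    copy200 (char *dst, const char *src)
    {
      /* Known size <= 256 bytes: expected to become an unrolled loop with
         4 moves per iteration plus an overlapping move_by_pieces epilogue,
         without aligning the loop.  */
      memcpy (dst, src, 200);
    }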
> >
> > gcc/
> >
> > PR target/102294
> > PR target/119596
> > PR target/119703
> > PR target/119704
> > * builtins.cc (builtin_memset_gen_str): Make it global.
> > * builtins.h (builtin_memset_gen_str): New.
> > * config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> > move_by_pieces.
> > (expand_setmem_epilogue): Use store_by_pieces.
> > (ix86_expand_set_or_cpymem): Pass val_exp, instead of vec_promoted_val,
> > to expand_setmem_epilogue.
> > * config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> > (generic_memset): Likewise.
> > (generic_cost): Change CLEAR_RATIO to 10.
> >
> > gcc/testsuite/
> >
> > PR target/102294
> > PR target/119596
> > PR target/119703
> > PR target/119704
> > * gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> > * gcc.target/i386/auto-init-padding-9.c: Expect loop.
> > * gcc.target/i386/memcpy-strategy-12.c: New test.
> > * gcc.target/i386/memcpy-strategy-13.c: Likewise.
> > * gcc.target/i386/memset-strategy-25.c: Likewise.
> > * gcc.target/i386/memset-strategy-26.c: Likewise.
> > * gcc.target/i386/memset-strategy-27.c: Likewise.
> > * gcc.target/i386/memset-strategy-28.c: Likewise.
> > * gcc.target/i386/memset-strategy-29.c: Likewise.
> > * gcc.target/i386/memset-strategy-30.c: Likewise.
> > * gcc.target/i386/memset-strategy-31.c: Likewise.
> > * gcc.target/i386/mvc17.c: Fail with "rep mov".
> > * gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail with
> > "rep mov".
> > * gcc.target/i386/shrink_wrap_1.c: Also pass
> > -mmemset-strategy=rep_8byte:-1:align.
> > * gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
> >
> >
> > --
> > H.J.



-- 
BR,
Hongtao
