> -----Original Message-----
> From: H.J. Lu <hjl.to...@gmail.com>
> Sent: Monday, June 16, 2025 10:08 PM
> To: Jan Hubicka <hubi...@ucw.cz>
> Cc: Uros Bizjak <ubiz...@gmail.com>; Cui, Lili <lili....@intel.com>;
> gcc-patc...@gcc.gnu.org; Liu, Hongtao <hongtao....@intel.com>;
> mjgu...@gmail.com
> Subject: [PATCH v3] x86: Update memcpy/memset inline strategies for
> -mtune=generic
>
> On Mon, Jun 16, 2025 at 12:19 AM Jan Hubicka <hubi...@ucw.cz> wrote:
> >
> > > Perhaps someone is interested in the following thread from LKML:
> > >
> > > "[PATCH v2] x86: prevent gcc from emitting rep movsq/stosq for inlined ops"
> > >
> > > https://lore.kernel.org/lkml/20250605164733.737543-1-mjguzik@gmail.com/
> > >
> > > There are several PRs regarding memcpy/memset linked from the above message.
> > >
> > > Please also note a message from Linus from the above thread:
> > >
> > > https://lore.kernel.org/lkml/CAHk-=wg1qQLWKPyvxxZnXwboT48--LKJuCJjF8phdhrxv0u...@mail.gmail.com/
> >
> > This is my understanding of the situation.
> > Please correct me where I am wrong.
> >
> > According to Linus, the calls in the kernel are more expensive than
> > elsewhere due to mitigations.  I wonder if -minline-all-stringops
> > would make sense here.
> >
> > Linus writes about an alternate entry point for memcpy with a
> > non-standard calling convention, which we also discussed a few times
> > in the past.  I think having a calling convention for memset/memcpy
> > that only clobbers SI/DI/CX and nothing else (especially no SSE regs)
> > makes sense.
> >
> > This should make offlined memcpy noticeably cheaper, especially when
> > called from loops that need SSE, and the implementation can be done
> > without clobbering extra registers for small blocks, while it will
> > have enough time to spill for large ones.
> >
> > The other patch does
> >
> > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> >
> > for non-native CPUs (so something we should fix for generic tuning).
> >
> > This is about our current default of rep stosq, which does not work
> > well on Intel hardware.  We do a loop for blocks up to 32 bytes and
> > rep stosq up to 8k.
> >
> > We now have X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB for Intel cores,
> > but no changes for generic yet (it is on my TODO to do some more
> > testing on Zen).
> >
> > So I think we can do the following:
> >
> > 1) Decide whether to go with X86_TUNE_PREFER_KNOWN_REP_MOVSB_STOSB
> >    or replace rep_prefix_8_byte by unrolled_loop.
> >
> > 2) Fix the issue with repeated constants.  I.e. instead of
> >
> >      movq $0, ....
> >      movq $0, ....
> >      ....
> >      movq $0, ....
> >
> >    which we currently generate for a memset fitting in CLEAR_RATIO,
> >    emit
> >
> >      mov $0, tmpreg
> >      movq tmpreg, ....
> >      movq tmpreg, ....
> >      ....
> >      movq tmpreg, ....
> >
> >    which will make memset sequences smaller.  I agree with Richi that
> >    HJ's patch adding a new clear-block expander is probably not the
> >    right place to solve the problem.
> >
> >    Ideally we should catch repeated constants more generally, since
> >    this appears elsewhere too.  I am not quite sure where it fits
> >    best.  We already have a machine-specific pass that loads 0 into
> >    an SSE register, which is kind of similar to this.
> >
> > 3) Figure out reasonable MOVE_RATIO/CLEAR_RATIO defaults.
> >
> > 4) Possibly go with the entry point idea?
> >
> > Honza
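For reference, each -mmemcpy-strategy/-mmemset-strategy triplet quoted
above has the form alg:max_size:dest_align, read left to right, and the
last triplet must use -1 as max_size to cover all remaining sizes.  A
minimal sketch of what the kernel's setting selects (the function names
are illustrative, and the exact expansion still depends on the rest of
the tuning):

  /* Compile with:
     gcc -O2 -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign  */
  #include <string.h>

  void
  copy_small (char *dst, const char *src)
  {
    /* 128 <= 256: expected to be inlined as an unrolled loop.  */
    memcpy (dst, src, 128);
  }

  void
  copy_large (char *dst, const char *src, size_t n)
  {
    /* Unknown size: expected to fall through to the libcall.  */
    memcpy (dst, src, n);
  }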
>
> Here is the v3 patch.  It no longer uses "rep mov/stos".  Lili, can
> you measure its performance impact on Intel and AMD CPUs?
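As a concrete example of the kind of code the v3 strategy changes (the
expected expansions follow the description further down and are not
verified compiler output):

  /* gcc -O2 -mtune=generic: with the old generic tuning a known
     1024-byte clear was expanded with rep stosq (used for blocks up
     to 8k); with the v3 strategy, known sizes up to 256 bytes are
     expected to get an unrolled loop and larger ones a memset call.  */
  void
  clear_1k (char *p)
  {
    __builtin_memset (p, 0, 1024);
  }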
Option: -march=x86-64-v3 -mtune=generic -O2

For spec-femflow (git clone https://github.com/kronbichler/spec-femflow.git):
the speed improved by 38% on ZNVER5 and 26% on ADL (P-core); the code
size increased by 0.29%.

For CPU2017: for ZNVER5 and ICELAKE there is almost no impact on
performance or code size; only the code size of 520.omnetpp increased
by 1.8%.

For the latest Linux kernel: the code size increased by 0.32%.

Lili.

> The updated generic has
>
> Update memcpy and memset inline strategies for -mtune=generic:
>
> 1. Don't align memory.
> 2. For known sizes, unroll the loop with 4 moves or stores per
>    iteration, without aligning the loop, up to 256 bytes.
> 3. For unknown sizes, use memcpy/memset.
> 4. Since each loop iteration has 4 stores, and 8 stores for zeroing
>    with the unrolled loop may be needed, change CLEAR_RATIO to 10 so
>    that zeroing up to 72 bytes is fully unrolled with 9 stores,
>    without SSE.
>
> Use move_by_pieces and store_by_pieces for memcpy and memset epilogues
> with the fixed epilogue size to enable overlapping moves and stores.
>
> gcc/
>
> 	PR target/102294
> 	PR target/119596
> 	PR target/119703
> 	PR target/119704
> 	* builtins.cc (builtin_memset_gen_str): Make it global.
> 	* builtins.h (builtin_memset_gen_str): New.
> 	* config/i386/i386-expand.cc (expand_cpymem_epilogue): Use
> 	move_by_pieces.
> 	(expand_setmem_epilogue): Use store_by_pieces.
> 	(ix86_expand_set_or_cpymem): Pass val_exp, instead of
> 	vec_promoted_val, to expand_setmem_epilogue.
> 	* config/i386/x86-tune-costs.h (generic_memcpy): Updated.
> 	(generic_memset): Likewise.
> 	(generic_cost): Change CLEAR_RATIO to 10.
>
> gcc/testsuite/
>
> 	PR target/102294
> 	PR target/119596
> 	PR target/119703
> 	PR target/119704
> 	* gcc.target/i386/auto-init-padding-3.c: Expect XMM stores.
> 	* gcc.target/i386/auto-init-padding-9.c: Expect loop.
> 	* gcc.target/i386/memcpy-strategy-12.c: New test.
> 	* gcc.target/i386/memcpy-strategy-13.c: Likewise.
> 	* gcc.target/i386/memset-strategy-25.c: Likewise.
> 	* gcc.target/i386/memset-strategy-26.c: Likewise.
> 	* gcc.target/i386/memset-strategy-27.c: Likewise.
> 	* gcc.target/i386/memset-strategy-28.c: Likewise.
> 	* gcc.target/i386/memset-strategy-29.c: Likewise.
> 	* gcc.target/i386/memset-strategy-30.c: Likewise.
> 	* gcc.target/i386/memset-strategy-31.c: Likewise.
> 	* gcc.target/i386/mvc17.c: Fail with "rep mov".
> 	* gcc.target/i386/pr111657-1.c: Scan for unrolled loop.  Fail
> 	with "rep mov".
> 	* gcc.target/i386/shrink_wrap_1.c: Also pass
> 	-mmemset-strategy=rep_8byte:-1:align.
> 	* gcc.target/i386/sw-1.c: Also pass -mstringop-strategy=rep_byte.
>
> --
> H.J.
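On the overlapping moves and stores mentioned above, here is a
hand-written sketch of the idea behind the store_by_pieces epilogue
(not GCC's actual expansion): a 13-byte tail can be cleared with two
overlapping 8-byte stores instead of an 8-, a 4-, and a 1-byte store.
memcpy is used to express the unaligned stores portably.

  #include <string.h>
  #include <stdint.h>

  static void
  clear_tail_13 (unsigned char *p)
  {
    uint64_t zero = 0;
    memcpy (p, &zero, 8);      /* bytes 0-7   */
    memcpy (p + 5, &zero, 8);  /* bytes 5-12, overlapping bytes 5-7 */
  }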