On Sat, Jun 28, 2025 at 12:24 AM Cui, Lili <lili....@intel.com> wrote: > > > > > -----Original Message----- > > From: Cui, Lili > > Sent: Friday, June 27, 2025 5:04 PM > > To: H.J. Lu <hjl.to...@gmail.com> > > Cc: ubiz...@gmail.com; gcc-patches@gcc.gnu.org; Liu, Hongtao > > <hongtao....@intel.com>; richard.guent...@gmail.com; Michael Matz > > <m...@suse.de>; Sam James <s...@gentoo.org>; kenjin4...@gmail.com > > Subject: RE: [PATCH V3] x86: Enable separate shrink wrapping > > > > > -----Original Message----- > > > From: H.J. Lu <hjl.to...@gmail.com> > > > Sent: Friday, June 27, 2025 4:48 PM > > > To: Cui, Lili <lili....@intel.com> > > > Cc: ubiz...@gmail.com; gcc-patches@gcc.gnu.org; Liu, Hongtao > > > <hongtao....@intel.com>; richard.guent...@gmail.com; Michael Matz > > > <m...@suse.de>; Sam James <s...@gentoo.org>; kenjin4...@gmail.com > > > Subject: Re: [PATCH V3] x86: Enable separate shrink wrapping > > > > > > On Tue, Jun 17, 2025 at 10:04 PM Cui, Lili <lili....@intel.com> wrote: > > > > > > > > From: Lili Cui <lili....@intel.com> > > > > > > > > Hi Uros, > > > > > > > > This is patch v3, the main changes are as follows. > > > > > > > > 1. Added a pro_epilogue_adjust_stack_add_nocc in i386.md to add > > > > memory > > > clobber for lea/mov. > > > > 2. Adjusted some formatting issues. > > > > 3. Added scan-rtl-dumps for ia32 in shrink_wrap_separate.C. > > > > > > > > Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No > > > performance regression was observed. > > > > For O2 multi-copy : > > > > 511.povray_r improved by 2.8% on ZNVER5. > > > > 511.povray_r improved by 4.2% on EMR > > > > > > > > Bootstrapped & regtested on x86-64-pc-linux-gnu. > > > > Use this patch to build the latest Linux kernel and boot successfully. > > > > > > > > Thanks, > > > > Lili. > > > > > > > > > > > > This commit implements the target macros (TARGET_SHRINK_WRAP_*) > > that > > > > enable separate shrink wrapping for function prologues/epilogues in > > > > x86. > > > > > > > > When performing separate shrink wrapping, we choose to use mov > > > > instead of push/pop, because using push/pop is more complicated to > > > > handle rsp adjustment and may lose performance, so here we choose to > > > > use mov, which has a small impact on code size, but guarantees > > performance. > > > > > > > > Using mov means we need to use sub/add to maintain the stack frame. > > > > In some special cases, we need to use lea to prevent affecting EFlags. > > > > > > > > Avoid inserting sub between test-je-jle to change EFlags, lea should > > > > be used here. > > > > > > > > foo: > > > > xorl %eax, %eax > > > > testl %edi, %edi > > > > je .L11 > > > > sub $16, %rsp ------> leaq -16(%rsp), %rsp > > > > movq %r13, 8(%rsp) > > > > movl $1, %r13d > > > > jle .L4 > > > > > > > > Tested against SPEC CPU 2017, this change always has a net-positive > > > > effect on the dynamic instruction count. See the following table > > > > for the breakdown on how this reduces the number of dynamic > > > > instructions per workload on a like-for-like (with/without this commit): > > > > > > > > instruction count base with commit (commit-base)/commit > > > > 502.gcc_r 98666845943 96891561634 -1.80% > > > > 526.blender_r 6.21226E+11 6.12992E+11 -1.33% > > > > 520.omnetpp_r 1.1241E+11 1.11093E+11 -1.17% > > > > 500.perlbench_r 1271558717 1263268350 -0.65% > > > > 523.xalancbmk_r 2.20103E+11 2.18836E+11 -0.58% > > > > 531.deepsjeng_r 2.73591E+11 2.72114E+11 -0.54% > > > > 500.perlbench_r 64195557393 63881512409 -0.49% > > > > 541.leela_r 2.99097E+11 2.98245E+11 -0.29% > > > > 548.exchange2_r 1.27976E+11 1.27784E+11 -0.15% > > > > 527.cam4_r 88981458425 88887334679 -0.11% > > > > 554.roms_r 2.60072E+11 2.59809E+11 -0.10% > > > > > > > > Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No > > > performance regression was observed. > > > > > > > > For O2 multi-copy : > > > > 511.povray_r improved by 2.8% on ZNVER5. > > > > 511.povray_r improved by 4% on EMR > > > > 511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE. > > > > > > > > gcc/ChangeLog: > > > > > > > > * config/i386/i386-protos.h (ix86_get_separate_components): > > > > New function. > > > > (ix86_components_for_bb): Likewise. > > > > (ix86_disqualify_components): Likewise. > > > > (ix86_emit_prologue_components): Likewise. > > > > (ix86_emit_epilogue_components): Likewise. > > > > (ix86_set_handled_components): Likewise. > > > > * config/i386/i386.cc (save_regs_using_push_pop): > > > > Split from ix86_compute_frame_layout. > > > > (ix86_compute_frame_layout): > > > > Use save_regs_using_push_pop. > > > > (pro_epilogue_adjust_stack): > > > > Use gen_pro_epilogue_adjust_stack_add_nocc. > > > > (ix86_expand_prologue): Add some assertions and adjust > > > > the stack frame at the beginning of the prolog for shrink > > > > wrapping separate. > > > > (ix86_emit_save_regs_using_mov): > > > > Skip registers that are wrapped separately. > > > > (ix86_emit_restore_regs_using_mov): Likewise. > > > > (ix86_expand_epilogue): Add some assertions and set > > > > restore_regs_via_mov to true for shrink wrapping separate. > > > > (ix86_get_separate_components): New function. > > > > (ix86_components_for_bb): Likewise. > > > > (ix86_disqualify_components): Likewise. > > > > (ix86_emit_prologue_components): Likewise. > > > > (ix86_emit_epilogue_components): Likewise. > > > > (ix86_set_handled_components): Likewise. > > > > (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define. > > > > (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise. > > > > (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise. > > > > (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise. > > > > (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise. > > > > (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise. > > > > * config/i386/i386.h (struct machine_function):Add > > > > reg_is_wrapped_separately array for register wrapping > > > > information. > > > > * config/i386/i386.md > > > > (@pro_epilogue_adjust_stack_add_nocc<mode>): New. > > > > > > > > gcc/testsuite/ChangeLog: > > > > > > > > * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test. > > > > * gcc.target/i386/interrupt-16.c: Likewise. > > > > * gfortran.dg/guality/arg1.f90: Likewise. > > > > * gcc.target/i386/avx10_2-comibf-1.c: Likewise. > > > > * g++.target/i386/shrink_wrap_separate.C: New test. > > > > * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise. > > > > > > > > Co-authored-by: Michael Matz <m...@suse.de> > > > > --- > > > > gcc/config/i386/i386-protos.h | 7 + > > > > gcc/config/i386/i386.cc | 332 +++++++++++++++--- > > > > gcc/config/i386/i386.h | 4 + > > > > gcc/config/i386/i386.md | 22 ++ > > > > .../g++.target/i386/shrink_wrap_separate.C | 25 ++ > > > > .../gcc.target/i386/avx10_2-comibf-1.c | 2 +- > > > > gcc/testsuite/gcc.target/i386/interrupt-16.c | 4 +- > > > > .../i386/shrink_wrap_separate_check_lea.c | 29 ++ > > > > .../gcc.target/x86_64/abi/callabi/leaf-2.c | 2 +- > > > > gcc/testsuite/gfortran.dg/guality/arg1.f90 | 2 +- > > > > 10 files changed, 379 insertions(+), 50 deletions(-) create mode > > > > 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C > > > > create mode 100644 > > > > gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c > > > > > > > > > > Hi Lili, > > > > > > Your patch caused: > > > > > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120840 > > > > > I checked the shrink wrap separate behavior and didn't see anything unusual. > This issue can be reproduced with the following options. It is not introduced > by my patch. I will continue investigating this issue. > > "-fno-shrink-wrap-separate -O2 -fno-stack-protector > -mtune-ctrl=prologue_using_move,epilogue_using_move " >
This one works for me. -- H.J.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc index c1c0cf9ff80..133b3680626 100644 --- a/gcc/config/i386/i386-expand.cc +++ b/gcc/config/i386/i386-expand.cc @@ -10383,6 +10383,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1, && !MMX_REGNO_P (i)) clobber_reg (&use, gen_rtx_REG (GET_MODE (regno_reg_rtx[i]), i)); + + cfun->machine->call_preserve_none_p = true; } if (vec_len > 1) diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc index 5e9559f5c05..fc57c4d3513 100644 --- a/gcc/config/i386/i386.cc +++ b/gcc/config/i386/i386.cc @@ -10201,6 +10201,7 @@ ix86_expand_epilogue (int style) restore_regs_via_mov = true; else if (crtl->shrink_wrapped_separate || (TARGET_EPILOGUE_USING_MOVE + && !cfun->machine->call_preserve_none_p && cfun->machine->use_fast_prologue_epilogue && (frame.nregs > 1 || m->fs.sp_offset != reg_save_offset))) @@ -28228,6 +28229,7 @@ ix86_get_separate_components (void) || cfun->machine->func_type != TYPE_NORMAL || TARGET_SEH || crtl->stack_realign_needed + || m->call_preserve_none_p || m->call_ms2sysv) return components; diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h index a275a32682e..69c6c7f819b 100644 --- a/gcc/config/i386/i386.h +++ b/gcc/config/i386/i386.h @@ -2937,6 +2937,10 @@ struct GTY(()) machine_function { /* The largest alignment, in bytes, of stack slot actually used. */ unsigned int max_used_stack_alignment; + /* True if function with preserve_none/no_callee_saved_registers + attribute is called. */ + BOOL_BITFIELD call_preserve_none_p : 1; + /* During prologue/epilogue generation, the current frame state. Otherwise, the frame state at the end of the prologue. */ struct machine_frame_state fs;