On Sat, Jun 28, 2025 at 12:24 AM Cui, Lili <lili....@intel.com> wrote:
>
>
>
> > -----Original Message-----
> > From: Cui, Lili
> > Sent: Friday, June 27, 2025 5:04 PM
> > To: H.J. Lu <hjl.to...@gmail.com>
> > Cc: ubiz...@gmail.com; gcc-patches@gcc.gnu.org; Liu, Hongtao
> > <hongtao....@intel.com>; richard.guent...@gmail.com; Michael Matz
> > <m...@suse.de>; Sam James <s...@gentoo.org>; kenjin4...@gmail.com
> > Subject: RE: [PATCH V3] x86: Enable separate shrink wrapping
> >
> > > -----Original Message-----
> > > From: H.J. Lu <hjl.to...@gmail.com>
> > > Sent: Friday, June 27, 2025 4:48 PM
> > > To: Cui, Lili <lili....@intel.com>
> > > Cc: ubiz...@gmail.com; gcc-patches@gcc.gnu.org; Liu, Hongtao
> > > <hongtao....@intel.com>; richard.guent...@gmail.com; Michael Matz
> > > <m...@suse.de>; Sam James <s...@gentoo.org>; kenjin4...@gmail.com
> > > Subject: Re: [PATCH V3] x86: Enable separate shrink wrapping
> > >
> > > On Tue, Jun 17, 2025 at 10:04 PM Cui, Lili <lili....@intel.com> wrote:
> > > >
> > > > From: Lili Cui <lili....@intel.com>
> > > >
> > > > Hi Uros,
> > > >
> > > > This is patch v3, the main changes are as follows.
> > > >
> > > > 1. Added a pro_epilogue_adjust_stack_add_nocc in i386.md to add
> > > > memory
> > > clobber for lea/mov.
> > > > 2. Adjusted some formatting issues.
> > > > 3. Added scan-rtl-dumps for ia32 in shrink_wrap_separate.C.
> > > >
> > > > Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No
> > > performance regression was observed.
> > > > For O2 multi-copy :
> > > > 511.povray_r improved by 2.8% on ZNVER5.
> > > > 511.povray_r improved by 4.2% on EMR
> > > >
> > > > Bootstrapped & regtested on x86-64-pc-linux-gnu.
> > > > Use this patch to build the latest Linux kernel and boot successfully.
> > > >
> > > > Thanks,
> > > > Lili.
> > > >
> > > >
> > > > This commit implements the target macros (TARGET_SHRINK_WRAP_*)
> > that
> > > > enable separate shrink wrapping for function prologues/epilogues in
> > > > x86.
> > > >
> > > > When performing separate shrink wrapping, we choose to use mov
> > > > instead of push/pop, because using push/pop is more complicated to
> > > > handle rsp adjustment and may lose performance, so here we choose to
> > > > use mov, which has a small impact on code size, but guarantees
> > performance.
> > > >
> > > > Using mov means we need to use sub/add to maintain the stack frame.
> > > > In some special cases, we need to use lea to prevent affecting EFlags.
> > > >
> > > > Avoid inserting sub between test-je-jle to change EFlags, lea should
> > > > be used here.
> > > >
> > > >     foo:
> > > >         xorl    %eax, %eax
> > > >         testl   %edi, %edi
> > > >         je      .L11
> > > >         sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
> > > >         movq    %r13, 8(%rsp)
> > > >         movl    $1, %r13d
> > > >         jle     .L4
> > > >
> > > > Tested against SPEC CPU 2017, this change always has a net-positive
> > > > effect on the dynamic instruction count.  See the following table
> > > > for the breakdown on how this reduces the number of dynamic
> > > > instructions per workload on a like-for-like (with/without this commit):
> > > >
> > > > instruction count       base            with commit (commit-base)/commit
> > > > 502.gcc_r               98666845943     96891561634     -1.80%
> > > > 526.blender_r           6.21226E+11     6.12992E+11     -1.33%
> > > > 520.omnetpp_r           1.1241E+11      1.11093E+11     -1.17%
> > > > 500.perlbench_r         1271558717      1263268350      -0.65%
> > > > 523.xalancbmk_r         2.20103E+11     2.18836E+11     -0.58%
> > > > 531.deepsjeng_r         2.73591E+11     2.72114E+11     -0.54%
> > > > 500.perlbench_r         64195557393     63881512409     -0.49%
> > > > 541.leela_r             2.99097E+11     2.98245E+11     -0.29%
> > > > 548.exchange2_r         1.27976E+11     1.27784E+11     -0.15%
> > > > 527.cam4_r              88981458425     88887334679     -0.11%
> > > > 554.roms_r              2.60072E+11     2.59809E+11     -0.10%
> > > >
> > > > Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No
> > > performance regression was observed.
> > > >
> > > > For O2 multi-copy :
> > > > 511.povray_r improved by 2.8% on ZNVER5.
> > > > 511.povray_r improved by 4% on EMR
> > > > 511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE.
> > > >
> > > > gcc/ChangeLog:
> > > >
> > > >         * config/i386/i386-protos.h (ix86_get_separate_components):
> > > >         New function.
> > > >         (ix86_components_for_bb): Likewise.
> > > >         (ix86_disqualify_components): Likewise.
> > > >         (ix86_emit_prologue_components): Likewise.
> > > >         (ix86_emit_epilogue_components): Likewise.
> > > >         (ix86_set_handled_components): Likewise.
> > > >         * config/i386/i386.cc (save_regs_using_push_pop):
> > > >         Split from ix86_compute_frame_layout.
> > > >         (ix86_compute_frame_layout):
> > > >         Use save_regs_using_push_pop.
> > > >         (pro_epilogue_adjust_stack):
> > > >         Use gen_pro_epilogue_adjust_stack_add_nocc.
> > > >         (ix86_expand_prologue): Add some assertions and adjust
> > > >         the stack frame at the beginning of the prolog for shrink
> > > >         wrapping separate.
> > > >         (ix86_emit_save_regs_using_mov):
> > > >         Skip registers that are wrapped separately.
> > > >         (ix86_emit_restore_regs_using_mov): Likewise.
> > > >         (ix86_expand_epilogue): Add some assertions and set
> > > >         restore_regs_via_mov to true for shrink wrapping separate.
> > > >         (ix86_get_separate_components): New function.
> > > >         (ix86_components_for_bb): Likewise.
> > > >         (ix86_disqualify_components): Likewise.
> > > >         (ix86_emit_prologue_components): Likewise.
> > > >         (ix86_emit_epilogue_components): Likewise.
> > > >         (ix86_set_handled_components): Likewise.
> > > >         (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
> > > >         (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
> > > >         (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
> > > >         (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
> > > >         (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
> > > >         (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
> > > >         * config/i386/i386.h (struct machine_function):Add
> > > >         reg_is_wrapped_separately array for register wrapping
> > > >         information.
> > > >         * config/i386/i386.md
> > > >         (@pro_epilogue_adjust_stack_add_nocc<mode>): New.
> > > >
> > > > gcc/testsuite/ChangeLog:
> > > >
> > > >         * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
> > > >         * gcc.target/i386/interrupt-16.c: Likewise.
> > > >         * gfortran.dg/guality/arg1.f90: Likewise.
> > > >         * gcc.target/i386/avx10_2-comibf-1.c: Likewise.
> > > >         * g++.target/i386/shrink_wrap_separate.C: New test.
> > > >         * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.
> > > >
> > > > Co-authored-by: Michael Matz <m...@suse.de>
> > > > ---
> > > >  gcc/config/i386/i386-protos.h                 |   7 +
> > > >  gcc/config/i386/i386.cc                       | 332 +++++++++++++++---
> > > >  gcc/config/i386/i386.h                        |   4 +
> > > >  gcc/config/i386/i386.md                       |  22 ++
> > > >  .../g++.target/i386/shrink_wrap_separate.C    |  25 ++
> > > >  .../gcc.target/i386/avx10_2-comibf-1.c        |   2 +-
> > > >  gcc/testsuite/gcc.target/i386/interrupt-16.c  |   4 +-
> > > >  .../i386/shrink_wrap_separate_check_lea.c     |  29 ++
> > > >  .../gcc.target/x86_64/abi/callabi/leaf-2.c    |   2 +-
> > > >  gcc/testsuite/gfortran.dg/guality/arg1.f90    |   2 +-
> > > >  10 files changed, 379 insertions(+), 50 deletions(-)  create mode
> > > > 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > > >  create mode 100644
> > > > gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
> > > >
> > >
> > > Hi Lili,
> > >
> > > Your patch caused:
> > >
> > > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120840
> > >
>
> I checked the shrink wrap separate behavior and didn't see anything unusual. 
> This issue can be reproduced with the following options. It is not introduced 
> by my patch. I will continue investigating this issue.
>
> "-fno-shrink-wrap-separate  -O2 -fno-stack-protector  
> -mtune-ctrl=prologue_using_move,epilogue_using_move "
>

This one works for me.

-- 
H.J.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index c1c0cf9ff80..133b3680626 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -10383,6 +10383,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
 	    && !MMX_REGNO_P (i))
 	  clobber_reg (&use,
 		       gen_rtx_REG (GET_MODE (regno_reg_rtx[i]), i));
+
+      cfun->machine->call_preserve_none_p = true;
     }
 
   if (vec_len > 1)
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 5e9559f5c05..fc57c4d3513 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -10201,6 +10201,7 @@ ix86_expand_epilogue (int style)
     restore_regs_via_mov = true;
   else if (crtl->shrink_wrapped_separate
 	   || (TARGET_EPILOGUE_USING_MOVE
+	       && !cfun->machine->call_preserve_none_p
 	       && cfun->machine->use_fast_prologue_epilogue
 	       && (frame.nregs > 1
 		   || m->fs.sp_offset != reg_save_offset)))
@@ -28228,6 +28229,7 @@ ix86_get_separate_components (void)
       || cfun->machine->func_type != TYPE_NORMAL
       || TARGET_SEH
       || crtl->stack_realign_needed
+      || m->call_preserve_none_p
       || m->call_ms2sysv)
     return components;
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index a275a32682e..69c6c7f819b 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2937,6 +2937,10 @@ struct GTY(()) machine_function {
   /* The largest alignment, in bytes, of stack slot actually used.  */
   unsigned int max_used_stack_alignment;
 
+  /* True if function with preserve_none/no_callee_saved_registers
+     attribute is called.  */
+  BOOL_BITFIELD call_preserve_none_p : 1;
+
   /* During prologue/epilogue generation, the current frame state.
      Otherwise, the frame state at the end of the prologue.  */
   struct machine_frame_state fs;

Reply via email to