On Tue, Jun 17, 2025 at 10:04 PM Cui, Lili <lili....@intel.com> wrote:
>
> From: Lili Cui <lili....@intel.com>
>
> Hi Uros,
>
> This is patch v3, the main changes are as follows.
>
> 1. Added a pro_epilogue_adjust_stack_add_nocc in i386.md to add memory 
> clobber for lea/mov.
> 2. Adjusted some formatting issues.
> 3. Added scan-rtl-dumps for ia32 in shrink_wrap_separate.C.
>
> Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance 
> regression was observed.
> For O2 multi-copy :
> 511.povray_r improved by 2.8% on ZNVER5.
> 511.povray_r improved by 4.2% on EMR
>
> Bootstrapped & regtested on x86-64-pc-linux-gnu.
> Use this patch to build the latest Linux kernel and boot successfully.
>
> Thanks,
> Lili.
>
>
> This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
> enable separate shrink wrapping for function prologues/epilogues in
> x86.
>
> When performing separate shrink wrapping, we choose to use mov instead
> of push/pop, because using push/pop is more complicated to handle rsp
> adjustment and may lose performance, so here we choose to use mov, which
> has a small impact on code size, but guarantees performance.
>
> Using mov means we need to use sub/add to maintain the stack frame. In
> some special cases, we need to use lea to prevent affecting EFlags.
>
> Avoid inserting sub between test-je-jle to change EFlags, lea should be
> used here.
>
>     foo:
>         xorl    %eax, %eax
>         testl   %edi, %edi
>         je      .L11
>         sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
>         movq    %r13, 8(%rsp)
>         movl    $1, %r13d
>         jle     .L4
>
> Tested against SPEC CPU 2017, this change always has a net-positive
> effect on the dynamic instruction count.  See the following table for
> the breakdown on how this reduces the number of dynamic instructions
> per workload on a like-for-like (with/without this commit):
>
> instruction count       base            with commit (commit-base)/commit
> 502.gcc_r               98666845943     96891561634     -1.80%
> 526.blender_r           6.21226E+11     6.12992E+11     -1.33%
> 520.omnetpp_r           1.1241E+11      1.11093E+11     -1.17%
> 500.perlbench_r         1271558717      1263268350      -0.65%
> 523.xalancbmk_r         2.20103E+11     2.18836E+11     -0.58%
> 531.deepsjeng_r         2.73591E+11     2.72114E+11     -0.54%
> 500.perlbench_r         64195557393     63881512409     -0.49%
> 541.leela_r             2.99097E+11     2.98245E+11     -0.29%
> 548.exchange2_r         1.27976E+11     1.27784E+11     -0.15%
> 527.cam4_r              88981458425     88887334679     -0.11%
> 554.roms_r              2.60072E+11     2.59809E+11     -0.10%
>
> Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance 
> regression was observed.
>
> For O2 multi-copy :
> 511.povray_r improved by 2.8% on ZNVER5.
> 511.povray_r improved by 4% on EMR
> 511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE.
>
> gcc/ChangeLog:
>
>         * config/i386/i386-protos.h (ix86_get_separate_components):
>         New function.
>         (ix86_components_for_bb): Likewise.
>         (ix86_disqualify_components): Likewise.
>         (ix86_emit_prologue_components): Likewise.
>         (ix86_emit_epilogue_components): Likewise.
>         (ix86_set_handled_components): Likewise.
>         * config/i386/i386.cc (save_regs_using_push_pop):
>         Split from ix86_compute_frame_layout.
>         (ix86_compute_frame_layout):
>         Use save_regs_using_push_pop.
>         (pro_epilogue_adjust_stack):
>         Use gen_pro_epilogue_adjust_stack_add_nocc.
>         (ix86_expand_prologue): Add some assertions and adjust
>         the stack frame at the beginning of the prolog for shrink
>         wrapping separate.
>         (ix86_emit_save_regs_using_mov):
>         Skip registers that are wrapped separately.
>         (ix86_emit_restore_regs_using_mov): Likewise.
>         (ix86_expand_epilogue): Add some assertions and set
>         restore_regs_via_mov to true for shrink wrapping separate.
>         (ix86_get_separate_components): New function.
>         (ix86_components_for_bb): Likewise.
>         (ix86_disqualify_components): Likewise.
>         (ix86_emit_prologue_components): Likewise.
>         (ix86_emit_epilogue_components): Likewise.
>         (ix86_set_handled_components): Likewise.
>         (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
>         (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
>         (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
>         (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
>         (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
>         (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
>         * config/i386/i386.h (struct machine_function):Add
>         reg_is_wrapped_separately array for register wrapping
>         information.
>         * config/i386/i386.md
>         (@pro_epilogue_adjust_stack_add_nocc<mode>): New.
>
> gcc/testsuite/ChangeLog:
>
>         * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
>         * gcc.target/i386/interrupt-16.c: Likewise.
>         * gfortran.dg/guality/arg1.f90: Likewise.
>         * gcc.target/i386/avx10_2-comibf-1.c: Likewise.
>         * g++.target/i386/shrink_wrap_separate.C: New test.
>         * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.
>
> Co-authored-by: Michael Matz <m...@suse.de>
> ---
>  gcc/config/i386/i386-protos.h                 |   7 +
>  gcc/config/i386/i386.cc                       | 332 +++++++++++++++---
>  gcc/config/i386/i386.h                        |   4 +
>  gcc/config/i386/i386.md                       |  22 ++
>  .../g++.target/i386/shrink_wrap_separate.C    |  25 ++
>  .../gcc.target/i386/avx10_2-comibf-1.c        |   2 +-
>  gcc/testsuite/gcc.target/i386/interrupt-16.c  |   4 +-
>  .../i386/shrink_wrap_separate_check_lea.c     |  29 ++
>  .../gcc.target/x86_64/abi/callabi/leaf-2.c    |   2 +-
>  gcc/testsuite/gfortran.dg/guality/arg1.f90    |   2 +-
>  10 files changed, 379 insertions(+), 50 deletions(-)
>  create mode 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
>  create mode 100644 
> gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
>

Hi Lili,

Your patch caused:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120840

The enclosed patch seemed to fix it.   It looks like x86 separate
shrink wrapping
doesn't properly restore all registers when function with preserve_none or
no_callee_saved_registers is called.   Can you find a run-time testcase and
fix it?

Thanks.

-- 
H.J.
diff --git a/gcc/config/i386/i386-expand.cc b/gcc/config/i386/i386-expand.cc
index 24b0a766265..b32d0228193 100644
--- a/gcc/config/i386/i386-expand.cc
+++ b/gcc/config/i386/i386-expand.cc
@@ -10393,6 +10393,8 @@ ix86_expand_call (rtx retval, rtx fnaddr, rtx callarg1,
 	    && !MMX_REGNO_P (i))
 	  clobber_reg (&use,
 		       gen_rtx_REG (GET_MODE (regno_reg_rtx[i]), i));
+
+      cfun->machine->call_preserve_none_p = true;
     }
 
   if (vec_len > 1)
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index 0dad62b5374..7e35f1d772e 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -28242,6 +28242,7 @@ ix86_get_separate_components (void)
       || cfun->machine->func_type != TYPE_NORMAL
       || TARGET_SEH
       || crtl->stack_realign_needed
+      || m->call_preserve_none_p
       || m->call_ms2sysv)
     return components;
 
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index 880fdbadd2f..a4240c66362 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2932,6 +2932,10 @@ struct GTY(()) machine_function {
   /* The largest alignment, in bytes, of stack slot actually used.  */
   unsigned int max_used_stack_alignment;
 
+  /* True if function with preserve_none/no_callee_saved_registers
+     attribute is called.  */
+  BOOL_BITFIELD call_preserve_none_p : 1;
+
   /* During prologue/epilogue generation, the current frame state.
      Otherwise, the frame state at the end of the prologue.  */
   struct machine_frame_state fs;

Reply via email to