> > From: Lili Cui <lili....@intel.com>
> >
> > Hi Uros,
> >
> > Thank you very much for providing a detailed BKM to reproduce the Linux
> > kernel boot failure.  Both my patch and Matz's patch have this problem: we
> > inserted a SUB between TEST and JLE, and the SUB clobbers EFLAGS.  The JLE
> > branch then goes the wrong way, and a null pointer dereference surfaces
> > after passing through many functions.  I created a small test case for it
> > and added it to my patch.  I now use LEA instead of SUB/ADD to fix this
> > issue; shrink-wrapping also benefits from the LEA instruction.
> 
> Thanks for your thorough testing! I think that introducing a boot test for
> the Linux kernel (besides the SPEC tests) in your testing is a great
> addition, and I would recommend performing it especially for an invasive
> change like this. In addition, Intel has continuous integration testing for
> Linux kernel patches; perhaps something similar can be used to boot and
> test Linux with GCC changes.
> 
Agree, your suggestion makes sense.

> > Real case:
> > Avoid inserting a SUB between TEST-JE-JLE, which would clobber EFLAGS;
> > LEA should be used here:
> >         xorl    %eax, %eax
> >         testl   %edi, %edi
> >         je      .L11
> >         sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
> 
> FAOD, you mean to insert SUB between TESTL and JE? The above case is not
> problematic, because flags are produced and consumed without being clobbered.
> 
> >         movq    %r13, 8(%rsp)
> >         movl    $1, %r13d
> >         jle     .L4
> >

Not between TESTL and JE, but between TESTL and JLE.

Before shrink-wrapping, the sequence is:

         push %r13
         push ...
         xorl   %eax, %eax
         testl  %edi, %edi
         je      .L11   ------> early return
         movl   $1, %r13d
         jle     .L4

Shrink-wrap-separate replaces "push %r13" with "sub $16, %rsp" + "movq
%r13, 8(%rsp)" and inserts them before "movl $1, %r13d".  "sub $16, %rsp"
clobbers SF (among other flags), so the JLE takes the wrong branch.

> > GCC change:
> > Emit LEA in ix86_expand_prologue/ix86_expand_epilogue; peephole2 will
> > decide whether to replace the LEA with SUB/ADD depending on the situation.
> >
> >
> > I studied Matz's patch; the effect we achieve is basically the same,
> > although there are some differences in the implementation.
> >
> > I incorporated some of Matz's good ideas into my patch.
> > 1. Matz added a lot of assertions to catch errors early, and I also
> > added some in this patch.
> > 2. Matz disabled shrink-wrap-separate in some special scenarios, and I
> > also added these checks.
> >
> > Remaining differences:
> > 1. The position of the inserted SUB is different: Matz inserts the SUB
> > at the end of the prologue; I put the SUB on the first line of the
> > prologue.  I keep my approach here.
> > 2. Matz handles both SSE and general registers, while I only handle
> > general registers.  Enabling SSE registers is easy, but the SSE part
> > seems to be Windows ABI related.  I have no way to test it yet, so I
> > won't add it for now.
> > 3. After reload, some registers that were eliminated by optimization are
> > still marked in regs_ever_live, resulting in redundant push/pop when the
> > prologue is generated later; my patch eliminates these redundant
> > registers and keeps the stack frame layout the same as before.
> >
> > Collected SPEC CPU 2017 performance on ZNVER5, EMR and ICELAKE. No
> > performance regression was observed.
> > For -O2 multi-copy:
> > 511.povray_r improved by 2.8% on ZNVER5.
> > 511.povray_r improved by 4% on EMR.
> > 511.povray_r improved by 3.3% ~ 4.6% on ICELAKE.
> >
> > Bootstrapped & regtested on x86-64-pc-linux-gnu.
> > The latest Linux kernel builds with this patch and boots successfully.
> >
> > Thanks,
> > Lili.
> >
> >
> > This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
> > enable separate shrink wrapping for function prologues/epilogues in
> > x86.
> >
> > When performing separate shrink wrapping, we choose to use MOV instead
> > of PUSH/POP: with PUSH/POP the RSP adjustment is more complicated to
> > handle and may cost performance, so we use MOV, which has a small impact
> > on code size but preserves performance.
> >
> > Using MOV means we need SUB/ADD to maintain the stack frame.  In some
> > special cases we need to use LEA instead, to avoid clobbering EFLAGS.
> >
> > Avoid inserting a SUB between TEST-JE-JLE, which would clobber EFLAGS;
> > LEA should be used here:
> >
> >     foo:
> >         xorl    %eax, %eax
> >         testl   %edi, %edi
> >         je      .L11
> >         sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
> 
> Also here, is the placement of SUB correct?
> 
Yes, it is before JLE.

> >         movq    %r13, 8(%rsp)
> >         movl    $1, %r13d
> >         jle     .L4
> >
> > Tested against SPEC CPU 2017, this change always has a net-positive
> > effect on the dynamic instruction count.  See the following table for
> > the breakdown of how it reduces the number of dynamic instructions
> > per workload in a like-for-like comparison (with/without this commit):
> >
> > instruction count       base            with commit     (commit-base)/base
> > 502.gcc_r               98666845943     96891561634     -1.80%
> > 526.blender_r           6.21226E+11     6.12992E+11     -1.33%
> > 520.omnetpp_r           1.1241E+11      1.11093E+11     -1.17%
> > 500.perlbench_r         1271558717      1263268350      -0.65%
> > 523.xalancbmk_r         2.20103E+11     2.18836E+11     -0.58%
> > 531.deepsjeng_r         2.73591E+11     2.72114E+11     -0.54%
> > 500.perlbench_r         64195557393     63881512409     -0.49%
> > 541.leela_r             2.99097E+11     2.98245E+11     -0.29%
> > 548.exchange2_r         1.27976E+11     1.27784E+11     -0.15%
> > 527.cam4_r              88981458425     88887334679     -0.11%
> > 554.roms_r              2.60072E+11     2.59809E+11     -0.10%
> >
> > Collected SPEC CPU 2017 performance on ZNVER5, EMR and ICELAKE. No
> > performance regression was observed.
> >
> > For -O2 multi-copy:
> > 511.povray_r improved by 2.8% on ZNVER5.
> > 511.povray_r improved by 4% on EMR.
> > 511.povray_r improved by 3.3% ~ 4.6% on ICELAKE.
> >
> > gcc/ChangeLog:
> >
> >         * config/i386/i386-protos.h (ix86_get_separate_components):
> >         New function.
> >         (ix86_components_for_bb): Likewise.
> >         (ix86_disqualify_components): Likewise.
> >         (ix86_emit_prologue_components): Likewise.
> >         (ix86_emit_epilogue_components): Likewise.
> >         (ix86_set_handled_components): Likewise.
> >         * config/i386/i386.cc (save_regs_using_push_pop):
> >         Encapsulate code.
> 
> Better say: "Split from ix86_compute_frame_layout."
Done.

> 
> >         (ix86_compute_frame_layout):
> >         Handle save_regs_using_push_pop.
> 
> Use save_regs_using_push_pop.
Done.

> 
> >         (ix86_emit_save_regs_using_mov):
> >         Skip registers that are wrapped separately.
> >         (pro_epilogue_adjust_stack): Emit insn without clobber.
> >         (ix86_expand_prologue): Likewise.
> 
> Looks like "Likewise" applies to ix86_emit_save_regs_using_mov. There are 
> several
> other changes in ix86_expand_prologue. Please document them in the ChangeLog
> entry.
> 
> >         (ix86_emit_restore_regs_using_mov): Likewise.
> >         (ix86_expand_epilogue): Likewise.
> 
> Also in the above two functions.
> 
Done.

> >         (ix86_get_separate_components): New function.
> >         (ix86_components_for_bb): Likewise.
> >         (ix86_disqualify_components): Likewise.
> >         (ix86_emit_prologue_components): Likewise.
> >         (ix86_emit_epilogue_components): Likewise.
> >         (ix86_set_handled_components): Likewise.
> >         (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
> >         (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
> >         (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
> >         (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
> >         (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
> >         (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
> >         * config/i386/i386.h (struct machine_function): Add
> >         reg_is_wrapped_separately array for register wrapping
> >         information.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
> >         * gcc.target/i386/interrupt-16.c: Likewise.
> >         * gfortran.dg/guality/arg1.f90: Likewise.
> >         * g++.target/i386/shrink_wrap_separate.C: New test.
> >         * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.
> >
> > Co-authored-by: Michael Matz <m...@suse.de>
> > ---
> >  gcc/config/i386/i386-protos.h                 |   7 +
> >  gcc/config/i386/i386.cc                       | 321 +++++++++++++++---
> >  gcc/config/i386/i386.h                        |   1 +
> >  .../g++.target/i386/shrink_wrap_separate.C    |  24 ++
> >  gcc/testsuite/gcc.target/i386/interrupt-16.c  |   4 +-
> >  .../i386/shrink_wrap_separate_check_lea.c     |  29 ++
> >  .../gcc.target/x86_64/abi/callabi/leaf-2.c    |   2 +-
> >  gcc/testsuite/gfortran.dg/guality/arg1.f90    |   2 +-
> >  8 files changed, 343 insertions(+), 47 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> >  create mode 100644 gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
> >
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 10863ab9e9d..86194b3d319 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -437,6 +437,13 @@ extern rtl_opt_pass *make_pass_align_tight_loops (gcc::context *);
> >  extern bool ix86_has_no_direct_extern_access;
> >  extern bool ix86_rpad_gate ();
> >
> > +extern sbitmap ix86_get_separate_components (void);
> > +extern sbitmap ix86_components_for_bb (basic_block);
> > +extern void ix86_disqualify_components (sbitmap, edge, sbitmap, bool);
> > +extern void ix86_emit_prologue_components (sbitmap);
> > +extern void ix86_emit_epilogue_components (sbitmap);
> > +extern void ix86_set_handled_components (sbitmap);
> > +
> >  /* In i386-expand.cc.  */
> >  bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*,
> >                                    HOST_WIDE_INT*);
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d48654a729a..8f2193efe29 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -6905,6 +6905,26 @@ ix86_pro_and_epilogue_can_use_push2pop2 (int
> nregs)
> >          && (nregs + aligned) >= 3;
> >  }
> >
> > +/* Check if push/pop should be used to save/restore registers.  */
> > +static bool
> > +save_regs_using_push_pop (HOST_WIDE_INT to_allocate)
> > +{
> > +  return ((!to_allocate && cfun->machine->frame.nregs <= 1)
> > +         || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> > +         /* If static stack checking is enabled and done with probes,
> > +            the registers need to be saved before allocating the frame.  */
> > +         || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> > +         /* If stack clash probing needs a loop, then it needs a
> > +            scratch register.  But the returned register is only guaranteed
> > +            to be safe to use after register saves are complete.  So if
> > +            stack clash protections are enabled and the allocated frame is
> > +            larger than the probe interval, then use pushes to save
> > +            callee saved registers.  */
> > +         || (flag_stack_clash_protection
> > +             && !ix86_target_stack_probe ()
> > +             && to_allocate > get_probe_interval ()));
> > +}
> > +
> >  /* Fill structure ix86_frame about frame of currently computed
> > function.  */
> >
> >  static void
> > @@ -7189,20 +7209,7 @@ ix86_compute_frame_layout (void)
> >    /* Size prologue needs to allocate.  */
> >    to_allocate = offset - frame->sse_reg_save_offset;
> >
> > -  if ((!to_allocate && frame->nregs <= 1)
> > -      || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> > -       /* If static stack checking is enabled and done with probes,
> > -         the registers need to be saved before allocating the frame.  */
> > -      || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> > -      /* If stack clash probing needs a loop, then it needs a
> > -        scratch register.  But the returned register is only guaranteed
> > -        to be safe to use after register saves are complete.  So if
> > -        stack clash protections are enabled and the allocated frame is
> > -        larger than the probe interval, then use pushes to save
> > -        callee saved registers.  */
> > -      || (flag_stack_clash_protection
> > -         && !ix86_target_stack_probe ()
> > -         && to_allocate > get_probe_interval ()))
> > +  if (save_regs_using_push_pop (to_allocate))
> >      frame->save_regs_using_mov = false;
> >
> >    if (ix86_using_red_zone ()
> > @@ -7660,7 +7667,9 @@ ix86_emit_save_regs_using_mov (HOST_WIDE_INT
> cfa_offset)
> >    for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> >      if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> >        {
> > -        ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> > +       /* Skip registers already processed by shrink-wrap-separate.  */
> > +       if (!cfun->machine->reg_is_wrapped_separately[regno])
> > +         ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> >         cfa_offset -= UNITS_PER_WORD;
> >        }
> >  }
> > @@ -7753,8 +7762,12 @@ pro_epilogue_adjust_stack (rtx dest, rtx src, rtx
> offset,
> >         add_frame_related_expr = true;
> >      }
> >
> > +  if (crtl->shrink_wrapped_separate)
> > +  insn = emit_insn (gen_rtx_SET (dest, gen_rtx_PLUS (Pmode, src, addend)));
> 
> Please use ix86_expand_binary_operator here, it will correctly generate LEA 
> during
> pro/epilogue generation. Also, please check indenting.
> 
Ok.

> > +  else
> >    insn = emit_insn (gen_pro_epilogue_adjust_stack_add
> >                     (Pmode, dest, src, addend));
> > +
> >    if (style >= 0)
> >      ix86_add_queued_cfa_restore_notes (insn);
> >
> > @@ -9218,13 +9231,30 @@ ix86_expand_prologue (void)
> >          the stack frame saving one cycle of the prologue.  However, avoid
> >          doing this if we have to probe the stack; at least on x86_64 the
> >          stack probe can turn into a call that clobbers a red zone location. */
> > -      else if (ix86_using_red_zone ()
> > -              && (! TARGET_STACK_PROBE
> > -                  || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > +      else if ((ix86_using_red_zone ()
> > +               && (! TARGET_STACK_PROBE
> > +                   || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > +              || crtl->shrink_wrapped_separate)
> >         {
> > +         HOST_WIDE_INT allocate_offset;
> > +         if (crtl->shrink_wrapped_separate)
> > +           {
> > +             allocate_offset = m->fs.sp_offset - frame.stack_pointer_offset;
> > +
> > +             /* Adjust the total offset at the beginning of the function.  */
> > +             pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
> > +                                        GEN_INT (allocate_offset), -1,
> > +                                        m->fs.cfa_reg == stack_pointer_rtx);
> > +             m->fs.sp_offset = cfun->machine->frame.stack_pointer_offset;
> > +           }
> > +
> >           ix86_emit_save_regs_using_mov (frame.reg_save_offset);
> > -         cfun->machine->red_zone_used = true;
> >           int_registers_saved = true;
> > +
> > +         if (ix86_using_red_zone ()
> > +             && (! TARGET_STACK_PROBE
> > +                 || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > +           cfun->machine->red_zone_used = true;
> >         }
> >      }
> >
> > @@ -9344,6 +9374,7 @@ ix86_expand_prologue (void)
> >        && flag_stack_clash_protection
> >        && !ix86_target_stack_probe ())
> >      {
> > +      gcc_assert (!crtl->shrink_wrapped_separate);
> 
> Please put some vertical space after gcc_assert, so it will stand out more...
> 
Done.

> >        ix86_adjust_stack_and_probe (allocate, int_registers_saved, false);
> >        allocate = 0;
> >      }
> > @@ -9354,6 +9385,8 @@ ix86_expand_prologue (void)
> >      {
> >        const HOST_WIDE_INT probe_interval = get_probe_interval ();
> >
> > +      gcc_assert (!crtl->shrink_wrapped_separate);
> > +
> >        if (STACK_CHECK_MOVING_SP)
> >         {
> >           if (crtl->is_leaf
> > @@ -9410,12 +9443,14 @@ ix86_expand_prologue (void)
> >    else if (!ix86_target_stack_probe ()
> >            || frame.stack_pointer_offset < CHECK_STACK_LIMIT)
> >      {
> > +      gcc_assert (!crtl->shrink_wrapped_separate);
> 
> ... also here and in other places.
Done.

> > +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> > +sbitmap
> > +ix86_get_separate_components (void)
> > +{
> > +  HOST_WIDE_INT offset, to_allocate;
> > +  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
> > +  bitmap_clear (components);
> > +  struct machine_function *m = cfun->machine;
> > +
> > +  offset = m->frame.stack_pointer_offset;
> > +  to_allocate = offset - m->frame.sse_reg_save_offset;
> > +
> > +  /* Shrink wrap separate uses MOV, which means APX PPX cannot be used.
> > +     Experiments show that APX PPX can speed up the prologue.  If the 
> > function
> > +     does not exit early during actual execution, then using APX PPX is 
> > faster.
> > +     If the function always exits early during actual execution, then 
> > shrink
> > +     wrap separate reduces the number of MOV (PUSH/POP) instructions 
> > actually
> > +     executed, thus speeding up execution.
> > +     foo:
> > +         movl    $1, %eax
> > +         testq   %rdi, %rdi
> > +         jne     .L60
> > +         ret   ---> early return.
> > +     .L60:
> > +         subq    $88, %rsp     ---> belong to prologue.
> > +         xorl    %eax, %eax
> > +         movq    %rbx, 40 (%rsp) ---> belong to prologue.
> > +         movq    8 (%rdi), %rbx
> > +         movq    %rbp, 48 (%rsp) ---> belong to prologue.
> > +         movq    %rdi, %rbp
> > +         testq   %rbx, %rbx
> > +         jne     .L61
> > +         movq    40 (%rsp), %rbx
> > +         movq    48 (%rsp), %rbp
> > +         addq    $88, %rsp
> > +         ret
> > +     .L61:
> > +         movq    %r12, 56 (%rsp) ---> belong to prologue.
> > +         movq    %r13, 64 (%rsp) ---> belong to prologue.
> > +         movq    %r14, 72 (%rsp) ---> belong to prologue.
> 
> Please remove leading spaces before tab in the above assembly. "git diff" 
> will mark
> this part in red.
> 
Done.

> > +     ... ...
> > +
> > +     It is a trade-off.  Disable shrink wrap separate when PPX is
> > + enabled.  */
> 
> No need to reiterate that this is a trade-off. You already explained it above 
> in great
> detail.
> 
Done.

> Please put a one line comment as is the case with the above functions.
> 
Done.

> > +void
> > +ix86_set_handled_components (sbitmap components)
> > +{
> > +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> > +    if (bitmap_bit_p (components, regno))
> > +      {
> > +       cfun->machine->reg_is_wrapped_separately[regno] = true;
> > +       cfun->machine->use_fast_prologue_epilogue = true;
> > +       cfun->machine->frame.save_regs_using_mov = true;
> > +      }
> > +}
> > +
> > +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS \
> > +  ix86_get_separate_components
> > +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
> > +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB ix86_components_for_bb
> > +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS \
> > +  ix86_disqualify_components
> > +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
> > +  ix86_emit_prologue_components
> > +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
> > +  ix86_emit_epilogue_components
> > +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS \
> > +  ix86_set_handled_components
> > +
> >  struct gcc_target targetm = TARGET_INITIALIZER;
> >
> >  #include "gt-i386.h"
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index d32d9ad997e..5d4fd333169 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -2821,6 +2821,7 @@ struct GTY(()) machine_function {
> >    /* Cached initial frame layout for the current function.  */
> >    struct ix86_frame frame;
> 
> A short comment would be nice here...
Done.

> > +  bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER];
> 
> ... and some vertical space here.
> 
Done.

> >    /* For -fsplit-stack support: A stack local which holds a pointer to
> >       the stack arguments for a function with a variable number of
> >       arguments.  This is set at the start of the function and is used
> > diff --git a/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > new file mode 100644
> > index 00000000000..31bd984df02
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > @@ -0,0 +1,24 @@
> > +/* { dg-do compile { target { ! ia32 } } } */
> 
> Does the new functionality also handle 32-bit targets? If so, please use
> two scan-rtl-dumps, one for ia32 and the other for ! ia32. Don't disable
> the test just because of a different string in the dump.
> 
You are right, I will fix this.

Thanks,
Lili.
