From: Lili Cui <lili....@intel.com>

Hi Uros,

Thank you very much for providing detailed BKM to reproduce Linux kernel boot 
failure.  My patch and Matz's patch have this problem. We inserted a SUB 
between TEST and JLE, and the SUB changes the value of EFlags. The branch JLE 
here went wrong, and a null pointer appeared after passing through many 
functions. I created a small case for it and added it into my patch. I use lea 
instead of sub/add to fix this issue. Shrink-wrap also benefits from the lea 
instruction.

Real case:
Avoid inserting sub between test-je-jle to change EFlags, lea should be used 
here
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L11
        sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
        movq    %r13, 8(%rsp)
        movl    $1, %r13d
        jle     .L4

GCC change:
Emit lea in ix86_expand_prologue/ ix86_expand_epilogue, peephole2 will decide 
whether to replace lea with sub/add depending on the situation.


I learned Matz's patch, the effect we achieved is basically the same, although 
there are some differences in the implementation methods.

I incorporated some of Matz's good ideas into my patch.
1. Matz added a lot of assertions to detect errors in time, and I also added 
some in this patch.
2. Matz turned off shrink-wrap-separate judgments in some special scenarios, 
and I also added these judgments.

Still keep the difference:
1. The position of inserting SUB is different. Matz inserts SUB at the end of 
prolog, I put SUB at the first line of prologue. I keep my way here.
2. Matz handles both SSE and general registers, while I only handle general 
registers. Enabling SSE registers is easy, but the SSE part seems to be Windows 
ABI related. I have no way to test it yet, so I won't add it for now.
3.  After reloading, some registers were eliminated by optimization are still 
placed in regs_ever_live, resulting in redundant push/pop when generating 
prologue later, my patch will eliminate these redundant registers and keep the 
stack frame layout the same as before.

Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance 
regression was observed.
For O2 multi-copy :
511.povray_r improved by 2.8% on ZNVER5.
511.povray_r improved by 4% on EMR
511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE.

Bootstrapped & regtested on x86-64-pc-linux-gnu.
Use this patch to build the latest Linux kernel and boot successfully.

Thanks,
Lili.


This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
enable separate shrink wrapping for function prologues/epilogues in
x86.

When performing separate shrink wrapping, we choose to use mov instead
of push/pop, because using push/pop is more complicated to handle rsp
adjustment and may lose performance, so here we choose to use mov, which
has a small impact on code size, but guarantees performance.

Using mov means we need to use sub/add to maintain the stack frame. In
some special cases, we need to use lea to prevent affecting EFlags.

Avoid inserting sub between test-je-jle to change EFlags, lea should be
used here.

    foo:
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L11
        sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
        movq    %r13, 8(%rsp)
        movl    $1, %r13d
        jle     .L4

Tested against SPEC CPU 2017, this change always has a net-positive
effect on the dynamic instruction count.  See the following table for
the breakdown on how this reduces the number of dynamic instructions
per workload on a like-for-like (with/without this commit):

instruction count       base            with commit (commit-base)/commit
502.gcc_r               98666845943     96891561634     -1.80%
526.blender_r           6.21226E+11     6.12992E+11     -1.33%
520.omnetpp_r           1.1241E+11      1.11093E+11     -1.17%
500.perlbench_r         1271558717      1263268350      -0.65%
523.xalancbmk_r         2.20103E+11     2.18836E+11     -0.58%
531.deepsjeng_r         2.73591E+11     2.72114E+11     -0.54%
500.perlbench_r         64195557393     63881512409     -0.49%
541.leela_r             2.99097E+11     2.98245E+11     -0.29%
548.exchange2_r         1.27976E+11     1.27784E+11     -0.15%
527.cam4_r              88981458425     88887334679     -0.11%
554.roms_r              2.60072E+11     2.59809E+11     -0.10%

Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance 
regression was observed.

For O2 multi-copy :
511.povray_r improved by 2.8% on ZNVER5.
511.povray_r improved by 4% on EMR
511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE.

gcc/ChangeLog:

        * config/i386/i386-protos.h (ix86_get_separate_components):
        New function.
        (ix86_components_for_bb): Likewise.
        (ix86_disqualify_components): Likewise.
        (ix86_emit_prologue_components): Likewise.
        (ix86_emit_epilogue_components): Likewise.
        (ix86_set_handled_components): Likewise.
        * config/i386/i386.cc (save_regs_using_push_pop):
        Encapsulate code.
        (ix86_compute_frame_layout):
        Handle save_regs_using_push_pop.
        (ix86_emit_save_regs_using_mov):
        Skip registers that are wrapped separately.
        (pro_epilogue_adjust_stack): Emit insn without clobber.
        (ix86_expand_prologue): Likewise.
        (ix86_emit_restore_regs_using_mov): Likewise.
        (ix86_expand_epilogue): Likewise.
        (ix86_get_separate_components): New function.
        (ix86_components_for_bb): Likewise.
        (ix86_disqualify_components): Likewise.
        (ix86_emit_prologue_components): Likewise.
        (ix86_emit_epilogue_components): Likewise.
        (ix86_set_handled_components): Likewise.
        (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
        (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
        (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
        (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
        (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
        (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
        * config/i386/i386.h (struct machine_function):Add
        reg_is_wrapped_separately array for register wrapping
        information.

gcc/testsuite/ChangeLog:

        * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
        * gcc.target/i386/interrupt-16.c: Likewise.
        * gfortran.dg/guality/arg1.f90: Likewise.
        * g++.target/i386/shrink_wrap_separate.C: New test.
        * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.

Co-authored-by: Michael Matz <m...@suse.de>
---
 gcc/config/i386/i386-protos.h                 |   7 +
 gcc/config/i386/i386.cc                       | 321 +++++++++++++++---
 gcc/config/i386/i386.h                        |   1 +
 .../g++.target/i386/shrink_wrap_separate.C    |  24 ++
 gcc/testsuite/gcc.target/i386/interrupt-16.c  |   4 +-
 .../i386/shrink_wrap_separate_check_lea.c     |  29 ++
 .../gcc.target/x86_64/abi/callabi/leaf-2.c    |   2 +-
 gcc/testsuite/gfortran.dg/guality/arg1.f90    |   2 +-
 8 files changed, 343 insertions(+), 47 deletions(-)
 create mode 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
 create mode 100644 
gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c

diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
index 10863ab9e9d..86194b3d319 100644
--- a/gcc/config/i386/i386-protos.h
+++ b/gcc/config/i386/i386-protos.h
@@ -437,6 +437,13 @@ extern rtl_opt_pass *make_pass_align_tight_loops 
(gcc::context *);
 extern bool ix86_has_no_direct_extern_access;
 extern bool ix86_rpad_gate ();
 
+extern sbitmap ix86_get_separate_components (void);
+extern sbitmap ix86_components_for_bb (basic_block);
+extern void ix86_disqualify_components (sbitmap, edge, sbitmap, bool);
+extern void ix86_emit_prologue_components (sbitmap);
+extern void ix86_emit_epilogue_components (sbitmap);
+extern void ix86_set_handled_components (sbitmap);
+
 /* In i386-expand.cc.  */
 bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*,
                                   HOST_WIDE_INT*);
diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
index d48654a729a..8f2193efe29 100644
--- a/gcc/config/i386/i386.cc
+++ b/gcc/config/i386/i386.cc
@@ -6905,6 +6905,26 @@ ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
         && (nregs + aligned) >= 3;
 }
 
+/* Check if push/pop should be used to save/restore registers.  */
+static bool
+save_regs_using_push_pop (HOST_WIDE_INT to_allocate)
+{
+  return ((!to_allocate && cfun->machine->frame.nregs <= 1)
+         || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
+         /* If static stack checking is enabled and done with probes,
+            the registers need to be saved before allocating the frame.  */
+         || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
+         /* If stack clash probing needs a loop, then it needs a
+            scratch register.  But the returned register is only guaranteed
+            to be safe to use after register saves are complete.  So if
+            stack clash protections are enabled and the allocated frame is
+            larger than the probe interval, then use pushes to save
+            callee saved registers.  */
+         || (flag_stack_clash_protection
+             && !ix86_target_stack_probe ()
+             && to_allocate > get_probe_interval ()));
+}
+
 /* Fill structure ix86_frame about frame of currently computed function.  */
 
 static void
@@ -7189,20 +7209,7 @@ ix86_compute_frame_layout (void)
   /* Size prologue needs to allocate.  */
   to_allocate = offset - frame->sse_reg_save_offset;
 
-  if ((!to_allocate && frame->nregs <= 1)
-      || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
-       /* If static stack checking is enabled and done with probes,
-         the registers need to be saved before allocating the frame.  */
-      || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
-      /* If stack clash probing needs a loop, then it needs a
-        scratch register.  But the returned register is only guaranteed
-        to be safe to use after register saves are complete.  So if
-        stack clash protections are enabled and the allocated frame is
-        larger than the probe interval, then use pushes to save
-        callee saved registers.  */
-      || (flag_stack_clash_protection
-         && !ix86_target_stack_probe ()
-         && to_allocate > get_probe_interval ()))
+  if (save_regs_using_push_pop (to_allocate))
     frame->save_regs_using_mov = false;
 
   if (ix86_using_red_zone ()
@@ -7660,7 +7667,9 @@ ix86_emit_save_regs_using_mov (HOST_WIDE_INT cfa_offset)
   for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
     if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
       {
-        ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
+       /* Skip registers, already processed by shrink wrap separate.  */
+       if (!cfun->machine->reg_is_wrapped_separately[regno])
+         ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
        cfa_offset -= UNITS_PER_WORD;
       }
 }
@@ -7753,8 +7762,12 @@ pro_epilogue_adjust_stack (rtx dest, rtx src, rtx offset,
        add_frame_related_expr = true;
     }
 
+  if (crtl->shrink_wrapped_separate)
+  insn = emit_insn (gen_rtx_SET (dest, gen_rtx_PLUS (Pmode, src, addend)));
+  else
   insn = emit_insn (gen_pro_epilogue_adjust_stack_add
                    (Pmode, dest, src, addend));
+
   if (style >= 0)
     ix86_add_queued_cfa_restore_notes (insn);
 
@@ -9218,13 +9231,30 @@ ix86_expand_prologue (void)
         the stack frame saving one cycle of the prologue.  However, avoid
         doing this if we have to probe the stack; at least on x86_64 the
         stack probe can turn into a call that clobbers a red zone location. */
-      else if (ix86_using_red_zone ()
-              && (! TARGET_STACK_PROBE
-                  || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
+      else if ((ix86_using_red_zone ()
+               && (! TARGET_STACK_PROBE
+                   || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
+              || crtl->shrink_wrapped_separate)
        {
+         HOST_WIDE_INT allocate_offset;
+         if (crtl->shrink_wrapped_separate)
+           {
+             allocate_offset = m->fs.sp_offset - frame.stack_pointer_offset;
+
+             /* Adjust the total offset at the beginning of the function.  */
+             pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
+                                        GEN_INT (allocate_offset), -1,
+                                        m->fs.cfa_reg == stack_pointer_rtx);
+             m->fs.sp_offset = cfun->machine->frame.stack_pointer_offset;
+           }
+
          ix86_emit_save_regs_using_mov (frame.reg_save_offset);
-         cfun->machine->red_zone_used = true;
          int_registers_saved = true;
+
+         if (ix86_using_red_zone ()
+             && (! TARGET_STACK_PROBE
+                 || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
+           cfun->machine->red_zone_used = true;
        }
     }
 
@@ -9344,6 +9374,7 @@ ix86_expand_prologue (void)
       && flag_stack_clash_protection
       && !ix86_target_stack_probe ())
     {
+      gcc_assert (!crtl->shrink_wrapped_separate);
       ix86_adjust_stack_and_probe (allocate, int_registers_saved, false);
       allocate = 0;
     }
@@ -9354,6 +9385,8 @@ ix86_expand_prologue (void)
     {
       const HOST_WIDE_INT probe_interval = get_probe_interval ();
 
+      gcc_assert (!crtl->shrink_wrapped_separate);
+
       if (STACK_CHECK_MOVING_SP)
        {
          if (crtl->is_leaf
@@ -9410,12 +9443,14 @@ ix86_expand_prologue (void)
   else if (!ix86_target_stack_probe ()
           || frame.stack_pointer_offset < CHECK_STACK_LIMIT)
     {
+      gcc_assert (!crtl->shrink_wrapped_separate);
       pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
                                 GEN_INT (-allocate), -1,
                                 m->fs.cfa_reg == stack_pointer_rtx);
     }
   else
     {
+      gcc_assert (!crtl->shrink_wrapped_separate);
       rtx eax = gen_rtx_REG (Pmode, AX_REG);
       rtx r10 = NULL;
       const bool sp_is_cfa_reg = (m->fs.cfa_reg == stack_pointer_rtx);
@@ -9801,30 +9836,35 @@ ix86_emit_restore_regs_using_mov (HOST_WIDE_INT 
cfa_offset,
   for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
     if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, maybe_eh_return, 
true))
       {
-       rtx reg = gen_rtx_REG (word_mode, regno);
-       rtx mem;
-       rtx_insn *insn;
-
-       mem = choose_baseaddr (cfa_offset, NULL);
-       mem = gen_frame_mem (word_mode, mem);
-       insn = emit_move_insn (reg, mem);
 
-        if (m->fs.cfa_reg == crtl->drap_reg && regno == REGNO (crtl->drap_reg))
+       /* Skip registers, already processed by shrink wrap separate.  */
+       if (!cfun->machine->reg_is_wrapped_separately[regno])
          {
-           /* Previously we'd represented the CFA as an expression
-              like *(%ebp - 8).  We've just popped that value from
-              the stack, which means we need to reset the CFA to
-              the drap register.  This will remain until we restore
-              the stack pointer.  */
-           add_reg_note (insn, REG_CFA_DEF_CFA, reg);
-           RTX_FRAME_RELATED_P (insn) = 1;
+           rtx reg = gen_rtx_REG (word_mode, regno);
+           rtx mem;
+           rtx_insn *insn;
 
-           /* This means that the DRAP register is valid for addressing.  */
-           m->fs.drap_valid = true;
-         }
-       else
-         ix86_add_cfa_restore_note (NULL, reg, cfa_offset);
+           mem = choose_baseaddr (cfa_offset, NULL);
+           mem = gen_frame_mem (word_mode, mem);
+           insn = emit_move_insn (reg, mem);
 
+           if (m->fs.cfa_reg == crtl->drap_reg
+               && regno == REGNO (crtl->drap_reg))
+             {
+               /* Previously we'd represented the CFA as an expression
+                  like *(%ebp - 8).  We've just popped that value from
+                  the stack, which means we need to reset the CFA to
+                  the drap register.  This will remain until we restore
+                  the stack pointer.  */
+               add_reg_note (insn, REG_CFA_DEF_CFA, reg);
+               RTX_FRAME_RELATED_P (insn) = 1;
+
+               /* DRAP register is valid for addressing.  */
+               m->fs.drap_valid = true;
+             }
+           else
+             ix86_add_cfa_restore_note (NULL, reg, cfa_offset);
+         }
        cfa_offset -= UNITS_PER_WORD;
       }
 }
@@ -10103,10 +10143,11 @@ ix86_expand_epilogue (int style)
      less work than reloading sp and popping the register.  */
   else if (!sp_valid_at (frame.hfp_save_offset) && frame.nregs <= 1)
     restore_regs_via_mov = true;
-  else if (TARGET_EPILOGUE_USING_MOVE
-          && cfun->machine->use_fast_prologue_epilogue
-          && (frame.nregs > 1
-              || m->fs.sp_offset != reg_save_offset))
+  else if (crtl->shrink_wrapped_separate
+          || (TARGET_EPILOGUE_USING_MOVE
+              && cfun->machine->use_fast_prologue_epilogue
+              && (frame.nregs > 1
+                  || m->fs.sp_offset != reg_save_offset)))
     restore_regs_via_mov = true;
   else if (frame_pointer_needed
           && !frame.nregs
@@ -10120,6 +10161,9 @@ ix86_expand_epilogue (int style)
   else
     restore_regs_via_mov = false;
 
+  if (crtl->shrink_wrapped_separate)
+    gcc_assert (restore_regs_via_mov);
+
   if (restore_regs_via_mov || frame.nsseregs)
     {
       /* Ensure that the entire register save area is addressable via
@@ -10172,6 +10216,7 @@ ix86_expand_epilogue (int style)
       gcc_assert (m->fs.sp_offset == UNITS_PER_WORD);
       gcc_assert (!crtl->drap_reg);
       gcc_assert (!frame.nregs);
+      gcc_assert (!crtl->shrink_wrapped_separate);
     }
   else if (restore_regs_via_mov)
     {
@@ -10186,6 +10231,8 @@ ix86_expand_epilogue (int style)
          rtx sa = EH_RETURN_STACKADJ_RTX;
          rtx_insn *insn;
 
+         gcc_assert (!crtl->shrink_wrapped_separate);
+
          /* Stack realignment doesn't work with eh_return.  */
          if (crtl->stack_realign_needed)
            sorry ("Stack realignment not supported with "
@@ -28066,6 +28113,194 @@ ix86_cannot_copy_insn_p (rtx_insn *insn)
 #undef TARGET_DOCUMENTATION_NAME
 #define TARGET_DOCUMENTATION_NAME "x86"
 
+/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
+sbitmap
+ix86_get_separate_components (void)
+{
+  HOST_WIDE_INT offset, to_allocate;
+  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
+  bitmap_clear (components);
+  struct machine_function *m = cfun->machine;
+
+  offset = m->frame.stack_pointer_offset;
+  to_allocate = offset - m->frame.sse_reg_save_offset;
+
+  /* Shrink wrap separate uses MOV, which means APX PPX cannot be used.
+     Experiments show that APX PPX can speed up the prologue.  If the function
+     does not exit early during actual execution, then using APX PPX is faster.
+     If the function always exits early during actual execution, then shrink
+     wrap separate reduces the number of MOV (PUSH/POP) instructions actually
+     executed, thus speeding up execution.
+     foo:
+         movl    $1, %eax
+         testq   %rdi, %rdi
+         jne.L60
+         ret   ---> early return.
+     .L60:
+         subq    $88, %rsp     ---> belong to prologue.
+         xorl    %eax, %eax
+         movq    %rbx, 40 (%rsp) ---> belong to prologue.
+         movq    8 (%rdi), %rbx
+         movq    %rbp, 48 (%rsp) ---> belong to prologue.
+         movq    %rdi, %rbp
+         testq   %rbx, %rbx
+         jne.L61
+         movq    40 (%rsp), %rbx
+         movq    48 (%rsp), %rbp
+         addq    $88, %rsp
+         ret
+     .L61:
+         movq    %r12, 56 (%rsp) ---> belong to prologue.
+         movq    %r13, 64 (%rsp) ---> belong to prologue.
+         movq    %r14, 72 (%rsp) ---> belong to prologue.
+     ... ...
+
+     It is a trade-off.  Disable shrink wrap separate when PPX is enabled.  */
+  if ((TARGET_APX_PPX && !crtl->calls_eh_return)
+      || cfun->machine->func_type != TYPE_NORMAL
+      || TARGET_SEH
+      || crtl->stack_realign_needed
+      || m->call_ms2sysv)
+    return components;
+
+  /* Since shrink wrapping separate uses MOV instead of PUSH/POP.
+     Disable shrink wrap separate when MOV is prohibited.  */
+  if (save_regs_using_push_pop (to_allocate))
+    return components;
+
+  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
+      {
+       /* Skip registers with large offsets, where a pseudo may be needed.  */
+       if (IN_RANGE (offset, -0x8000, 0x7fff))
+         bitmap_set_bit (components, regno);
+       offset += UNITS_PER_WORD;
+      }
+
+  /* Don't mess with the following registers.  */
+  if (frame_pointer_needed)
+    bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM);
+
+  if (crtl->drap_reg)
+    bitmap_clear_bit (components, REGNO (crtl->drap_reg));
+
+  if (pic_offset_table_rtx)
+    bitmap_clear_bit (components, REAL_PIC_OFFSET_TABLE_REGNUM);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB.  */
+sbitmap
+ix86_components_for_bb (basic_block bb)
+{
+  bitmap in = DF_LIVE_IN (bb);
+  bitmap gen = &DF_LIVE_BB_INFO (bb)->gen;
+  bitmap kill = &DF_LIVE_BB_INFO (bb)->kill;
+
+  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
+  bitmap_clear (components);
+
+  function_abi_aggregator callee_abis;
+  rtx_insn *insn;
+  FOR_BB_INSNS (bb, insn)
+    if (CALL_P (insn))
+      callee_abis.note_callee_abi (insn_callee_abi (insn));
+  HARD_REG_SET extra_caller_saves = callee_abis.caller_save_regs (*crtl->abi);
+
+  /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets.  */
+  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (!fixed_regs[regno]
+       && (TEST_HARD_REG_BIT (extra_caller_saves, regno)
+           || bitmap_bit_p (in, regno)
+           || bitmap_bit_p (gen, regno)
+           || bitmap_bit_p (kill, regno)))
+      bitmap_set_bit (components, regno);
+
+  return components;
+}
+
+/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS.  */
+void
+ix86_disqualify_components (sbitmap, edge, sbitmap, bool)
+{
+  /* Nothing to do for x86.  */
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS.  */
+void
+ix86_emit_prologue_components (sbitmap components)
+{
+  HOST_WIDE_INT cfa_offset;
+  struct machine_function *m = cfun->machine;
+
+  cfa_offset = m->frame.reg_save_offset + m->fs.sp_offset
+              - m->frame.stack_pointer_offset;
+  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
+      {
+       if (bitmap_bit_p (components, regno))
+         ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
+       cfa_offset -= UNITS_PER_WORD;
+      }
+}
+
+/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS.  */
+void
+ix86_emit_epilogue_components (sbitmap components)
+{
+  HOST_WIDE_INT cfa_offset;
+  struct machine_function *m = cfun->machine;
+  cfa_offset = m->frame.reg_save_offset + m->fs.sp_offset
+              - m->frame.stack_pointer_offset;
+
+  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
+      {
+       if (bitmap_bit_p (components, regno))
+         {
+           rtx reg = gen_rtx_REG (word_mode, regno);
+           rtx mem;
+           rtx_insn *insn;
+
+           mem = choose_baseaddr (cfa_offset, NULL);
+           mem = gen_frame_mem (word_mode, mem);
+           insn = emit_move_insn (reg, mem);
+
+           RTX_FRAME_RELATED_P (insn) = 1;
+           add_reg_note (insn, REG_CFA_RESTORE, reg);
+         }
+       cfa_offset -= UNITS_PER_WORD;
+      }
+}
+
+void
+ix86_set_handled_components (sbitmap components)
+{
+  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
+    if (bitmap_bit_p (components, regno))
+      {
+       cfun->machine->reg_is_wrapped_separately[regno] = true;
+       cfun->machine->use_fast_prologue_epilogue = true;
+       cfun->machine->frame.save_regs_using_mov = true;
+      }
+}
+
+#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
+#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS ix86_get_separate_components
+#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
+#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB ix86_components_for_bb
+#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
+#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS ix86_disqualify_components
+#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
+  ix86_emit_prologue_components
+#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
+#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
+  ix86_emit_epilogue_components
+#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
+#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS ix86_set_handled_components
+
 struct gcc_target targetm = TARGET_INITIALIZER;
 
 #include "gt-i386.h"
diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
index d32d9ad997e..5d4fd333169 100644
--- a/gcc/config/i386/i386.h
+++ b/gcc/config/i386/i386.h
@@ -2821,6 +2821,7 @@ struct GTY(()) machine_function {
   /* Cached initial frame layout for the current function.  */
   struct ix86_frame frame;
 
+  bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER];
   /* For -fsplit-stack support: A stack local which holds a pointer to
      the stack arguments for a function with a variable number of
      arguments.  This is set at the start of the function and is used
diff --git a/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C 
b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
new file mode 100644
index 00000000000..31bd984df02
--- /dev/null
+++ b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
@@ -0,0 +1,24 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+typedef struct a b;
+typedef double c;
+struct a {
+  b *d;
+  b *e;
+};
+struct f {
+  c g;
+};
+inline bool h(c i, b *m) {
+  b *j = m->e;
+  for (; m->e; j = j->d)
+    if (h(i, j))
+      return 0;
+  return m;
+}
+bool k() {
+  f *l;
+  b *n;
+ return h(l->g, n);
+}
+/* { dg-final { scan-rtl-dump "The components we wrap separately are \\\[sep 
40 41 42 43\\\]" "pro_and_epilogue" } } */
diff --git a/gcc/testsuite/gcc.target/i386/interrupt-16.c 
b/gcc/testsuite/gcc.target/i386/interrupt-16.c
index cb45ba54e3d..ca4441b3aee 100644
--- a/gcc/testsuite/gcc.target/i386/interrupt-16.c
+++ b/gcc/testsuite/gcc.target/i386/interrupt-16.c
@@ -18,5 +18,5 @@ foo (int i)
 /* { dg-final { scan-assembler-not "(push|pop)(l|q)\[\\t \]*%(r|e)bp" } } */
 /* { dg-final { scan-assembler-not "(push|pop)l\[\\t \]*%edi" { target ia32 } 
} } */
 /* { dg-final { scan-assembler-not "(push|pop)q\[\\t \]*%r\[0-9\]+" { target { 
! ia32 } } } } */
-/* { dg-final { scan-assembler-times "pushq\[\\t \]*%rdi" 1 { target { ! ia32 
} } } } */
-/* { dg-final { scan-assembler-times "popq\[\\t \]*%rdi" 1 { target { ! ia32 } 
} } } */
+/* { dg-final { scan-assembler-times "(pushq.*%rdi|subq.*\\\$8,.*%rsp)" 1 { 
target { ! ia32 } } } } */
+/* { dg-final { scan-assembler-times "(popq.*%rdi|addq.*\\\$8,.*%rsp)" 1 { 
target { ! ia32 } } } } */
diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c 
b/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
new file mode 100644
index 00000000000..0f2449f68b6
--- /dev/null
+++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
@@ -0,0 +1,29 @@
+/* { dg-do compile { target { ! ia32 } } } */
+/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */
+
+/* Avoid inserting sub between test-je-jle to change EFlags, lea should be 
used here
+        xorl    %eax, %eax
+        testl   %edi, %edi
+        je      .L11
+        sub     $16, %rsp  ------> leaq    -16(%rsp), %rsp
+        movq    %r13, 8(%rsp)
+        movl    $1, %r13d
+        jle     .L4
+*/
+int foo (int num)
+{
+  if (!num)
+    return 0;
+
+  register int r13 __asm ("r13") = 1;
+
+  for ( int i = 0; i < num; i++)
+    {
+      register int r12 __asm ("r12") = 1;
+      asm volatile ("" : "+r" (r12), "+r" (r13));
+    }
+
+  return 1;
+}
+/* { dg-final { scan-rtl-dump "The components we wrap separately are \\\[sep 
40\\\]" "pro_and_epilogue" } } */
+/* { dg-final { scan-assembler "leaq.*(%rsp)" } } */
diff --git a/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c 
b/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c
index 5f3d3e166af..46fc4648dbd 100644
--- a/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c
+++ b/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c
@@ -1,5 +1,5 @@
 /* { dg-do compile } */
-/* { dg-options "-O2 -fno-tree-vectorize -mabi=sysv" } */
+/* { dg-options "-O2 -fno-tree-vectorize -mabi=sysv -fno-shrink-wrap-separate" 
} */
 
 extern int glb1, gbl2, gbl3;
 
diff --git a/gcc/testsuite/gfortran.dg/guality/arg1.f90 
b/gcc/testsuite/gfortran.dg/guality/arg1.f90
index 332a4ed1d87..775b7bb304f 100644
--- a/gcc/testsuite/gfortran.dg/guality/arg1.f90
+++ b/gcc/testsuite/gfortran.dg/guality/arg1.f90
@@ -1,5 +1,5 @@
 ! { dg-do run }
-! { dg-options "-g" }
+! { dg-options "-fno-shrink-wrap -g" }
   integer :: a(10), b(12)
   call sub (a, 10)
   call sub (b, 12)
-- 
2.34.1

Reply via email to