On Wed, Jun 11, 2025 at 5:33 AM Cui, Lili <lili....@intel.com> wrote: > > From: Lili Cui <lili....@intel.com> > > Hi Uros, > > Thank you very much for providing detailed BKM to reproduce Linux kernel boot > failure. My patch and Matz's patch have this problem. We inserted a SUB > between TEST and JLE, and the SUB changes the value of EFlags. The branch JLE > here went wrong, and a null pointer appeared after passing through many > functions. I created a small case for it and added it into my patch. I use > lea instead of sub/add to fix this issue. Shrink-wrap also benefits from the > lea instruction.
Thanks for your thorough testing! I think that introducing a boot test for the linux kernel (besides SPEC tests) in your testing is a great addition, and I would recommend to perform it especially for an invasive change like this. In addition, Intel has a continuous integration testing for the linux kernel patches, perhaps something similar can be used to boot and test linux with GCC changes. > Real case: > Avoid inserting sub between test-je-jle to change EFlags, lea should be used > here > xorl %eax, %eax > testl %edi, %edi > je .L11 > sub $16, %rsp ------> leaq -16(%rsp), %rsp FAOD, you mean to insert SUB between TESTL and JE? The above case is not problematic, because flags are produced and consumed without being clobbered. > movq %r13, 8(%rsp) > movl $1, %r13d > jle .L4 > > GCC change: > Emit lea in ix86_expand_prologue/ ix86_expand_epilogue, peephole2 will decide > whether to replace lea with sub/add depending on the situation. > > > I learned Matz's patch, the effect we achieved is basically the same, > although there are some differences in the implementation methods. > > I incorporated some of Matz's good ideas into my patch. > 1. Matz added a lot of assertions to detect errors in time, and I also added > some in this patch. > 2. Matz turned off shrink-wrap-separate judgments in some special scenarios, > and I also added these judgments. > > Still keep the difference: > 1. The position of inserting SUB is different. Matz inserts SUB at the end of > prolog, I put SUB at the first line of prologue. I keep my way here. > 2. Matz handles both SSE and general registers, while I only handle general > registers. Enabling SSE registers is easy, but the SSE part seems to be > Windows ABI related. I have no way to test it yet, so I won't add it for now. > 3. After reloading, some registers were eliminated by optimization are still > placed in regs_ever_live, resulting in redundant push/pop when generating > prologue later, my patch will eliminate these redundant registers and keep > the stack frame layout the same as before. > > Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance > regression was observed. > For O2 multi-copy : > 511.povray_r improved by 2.8% on ZNVER5. > 511.povray_r improved by 4% on EMR > 511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE. > > Bootstrapped & regtested on x86-64-pc-linux-gnu. > Use this patch to build the latest Linux kernel and boot successfully. > > Thanks, > Lili. > > > This commit implements the target macros (TARGET_SHRINK_WRAP_*) that > enable separate shrink wrapping for function prologues/epilogues in > x86. > > When performing separate shrink wrapping, we choose to use mov instead > of push/pop, because using push/pop is more complicated to handle rsp > adjustment and may lose performance, so here we choose to use mov, which > has a small impact on code size, but guarantees performance. > > Using mov means we need to use sub/add to maintain the stack frame. In > some special cases, we need to use lea to prevent affecting EFlags. > > Avoid inserting sub between test-je-jle to change EFlags, lea should be > used here. > > foo: > xorl %eax, %eax > testl %edi, %edi > je .L11 > sub $16, %rsp ------> leaq -16(%rsp), %rsp Also here, is the placement of SUB correct? > movq %r13, 8(%rsp) > movl $1, %r13d > jle .L4 > > Tested against SPEC CPU 2017, this change always has a net-positive > effect on the dynamic instruction count. See the following table for > the breakdown on how this reduces the number of dynamic instructions > per workload on a like-for-like (with/without this commit): > > instruction count base with commit (commit-base)/commit > 502.gcc_r 98666845943 96891561634 -1.80% > 526.blender_r 6.21226E+11 6.12992E+11 -1.33% > 520.omnetpp_r 1.1241E+11 1.11093E+11 -1.17% > 500.perlbench_r 1271558717 1263268350 -0.65% > 523.xalancbmk_r 2.20103E+11 2.18836E+11 -0.58% > 531.deepsjeng_r 2.73591E+11 2.72114E+11 -0.54% > 500.perlbench_r 64195557393 63881512409 -0.49% > 541.leela_r 2.99097E+11 2.98245E+11 -0.29% > 548.exchange2_r 1.27976E+11 1.27784E+11 -0.15% > 527.cam4_r 88981458425 88887334679 -0.11% > 554.roms_r 2.60072E+11 2.59809E+11 -0.10% > > Collected spec2017 performance on ZNVER5, EMR and ICELAKE. No performance > regression was observed. > > For O2 multi-copy : > 511.povray_r improved by 2.8% on ZNVER5. > 511.povray_r improved by 4% on EMR > 511.povray_r improved by 3.3 % ~ 4.6% on ICELAKE. > > gcc/ChangeLog: > > * config/i386/i386-protos.h (ix86_get_separate_components): > New function. > (ix86_components_for_bb): Likewise. > (ix86_disqualify_components): Likewise. > (ix86_emit_prologue_components): Likewise. > (ix86_emit_epilogue_components): Likewise. > (ix86_set_handled_components): Likewise. > * config/i386/i386.cc (save_regs_using_push_pop): > Encapsulate code. Better say: "Split from ix86_compute_frame_layout." > (ix86_compute_frame_layout): > Handle save_regs_using_push_pop. Use save_regs_using_push_pop. > (ix86_emit_save_regs_using_mov): > Skip registers that are wrapped separately. > (pro_epilogue_adjust_stack): Emit insn without clobber. > (ix86_expand_prologue): Likewise. Looks like "Likewise" applies to ix86_emit_save_regs_using_mov. There are several other changes in ix86_expand_prologue. Please document them in the ChangeLog entry. > (ix86_emit_restore_regs_using_mov): Likewise. > (ix86_expand_epilogue): Likewise. Also in the above tow functions. > (ix86_get_separate_components): New function. > (ix86_components_for_bb): Likewise. > (ix86_disqualify_components): Likewise. > (ix86_emit_prologue_components): Likewise. > (ix86_emit_epilogue_components): Likewise. > (ix86_set_handled_components): Likewise. > (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define. > (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise. > (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise. > (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise. > (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise. > (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise. > * config/i386/i386.h (struct machine_function):Add > reg_is_wrapped_separately array for register wrapping > information. > > gcc/testsuite/ChangeLog: > > * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test. > * gcc.target/i386/interrupt-16.c: Likewise. > * gfortran.dg/guality/arg1.f90: Likewise. > * g++.target/i386/shrink_wrap_separate.C: New test. > * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise. > > Co-authored-by: Michael Matz <m...@suse.de> > --- > gcc/config/i386/i386-protos.h | 7 + > gcc/config/i386/i386.cc | 321 +++++++++++++++--- > gcc/config/i386/i386.h | 1 + > .../g++.target/i386/shrink_wrap_separate.C | 24 ++ > gcc/testsuite/gcc.target/i386/interrupt-16.c | 4 +- > .../i386/shrink_wrap_separate_check_lea.c | 29 ++ > .../gcc.target/x86_64/abi/callabi/leaf-2.c | 2 +- > gcc/testsuite/gfortran.dg/guality/arg1.f90 | 2 +- > 8 files changed, 343 insertions(+), 47 deletions(-) > create mode 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C > create mode 100644 > gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c > > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h > index 10863ab9e9d..86194b3d319 100644 > --- a/gcc/config/i386/i386-protos.h > +++ b/gcc/config/i386/i386-protos.h > @@ -437,6 +437,13 @@ extern rtl_opt_pass *make_pass_align_tight_loops > (gcc::context *); > extern bool ix86_has_no_direct_extern_access; > extern bool ix86_rpad_gate (); > > +extern sbitmap ix86_get_separate_components (void); > +extern sbitmap ix86_components_for_bb (basic_block); > +extern void ix86_disqualify_components (sbitmap, edge, sbitmap, bool); > +extern void ix86_emit_prologue_components (sbitmap); > +extern void ix86_emit_epilogue_components (sbitmap); > +extern void ix86_set_handled_components (sbitmap); > + > /* In i386-expand.cc. */ > bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*, > HOST_WIDE_INT*); > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index d48654a729a..8f2193efe29 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -6905,6 +6905,26 @@ ix86_pro_and_epilogue_can_use_push2pop2 (int nregs) > && (nregs + aligned) >= 3; > } > > +/* Check if push/pop should be used to save/restore registers. */ > +static bool > +save_regs_using_push_pop (HOST_WIDE_INT to_allocate) > +{ > + return ((!to_allocate && cfun->machine->frame.nregs <= 1) > + || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000)) > + /* If static stack checking is enabled and done with probes, > + the registers need to be saved before allocating the frame. */ > + || flag_stack_check == STATIC_BUILTIN_STACK_CHECK > + /* If stack clash probing needs a loop, then it needs a > + scratch register. But the returned register is only guaranteed > + to be safe to use after register saves are complete. So if > + stack clash protections are enabled and the allocated frame is > + larger than the probe interval, then use pushes to save > + callee saved registers. */ > + || (flag_stack_clash_protection > + && !ix86_target_stack_probe () > + && to_allocate > get_probe_interval ())); > +} > + > /* Fill structure ix86_frame about frame of currently computed function. */ > > static void > @@ -7189,20 +7209,7 @@ ix86_compute_frame_layout (void) > /* Size prologue needs to allocate. */ > to_allocate = offset - frame->sse_reg_save_offset; > > - if ((!to_allocate && frame->nregs <= 1) > - || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000)) > - /* If static stack checking is enabled and done with probes, > - the registers need to be saved before allocating the frame. */ > - || flag_stack_check == STATIC_BUILTIN_STACK_CHECK > - /* If stack clash probing needs a loop, then it needs a > - scratch register. But the returned register is only guaranteed > - to be safe to use after register saves are complete. So if > - stack clash protections are enabled and the allocated frame is > - larger than the probe interval, then use pushes to save > - callee saved registers. */ > - || (flag_stack_clash_protection > - && !ix86_target_stack_probe () > - && to_allocate > get_probe_interval ())) > + if (save_regs_using_push_pop (to_allocate)) > frame->save_regs_using_mov = false; > > if (ix86_using_red_zone () > @@ -7660,7 +7667,9 @@ ix86_emit_save_regs_using_mov (HOST_WIDE_INT cfa_offset) > for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++) > if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true)) > { > - ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset); > + /* Skip registers, already processed by shrink wrap separate. */ > + if (!cfun->machine->reg_is_wrapped_separately[regno]) > + ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset); > cfa_offset -= UNITS_PER_WORD; > } > } > @@ -7753,8 +7762,12 @@ pro_epilogue_adjust_stack (rtx dest, rtx src, rtx > offset, > add_frame_related_expr = true; > } > > + if (crtl->shrink_wrapped_separate) > + insn = emit_insn (gen_rtx_SET (dest, gen_rtx_PLUS (Pmode, src, addend))); Please use ix86_expand_binary_operator here, it will correctly generate LEA during pro/epilogue generation. Also, please check indenting. > + else > insn = emit_insn (gen_pro_epilogue_adjust_stack_add > (Pmode, dest, src, addend)); > + > if (style >= 0) > ix86_add_queued_cfa_restore_notes (insn); > > @@ -9218,13 +9231,30 @@ ix86_expand_prologue (void) > the stack frame saving one cycle of the prologue. However, avoid > doing this if we have to probe the stack; at least on x86_64 the > stack probe can turn into a call that clobbers a red zone location. > */ > - else if (ix86_using_red_zone () > - && (! TARGET_STACK_PROBE > - || frame.stack_pointer_offset < CHECK_STACK_LIMIT)) > + else if ((ix86_using_red_zone () > + && (! TARGET_STACK_PROBE > + || frame.stack_pointer_offset < CHECK_STACK_LIMIT)) > + || crtl->shrink_wrapped_separate) > { > + HOST_WIDE_INT allocate_offset; > + if (crtl->shrink_wrapped_separate) > + { > + allocate_offset = m->fs.sp_offset - frame.stack_pointer_offset; > + > + /* Adjust the total offset at the beginning of the function. */ > + pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx, > + GEN_INT (allocate_offset), -1, > + m->fs.cfa_reg == stack_pointer_rtx); > + m->fs.sp_offset = cfun->machine->frame.stack_pointer_offset; > + } > + > ix86_emit_save_regs_using_mov (frame.reg_save_offset); > - cfun->machine->red_zone_used = true; > int_registers_saved = true; > + > + if (ix86_using_red_zone () > + && (! TARGET_STACK_PROBE > + || frame.stack_pointer_offset < CHECK_STACK_LIMIT)) > + cfun->machine->red_zone_used = true; > } > } > > @@ -9344,6 +9374,7 @@ ix86_expand_prologue (void) > && flag_stack_clash_protection > && !ix86_target_stack_probe ()) > { > + gcc_assert (!crtl->shrink_wrapped_separate); Please put some vertical space after gcc_assert, so it will stand out more... > ix86_adjust_stack_and_probe (allocate, int_registers_saved, false); > allocate = 0; > } > @@ -9354,6 +9385,8 @@ ix86_expand_prologue (void) > { > const HOST_WIDE_INT probe_interval = get_probe_interval (); > > + gcc_assert (!crtl->shrink_wrapped_separate); > + > if (STACK_CHECK_MOVING_SP) > { > if (crtl->is_leaf > @@ -9410,12 +9443,14 @@ ix86_expand_prologue (void) > else if (!ix86_target_stack_probe () > || frame.stack_pointer_offset < CHECK_STACK_LIMIT) > { > + gcc_assert (!crtl->shrink_wrapped_separate); ... also here and in other places. > pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx, > GEN_INT (-allocate), -1, > m->fs.cfa_reg == stack_pointer_rtx); > } > else > { > + gcc_assert (!crtl->shrink_wrapped_separate); > rtx eax = gen_rtx_REG (Pmode, AX_REG); > rtx r10 = NULL; > const bool sp_is_cfa_reg = (m->fs.cfa_reg == stack_pointer_rtx); > @@ -9801,30 +9836,35 @@ ix86_emit_restore_regs_using_mov (HOST_WIDE_INT > cfa_offset, > for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++) > if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, maybe_eh_return, > true)) > { > - rtx reg = gen_rtx_REG (word_mode, regno); > - rtx mem; > - rtx_insn *insn; > - > - mem = choose_baseaddr (cfa_offset, NULL); > - mem = gen_frame_mem (word_mode, mem); > - insn = emit_move_insn (reg, mem); > > - if (m->fs.cfa_reg == crtl->drap_reg && regno == REGNO > (crtl->drap_reg)) > + /* Skip registers, already processed by shrink wrap separate. */ > + if (!cfun->machine->reg_is_wrapped_separately[regno]) > { > - /* Previously we'd represented the CFA as an expression > - like *(%ebp - 8). We've just popped that value from > - the stack, which means we need to reset the CFA to > - the drap register. This will remain until we restore > - the stack pointer. */ > - add_reg_note (insn, REG_CFA_DEF_CFA, reg); > - RTX_FRAME_RELATED_P (insn) = 1; > + rtx reg = gen_rtx_REG (word_mode, regno); > + rtx mem; > + rtx_insn *insn; > > - /* This means that the DRAP register is valid for addressing. */ > - m->fs.drap_valid = true; > - } > - else > - ix86_add_cfa_restore_note (NULL, reg, cfa_offset); > + mem = choose_baseaddr (cfa_offset, NULL); > + mem = gen_frame_mem (word_mode, mem); > + insn = emit_move_insn (reg, mem); > > + if (m->fs.cfa_reg == crtl->drap_reg > + && regno == REGNO (crtl->drap_reg)) > + { > + /* Previously we'd represented the CFA as an expression > + like *(%ebp - 8). We've just popped that value from > + the stack, which means we need to reset the CFA to > + the drap register. This will remain until we restore > + the stack pointer. */ > + add_reg_note (insn, REG_CFA_DEF_CFA, reg); > + RTX_FRAME_RELATED_P (insn) = 1; > + > + /* DRAP register is valid for addressing. */ > + m->fs.drap_valid = true; > + } > + else > + ix86_add_cfa_restore_note (NULL, reg, cfa_offset); > + } > cfa_offset -= UNITS_PER_WORD; > } > } > @@ -10103,10 +10143,11 @@ ix86_expand_epilogue (int style) > less work than reloading sp and popping the register. */ > else if (!sp_valid_at (frame.hfp_save_offset) && frame.nregs <= 1) > restore_regs_via_mov = true; > - else if (TARGET_EPILOGUE_USING_MOVE > - && cfun->machine->use_fast_prologue_epilogue > - && (frame.nregs > 1 > - || m->fs.sp_offset != reg_save_offset)) > + else if (crtl->shrink_wrapped_separate > + || (TARGET_EPILOGUE_USING_MOVE > + && cfun->machine->use_fast_prologue_epilogue > + && (frame.nregs > 1 > + || m->fs.sp_offset != reg_save_offset))) > restore_regs_via_mov = true; > else if (frame_pointer_needed > && !frame.nregs > @@ -10120,6 +10161,9 @@ ix86_expand_epilogue (int style) > else > restore_regs_via_mov = false; > > + if (crtl->shrink_wrapped_separate) > + gcc_assert (restore_regs_via_mov); > + > if (restore_regs_via_mov || frame.nsseregs) > { > /* Ensure that the entire register save area is addressable via > @@ -10172,6 +10216,7 @@ ix86_expand_epilogue (int style) > gcc_assert (m->fs.sp_offset == UNITS_PER_WORD); > gcc_assert (!crtl->drap_reg); > gcc_assert (!frame.nregs); > + gcc_assert (!crtl->shrink_wrapped_separate); > } > else if (restore_regs_via_mov) > { > @@ -10186,6 +10231,8 @@ ix86_expand_epilogue (int style) > rtx sa = EH_RETURN_STACKADJ_RTX; > rtx_insn *insn; > > + gcc_assert (!crtl->shrink_wrapped_separate); > + > /* Stack realignment doesn't work with eh_return. */ > if (crtl->stack_realign_needed) > sorry ("Stack realignment not supported with " > @@ -28066,6 +28113,194 @@ ix86_cannot_copy_insn_p (rtx_insn *insn) > #undef TARGET_DOCUMENTATION_NAME > #define TARGET_DOCUMENTATION_NAME "x86" > > +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS. */ > +sbitmap > +ix86_get_separate_components (void) > +{ > + HOST_WIDE_INT offset, to_allocate; > + sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER); > + bitmap_clear (components); > + struct machine_function *m = cfun->machine; > + > + offset = m->frame.stack_pointer_offset; > + to_allocate = offset - m->frame.sse_reg_save_offset; > + > + /* Shrink wrap separate uses MOV, which means APX PPX cannot be used. > + Experiments show that APX PPX can speed up the prologue. If the > function > + does not exit early during actual execution, then using APX PPX is > faster. > + If the function always exits early during actual execution, then shrink > + wrap separate reduces the number of MOV (PUSH/POP) instructions actually > + executed, thus speeding up execution. > + foo: > + movl $1, %eax > + testq %rdi, %rdi > + jne.L60 > + ret ---> early return. > + .L60: > + subq $88, %rsp ---> belong to prologue. > + xorl %eax, %eax > + movq %rbx, 40 (%rsp) ---> belong to prologue. > + movq 8 (%rdi), %rbx > + movq %rbp, 48 (%rsp) ---> belong to prologue. > + movq %rdi, %rbp > + testq %rbx, %rbx > + jne.L61 > + movq 40 (%rsp), %rbx > + movq 48 (%rsp), %rbp > + addq $88, %rsp > + ret > + .L61: > + movq %r12, 56 (%rsp) ---> belong to prologue. > + movq %r13, 64 (%rsp) ---> belong to prologue. > + movq %r14, 72 (%rsp) ---> belong to prologue. Please remove leading spaces before tab in the above assembly. "git diff" will mark this part in red. > + ... ... > + > + It is a trade-off. Disable shrink wrap separate when PPX is enabled. > */ No need to reiterate that this is a trade-off. You already explained it above in great detail. > + if ((TARGET_APX_PPX && !crtl->calls_eh_return) > + || cfun->machine->func_type != TYPE_NORMAL > + || TARGET_SEH > + || crtl->stack_realign_needed > + || m->call_ms2sysv) > + return components; > + > + /* Since shrink wrapping separate uses MOV instead of PUSH/POP. > + Disable shrink wrap separate when MOV is prohibited. */ > + if (save_regs_using_push_pop (to_allocate)) > + return components; > + > + for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++) > + if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true)) > + { > + /* Skip registers with large offsets, where a pseudo may be needed. > */ > + if (IN_RANGE (offset, -0x8000, 0x7fff)) > + bitmap_set_bit (components, regno); > + offset += UNITS_PER_WORD; > + } > + > + /* Don't mess with the following registers. */ > + if (frame_pointer_needed) > + bitmap_clear_bit (components, HARD_FRAME_POINTER_REGNUM); > + > + if (crtl->drap_reg) > + bitmap_clear_bit (components, REGNO (crtl->drap_reg)); > + > + if (pic_offset_table_rtx) > + bitmap_clear_bit (components, REAL_PIC_OFFSET_TABLE_REGNUM); > + > + return components; > +} > + > +/* Implement TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB. */ > +sbitmap > +ix86_components_for_bb (basic_block bb) > +{ > + bitmap in = DF_LIVE_IN (bb); > + bitmap gen = &DF_LIVE_BB_INFO (bb)->gen; > + bitmap kill = &DF_LIVE_BB_INFO (bb)->kill; > + > + sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER); > + bitmap_clear (components); > + > + function_abi_aggregator callee_abis; > + rtx_insn *insn; > + FOR_BB_INSNS (bb, insn) > + if (CALL_P (insn)) > + callee_abis.note_callee_abi (insn_callee_abi (insn)); > + HARD_REG_SET extra_caller_saves = callee_abis.caller_save_regs > (*crtl->abi); > + > + /* GPRs are used in a bb if they are in the IN, GEN, or KILL sets. */ > + for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++) > + if (!fixed_regs[regno] > + && (TEST_HARD_REG_BIT (extra_caller_saves, regno) > + || bitmap_bit_p (in, regno) > + || bitmap_bit_p (gen, regno) > + || bitmap_bit_p (kill, regno))) > + bitmap_set_bit (components, regno); > + > + return components; > +} > + > +/* Implement TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS. */ > +void > +ix86_disqualify_components (sbitmap, edge, sbitmap, bool) > +{ > + /* Nothing to do for x86. */ > +} > + > +/* Implement TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS. */ > +void > +ix86_emit_prologue_components (sbitmap components) > +{ > + HOST_WIDE_INT cfa_offset; > + struct machine_function *m = cfun->machine; > + > + cfa_offset = m->frame.reg_save_offset + m->fs.sp_offset > + - m->frame.stack_pointer_offset; > + for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++) > + if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true)) > + { > + if (bitmap_bit_p (components, regno)) > + ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset); > + cfa_offset -= UNITS_PER_WORD; > + } > +} > + > +/* Implement TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS. */ > +void > +ix86_emit_epilogue_components (sbitmap components) > +{ > + HOST_WIDE_INT cfa_offset; > + struct machine_function *m = cfun->machine; > + cfa_offset = m->frame.reg_save_offset + m->fs.sp_offset > + - m->frame.stack_pointer_offset; > + > + for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++) > + if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true)) > + { > + if (bitmap_bit_p (components, regno)) > + { > + rtx reg = gen_rtx_REG (word_mode, regno); > + rtx mem; > + rtx_insn *insn; > + > + mem = choose_baseaddr (cfa_offset, NULL); > + mem = gen_frame_mem (word_mode, mem); > + insn = emit_move_insn (reg, mem); > + > + RTX_FRAME_RELATED_P (insn) = 1; > + add_reg_note (insn, REG_CFA_RESTORE, reg); > + } > + cfa_offset -= UNITS_PER_WORD; > + } > +} > + Please put a one line comment as is the case with the above functions. > +void > +ix86_set_handled_components (sbitmap components) > +{ > + for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++) > + if (bitmap_bit_p (components, regno)) > + { > + cfun->machine->reg_is_wrapped_separately[regno] = true; > + cfun->machine->use_fast_prologue_epilogue = true; > + cfun->machine->frame.save_regs_using_mov = true; > + } > +} > + > +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS > +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS > ix86_get_separate_components > +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB > +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB ix86_components_for_bb > +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS > +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS ix86_disqualify_components > +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS > +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \ > + ix86_emit_prologue_components > +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS > +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \ > + ix86_emit_epilogue_components > +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS > +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS ix86_set_handled_components > + > struct gcc_target targetm = TARGET_INITIALIZER; > > #include "gt-i386.h" > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h > index d32d9ad997e..5d4fd333169 100644 > --- a/gcc/config/i386/i386.h > +++ b/gcc/config/i386/i386.h > @@ -2821,6 +2821,7 @@ struct GTY(()) machine_function { > /* Cached initial frame layout for the current function. */ > struct ix86_frame frame; A short comment would be nice here... > + bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER]; ... and some vertical space here. > /* For -fsplit-stack support: A stack local which holds a pointer to > the stack arguments for a function with a variable number of > arguments. This is set at the start of the function and is used > diff --git a/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C > b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C > new file mode 100644 > index 00000000000..31bd984df02 > --- /dev/null > +++ b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C > @@ -0,0 +1,24 @@ > +/* { dg-do compile { target { ! ia32 } } } */ Does new functionality also handle 32-bit targets? If this is the case, then please use two scan-rtl-dumps, one for ia32 and the other for ! ia32. Don't disable the test just because of a different string in the dump. > +/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */ > +typedef struct a b; > +typedef double c; > +struct a { > + b *d; > + b *e; > +}; > +struct f { > + c g; > +}; > +inline bool h(c i, b *m) { > + b *j = m->e; > + for (; m->e; j = j->d) > + if (h(i, j)) > + return 0; > + return m; > +} > +bool k() { > + f *l; > + b *n; > + return h(l->g, n); > +} > +/* { dg-final { scan-rtl-dump "The components we wrap separately are \\\[sep > 40 41 42 43\\\]" "pro_and_epilogue" } } */ > diff --git a/gcc/testsuite/gcc.target/i386/interrupt-16.c > b/gcc/testsuite/gcc.target/i386/interrupt-16.c > index cb45ba54e3d..ca4441b3aee 100644 > --- a/gcc/testsuite/gcc.target/i386/interrupt-16.c > +++ b/gcc/testsuite/gcc.target/i386/interrupt-16.c > @@ -18,5 +18,5 @@ foo (int i) > /* { dg-final { scan-assembler-not "(push|pop)(l|q)\[\\t \]*%(r|e)bp" } } */ > /* { dg-final { scan-assembler-not "(push|pop)l\[\\t \]*%edi" { target ia32 > } } } */ > /* { dg-final { scan-assembler-not "(push|pop)q\[\\t \]*%r\[0-9\]+" { target > { ! ia32 } } } } */ > -/* { dg-final { scan-assembler-times "pushq\[\\t \]*%rdi" 1 { target { ! > ia32 } } } } */ > -/* { dg-final { scan-assembler-times "popq\[\\t \]*%rdi" 1 { target { ! ia32 > } } } } */ > +/* { dg-final { scan-assembler-times "(pushq.*%rdi|subq.*\\\$8,.*%rsp)" 1 { > target { ! ia32 } } } } */ > +/* { dg-final { scan-assembler-times "(popq.*%rdi|addq.*\\\$8,.*%rsp)" 1 { > target { ! ia32 } } } } */ > diff --git a/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c > b/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c > new file mode 100644 > index 00000000000..0f2449f68b6 > --- /dev/null > +++ b/gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c > @@ -0,0 +1,29 @@ > +/* { dg-do compile { target { ! ia32 } } } */ > +/* { dg-options "-O2 -fdump-rtl-pro_and_epilogue" } */ > + > +/* Avoid inserting sub between test-je-jle to change EFlags, lea should be > used here > + xorl %eax, %eax > + testl %edi, %edi > + je .L11 > + sub $16, %rsp ------> leaq -16(%rsp), %rsp > + movq %r13, 8(%rsp) > + movl $1, %r13d > + jle .L4 > +*/ > +int foo (int num) > +{ > + if (!num) > + return 0; > + > + register int r13 __asm ("r13") = 1; > + > + for ( int i = 0; i < num; i++) > + { > + register int r12 __asm ("r12") = 1; > + asm volatile ("" : "+r" (r12), "+r" (r13)); > + } > + > + return 1; > +} > +/* { dg-final { scan-rtl-dump "The components we wrap separately are \\\[sep > 40\\\]" "pro_and_epilogue" } } */ > +/* { dg-final { scan-assembler "leaq.*(%rsp)" } } */ > diff --git a/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c > b/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c > index 5f3d3e166af..46fc4648dbd 100644 > --- a/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c > +++ b/gcc/testsuite/gcc.target/x86_64/abi/callabi/leaf-2.c > @@ -1,5 +1,5 @@ > /* { dg-do compile } */ > -/* { dg-options "-O2 -fno-tree-vectorize -mabi=sysv" } */ > +/* { dg-options "-O2 -fno-tree-vectorize -mabi=sysv > -fno-shrink-wrap-separate" } */ > > extern int glb1, gbl2, gbl3; > > diff --git a/gcc/testsuite/gfortran.dg/guality/arg1.f90 > b/gcc/testsuite/gfortran.dg/guality/arg1.f90 > index 332a4ed1d87..775b7bb304f 100644 > --- a/gcc/testsuite/gfortran.dg/guality/arg1.f90 > +++ b/gcc/testsuite/gfortran.dg/guality/arg1.f90 > @@ -1,5 +1,5 @@ > ! { dg-do run } > -! { dg-options "-g" } > +! { dg-options "-fno-shrink-wrap -g" } > integer :: a(10), b(12) > call sub (a, 10) > call sub (b, 12) > -- > 2.34.1 >