> > From: Lili Cui <lili....@intel.com>
> >
> > Hi Uros,
> >
> > Thank you very much for providing a detailed BKM to reproduce the Linux
> > kernel boot failure.  My patch and Matz's patch both have this problem.
> > We inserted a SUB between TEST and JLE, and the SUB changes the value of
> > EFlags.  The JLE branch then goes the wrong way, and a null pointer
> > shows up after passing through many functions.  I created a small test
> > case for it and added it to my patch.  I use lea instead of sub/add to
> > fix this issue.  Shrink-wrap also benefits from the lea instruction.

> Thanks for your thorough testing!  I think that introducing a boot test for
> the linux kernel (besides SPEC tests) in your testing is a great addition,
> and I would recommend performing it especially for an invasive change like
> this.  In addition, Intel has continuous integration testing for the linux
> kernel patches; perhaps something similar can be used to boot and test
> linux with GCC changes.

Agreed, your suggestion makes sense.
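For context, the failing shape is an early-return test whose flags are also
consumed by a later conditional branch.  The sketch below is only
illustrative (it is not the reduced kernel case or the new testcase, and
bar/baz are made-up helpers):

    extern int bar (int);
    extern int baz (int);

    /* A single "testl %edi, %edi" can feed both the early-return "je" and
       the later "jle".  If a separately wrapped register save drops a
       "sub $N, %rsp" between them, EFlags are clobbered and the "jle" can
       go the wrong way; emitting "lea" for the stack adjustment leaves
       EFlags untouched.  */
    int
    foo (int n)
    {
      if (n == 0)
        return 0;
      if (n < 0)
        return bar (n);   /* The path that ends up needing callee-saved regs.  */
      return baz (n);
    }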
> > Real case:
> > Avoid inserting sub between test-je-jle, which would change EFlags; lea
> > should be used here:
> >         xorl    %eax, %eax
> >         testl   %edi, %edi
> >         je      .L11
> >         sub     $16, %rsp      ------> leaq    -16(%rsp), %rsp

> FAOD, you mean to insert SUB between TESTL and JE?  The above case is not
> problematic, because flags are produced and consumed without being
> clobbered.

> >         movq    %r13, 8(%rsp)
> >         movl    $1, %r13d
> >         jle     .L4

Not between TESTL and JE, but between TESTL and JLE.  Before
shrink-wrapping, the sequence is:

        push    %r13
        push    ...
        xorl    %eax, %eax
        testl   %edi, %edi
        je      .L11           ---> early return.
        movl    $1, %r13d
        jle     .L4

Shrink-wrapping-separate replaces "push %r13" with "sub $16, %rsp" +
"movq %r13, 8(%rsp)" and inserts it before "movl $1, %r13d".  The
"sub $16, %rsp" modifies the SF flag, and then we take the wrong branch.

> > GCC change:
> > Emit lea in ix86_expand_prologue/ix86_expand_epilogue; peephole2 will
> > decide whether to replace the lea with sub/add depending on the
> > situation.
> >
> > I studied Matz's patch; the effect we achieve is basically the same,
> > although there are some differences in the implementation.
> >
> > I incorporated some of Matz's good ideas into my patch.
> > 1. Matz added a lot of assertions to detect errors early, and I also
> >    added some in this patch.
> > 2. Matz disabled shrink-wrap-separate in some special scenarios, and I
> >    also added these checks.
> >
> > Remaining differences:
> > 1. The position of the inserted SUB is different.  Matz inserts SUB at
> >    the end of the prologue; I put SUB at the first line of the prologue.
> >    I keep my way here.
> > 2. Matz handles both SSE and general registers, while I only handle
> >    general registers.  Enabling SSE registers is easy, but the SSE part
> >    seems to be Windows ABI related.  I have no way to test it yet, so I
> >    won't add it for now.
> > 3. After reload, some registers that were eliminated by optimization are
> >    still marked in regs_ever_live, resulting in redundant push/pop when
> >    the prologue is generated later.  My patch eliminates these redundant
> >    registers and keeps the stack frame layout the same as before.
> >
> > Collected spec2017 performance on ZNVER5, EMR and ICELAKE.  No
> > performance regression was observed.
> > For O2 multi-copy:
> > 511.povray_r improved by 2.8% on ZNVER5.
> > 511.povray_r improved by 4% on EMR.
> > 511.povray_r improved by 3.3% ~ 4.6% on ICELAKE.
> >
> > Bootstrapped & regtested on x86-64-pc-linux-gnu.
> > I used this patch to build the latest Linux kernel, and it boots
> > successfully.
> >
> > Thanks,
> > Lili.
> >
> >
> > This commit implements the target macros (TARGET_SHRINK_WRAP_*) that
> > enable separate shrink wrapping for function prologues/epilogues in
> > x86.
> >
> > When performing separate shrink wrapping, we choose to use mov instead
> > of push/pop, because with push/pop it is more complicated to handle the
> > rsp adjustment and performance may be lost; mov has a small impact on
> > code size but guarantees performance.
> >
> > Using mov means we need to use sub/add to maintain the stack frame.  In
> > some special cases, we need to use lea to avoid affecting EFlags.
> >
> > Avoid inserting sub between test-je-jle, which would change EFlags; lea
> > should be used here.
> >
> > foo:
> >         xorl    %eax, %eax
> >         testl   %edi, %edi
> >         je      .L11
> >         sub     $16, %rsp      ------> leaq    -16(%rsp), %rsp

> Also here, is the placement of SUB correct?

Yes, it is before JLE.
> >         movq    %r13, 8(%rsp)
> >         movl    $1, %r13d
> >         jle     .L4
> >
> > Tested against SPEC CPU 2017, this change always has a net-positive
> > effect on the dynamic instruction count.  See the following table for
> > the breakdown on how this reduces the number of dynamic instructions
> > per workload on a like-for-like basis (with/without this commit):
> >
> > instruction count     base            with commit     (commit-base)/commit
> > 502.gcc_r             98666845943     96891561634     -1.80%
> > 526.blender_r         6.21226E+11     6.12992E+11     -1.33%
> > 520.omnetpp_r         1.1241E+11      1.11093E+11     -1.17%
> > 500.perlbench_r       1271558717      1263268350      -0.65%
> > 523.xalancbmk_r       2.20103E+11     2.18836E+11     -0.58%
> > 531.deepsjeng_r       2.73591E+11     2.72114E+11     -0.54%
> > 500.perlbench_r       64195557393     63881512409     -0.49%
> > 541.leela_r           2.99097E+11     2.98245E+11     -0.29%
> > 548.exchange2_r       1.27976E+11     1.27784E+11     -0.15%
> > 527.cam4_r            88981458425     88887334679     -0.11%
> > 554.roms_r            2.60072E+11     2.59809E+11     -0.10%
> >
> > Collected spec2017 performance on ZNVER5, EMR and ICELAKE.  No
> > performance regression was observed.
> >
> > For O2 multi-copy:
> > 511.povray_r improved by 2.8% on ZNVER5.
> > 511.povray_r improved by 4% on EMR.
> > 511.povray_r improved by 3.3% ~ 4.6% on ICELAKE.
> >
> > gcc/ChangeLog:
> >
> >         * config/i386/i386-protos.h (ix86_get_separate_components):
> >         New function.
> >         (ix86_components_for_bb): Likewise.
> >         (ix86_disqualify_components): Likewise.
> >         (ix86_emit_prologue_components): Likewise.
> >         (ix86_emit_epilogue_components): Likewise.
> >         (ix86_set_handled_components): Likewise.
> >         * config/i386/i386.cc (save_regs_using_push_pop):
> >         Encapsulate code.

> Better say: "Split from ix86_compute_frame_layout."

Done.

> >         (ix86_compute_frame_layout):
> >         Handle save_regs_using_push_pop.

> Use save_regs_using_push_pop.

Done.

> >         (ix86_emit_save_regs_using_mov):
> >         Skip registers that are wrapped separately.
> >         (pro_epilogue_adjust_stack): Emit insn without clobber.
> >         (ix86_expand_prologue): Likewise.

> Looks like "Likewise" applies to ix86_emit_save_regs_using_mov.  There are
> several other changes in ix86_expand_prologue.  Please document them in
> the ChangeLog entry.

> >         (ix86_emit_restore_regs_using_mov): Likewise.
> >         (ix86_expand_epilogue): Likewise.

> Also in the above two functions.

Done.

> >         (ix86_get_separate_components): New function.
> >         (ix86_components_for_bb): Likewise.
> >         (ix86_disqualify_components): Likewise.
> >         (ix86_emit_prologue_components): Likewise.
> >         (ix86_emit_epilogue_components): Likewise.
> >         (ix86_set_handled_components): Likewise.
> >         (TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS): Define.
> >         (TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB): Likewise.
> >         (TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS): Likewise.
> >         (TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS): Likewise.
> >         (TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS): Likewise.
> >         (TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS): Likewise.
> >         * config/i386/i386.h (struct machine_function): Add
> >         reg_is_wrapped_separately array for register wrapping
> >         information.
> >
> > gcc/testsuite/ChangeLog:
> >
> >         * gcc.target/x86_64/abi/callabi/leaf-2.c: Adjust the test.
> >         * gcc.target/i386/interrupt-16.c: Likewise.
> >         * gfortran.dg/guality/arg1.f90: Likewise.
> >         * g++.target/i386/shrink_wrap_separate.C: New test.
> >         * gcc.target/i386/shrink_wrap_separate_check_lea.c: Likewise.
> >
> > Co-authored-by: Michael Matz <m...@suse.de>
> > ---
> >  gcc/config/i386/i386-protos.h                       |   7 +
> >  gcc/config/i386/i386.cc                             | 321 +++++++++++++++---
> >  gcc/config/i386/i386.h                              |   1 +
> >  .../g++.target/i386/shrink_wrap_separate.C          |  24 ++
> >  gcc/testsuite/gcc.target/i386/interrupt-16.c        |   4 +-
> >  .../i386/shrink_wrap_separate_check_lea.c           |  29 ++
> >  .../gcc.target/x86_64/abi/callabi/leaf-2.c          |   2 +-
> >  gcc/testsuite/gfortran.dg/guality/arg1.f90          |   2 +-
> >  8 files changed, 343 insertions(+), 47 deletions(-)
> >  create mode 100644 gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> >  create mode 100644 gcc/testsuite/gcc.target/i386/shrink_wrap_separate_check_lea.c
> >
> > diff --git a/gcc/config/i386/i386-protos.h b/gcc/config/i386/i386-protos.h
> > index 10863ab9e9d..86194b3d319 100644
> > --- a/gcc/config/i386/i386-protos.h
> > +++ b/gcc/config/i386/i386-protos.h
> > @@ -437,6 +437,13 @@ extern rtl_opt_pass *make_pass_align_tight_loops (gcc::context *);
> >  extern bool ix86_has_no_direct_extern_access;
> >  extern bool ix86_rpad_gate ();
> >
> > +extern sbitmap ix86_get_separate_components (void);
> > +extern sbitmap ix86_components_for_bb (basic_block);
> > +extern void ix86_disqualify_components (sbitmap, edge, sbitmap, bool);
> > +extern void ix86_emit_prologue_components (sbitmap);
> > +extern void ix86_emit_epilogue_components (sbitmap);
> > +extern void ix86_set_handled_components (sbitmap);
> > +
> >  /* In i386-expand.cc.  */
> >  bool ix86_check_builtin_isa_match (unsigned int, HOST_WIDE_INT*,
> >                                     HOST_WIDE_INT*);
> > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc
> > index d48654a729a..8f2193efe29 100644
> > --- a/gcc/config/i386/i386.cc
> > +++ b/gcc/config/i386/i386.cc
> > @@ -6905,6 +6905,26 @@ ix86_pro_and_epilogue_can_use_push2pop2 (int nregs)
> >           && (nregs + aligned) >= 3;
> >  }
> >
> > +/* Check if push/pop should be used to save/restore registers.  */
> > +static bool
> > +save_regs_using_push_pop (HOST_WIDE_INT to_allocate)
> > +{
> > +  return ((!to_allocate && cfun->machine->frame.nregs <= 1)
> > +          || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> > +          /* If static stack checking is enabled and done with probes,
> > +             the registers need to be saved before allocating the frame.  */
> > +          || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> > +          /* If stack clash probing needs a loop, then it needs a
> > +             scratch register.  But the returned register is only guaranteed
> > +             to be safe to use after register saves are complete.  So if
> > +             stack clash protections are enabled and the allocated frame is
> > +             larger than the probe interval, then use pushes to save
> > +             callee saved registers.  */
> > +          || (flag_stack_clash_protection
> > +              && !ix86_target_stack_probe ()
> > +              && to_allocate > get_probe_interval ()));
> > +}
> > +
> >  /* Fill structure ix86_frame about frame of currently computed function.  */
> >
> >  static void
> > @@ -7189,20 +7209,7 @@ ix86_compute_frame_layout (void)
> >    /* Size prologue needs to allocate.  */
> >    to_allocate = offset - frame->sse_reg_save_offset;
> >
> > -  if ((!to_allocate && frame->nregs <= 1)
> > -      || (TARGET_64BIT && to_allocate >= HOST_WIDE_INT_C (0x80000000))
> > -      /* If static stack checking is enabled and done with probes,
> > -         the registers need to be saved before allocating the frame.  */
> > -      || flag_stack_check == STATIC_BUILTIN_STACK_CHECK
> > -      /* If stack clash probing needs a loop, then it needs a
> > -         scratch register.  But the returned register is only guaranteed
> > -         to be safe to use after register saves are complete.  So if
> > -         stack clash protections are enabled and the allocated frame is
> > -         larger than the probe interval, then use pushes to save
> > -         callee saved registers.  */
> > -      || (flag_stack_clash_protection
> > -          && !ix86_target_stack_probe ()
> > -          && to_allocate > get_probe_interval ()))
> > +  if (save_regs_using_push_pop (to_allocate))
> >      frame->save_regs_using_mov = false;
> >
> >    if (ix86_using_red_zone ()
> > @@ -7660,7 +7667,9 @@ ix86_emit_save_regs_using_mov (HOST_WIDE_INT cfa_offset)
> >    for (regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> >      if (GENERAL_REGNO_P (regno) && ix86_save_reg (regno, true, true))
> >        {
> > -       ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> > +       /* Skip registers, already processed by shrink wrap separate.  */
> > +       if (!cfun->machine->reg_is_wrapped_separately[regno])
> > +         ix86_emit_save_reg_using_mov (word_mode, regno, cfa_offset);
> >         cfa_offset -= UNITS_PER_WORD;
> >        }
> >  }
> > @@ -7753,8 +7762,12 @@ pro_epilogue_adjust_stack (rtx dest, rtx src, rtx offset,
> >        add_frame_related_expr = true;
> >      }
> >
> > +  if (crtl->shrink_wrapped_separate)
> > +    insn = emit_insn (gen_rtx_SET (dest, gen_rtx_PLUS (Pmode, src,
> > +                                                       addend)));

> Please use ix86_expand_binary_operator here, it will correctly generate LEA
> during pro/epilogue generation.  Also, please check indenting.

Ok.

> > +  else
> >      insn = emit_insn (gen_pro_epilogue_adjust_stack_add
> >                        (Pmode, dest, src, addend));
> > +
> >    if (style >= 0)
> >      ix86_add_queued_cfa_restore_notes (insn);
> >
> > @@ -9218,13 +9231,30 @@ ix86_expand_prologue (void)
> >       the stack frame saving one cycle of the prologue.  However, avoid
> >       doing this if we have to probe the stack; at least on x86_64 the
> >       stack probe can turn into a call that clobbers a red zone
> >       location.  */
> > -  else if (ix86_using_red_zone ()
> > -           && (! TARGET_STACK_PROBE
> > -               || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > +  else if ((ix86_using_red_zone ()
> > +            && (! TARGET_STACK_PROBE
> > +                || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > +           || crtl->shrink_wrapped_separate)
> >      {
> > +      HOST_WIDE_INT allocate_offset;
> > +      if (crtl->shrink_wrapped_separate)
> > +        {
> > +          allocate_offset = m->fs.sp_offset - frame.stack_pointer_offset;
> > +
> > +          /* Adjust the total offset at the beginning of the function.  */
> > +          pro_epilogue_adjust_stack (stack_pointer_rtx, stack_pointer_rtx,
> > +                                     GEN_INT (allocate_offset), -1,
> > +                                     m->fs.cfa_reg == stack_pointer_rtx);
> > +          m->fs.sp_offset = cfun->machine->frame.stack_pointer_offset;
> > +        }
> > +
> >        ix86_emit_save_regs_using_mov (frame.reg_save_offset);
> > -      cfun->machine->red_zone_used = true;
> >        int_registers_saved = true;
> > +
> > +      if (ix86_using_red_zone ()
> > +          && (! TARGET_STACK_PROBE
> > +              || frame.stack_pointer_offset < CHECK_STACK_LIMIT))
> > +        cfun->machine->red_zone_used = true;
> >      }
> >  }
> >
> > @@ -9344,6 +9374,7 @@ ix86_expand_prologue (void)
> >        && flag_stack_clash_protection
> >        && !ix86_target_stack_probe ())
> >      {
> > +      gcc_assert (!crtl->shrink_wrapped_separate);

> Please put some vertical space after gcc_assert, so it will stand out
> more...

Done.
> >        ix86_adjust_stack_and_probe (allocate, int_registers_saved, false);
> >        allocate = 0;
> >      }
> > @@ -9354,6 +9385,8 @@ ix86_expand_prologue (void)
> >      {
> >        const HOST_WIDE_INT probe_interval = get_probe_interval ();
> >
> > +      gcc_assert (!crtl->shrink_wrapped_separate);
> > +
> >        if (STACK_CHECK_MOVING_SP)
> >          {
> >            if (crtl->is_leaf
> > @@ -9410,12 +9443,14 @@ ix86_expand_prologue (void)
> >    else if (!ix86_target_stack_probe ()
> >             || frame.stack_pointer_offset < CHECK_STACK_LIMIT)
> >      {
> > +      gcc_assert (!crtl->shrink_wrapped_separate);

> ... also here and in other places.

Done.

> > +/* Implement TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS.  */
> > +sbitmap
> > +ix86_get_separate_components (void)
> > +{
> > +  HOST_WIDE_INT offset, to_allocate;
> > +  sbitmap components = sbitmap_alloc (FIRST_PSEUDO_REGISTER);
> > +  bitmap_clear (components);
> > +  struct machine_function *m = cfun->machine;
> > +
> > +  offset = m->frame.stack_pointer_offset;
> > +  to_allocate = offset - m->frame.sse_reg_save_offset;
> > +
> > +  /* Shrink wrap separate uses MOV, which means APX PPX cannot be used.
> > +     Experiments show that APX PPX can speed up the prologue.  If the
> > +     function does not exit early during actual execution, then using APX
> > +     PPX is faster.  If the function always exits early during actual
> > +     execution, then shrink wrap separate reduces the number of MOV
> > +     (PUSH/POP) instructions actually executed, thus speeding up execution.
> > +     foo:
> > +        movl    $1, %eax
> > +        testq   %rdi, %rdi
> > +        jne     .L60
> > +        ret                     ---> early return.
> > +     .L60:
> > +        subq    $88, %rsp       ---> belongs to prologue.
> > +        xorl    %eax, %eax
> > +        movq    %rbx, 40(%rsp)  ---> belongs to prologue.
> > +        movq    8(%rdi), %rbx
> > +        movq    %rbp, 48(%rsp)  ---> belongs to prologue.
> > +        movq    %rdi, %rbp
> > +        testq   %rbx, %rbx
> > +        jne     .L61
> > +        movq    40(%rsp), %rbx
> > +        movq    48(%rsp), %rbp
> > +        addq    $88, %rsp
> > +        ret
> > +     .L61:
> > +        movq    %r12, 56(%rsp)  ---> belongs to prologue.
> > +        movq    %r13, 64(%rsp)  ---> belongs to prologue.
> > +        movq    %r14, 72(%rsp)  ---> belongs to prologue.

> Please remove leading spaces before tab in the above assembly.  "git diff"
> will mark this part in red.

Done.

> > +     ...
> > +
> > +     It is a trade-off.  Disable shrink wrap separate when PPX is
> > +     enabled.  */

> No need to reiterate that this is a trade-off.  You already explained it
> above in great detail.

Done.

> Please put a one line comment as is the case with the above functions.

Done.
> > +void
> > +ix86_set_handled_components (sbitmap components)
> > +{
> > +  for (unsigned int regno = 0; regno < FIRST_PSEUDO_REGISTER; regno++)
> > +    if (bitmap_bit_p (components, regno))
> > +      {
> > +        cfun->machine->reg_is_wrapped_separately[regno] = true;
> > +        cfun->machine->use_fast_prologue_epilogue = true;
> > +        cfun->machine->frame.save_regs_using_mov = true;
> > +      }
> > +}
> > +
> > +#undef TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_GET_SEPARATE_COMPONENTS ix86_get_separate_components
> > +#undef TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB
> > +#define TARGET_SHRINK_WRAP_COMPONENTS_FOR_BB ix86_components_for_bb
> > +#undef TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_DISQUALIFY_COMPONENTS ix86_disqualify_components
> > +#undef TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_PROLOGUE_COMPONENTS \
> > +  ix86_emit_prologue_components
> > +#undef TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_EMIT_EPILOGUE_COMPONENTS \
> > +  ix86_emit_epilogue_components
> > +#undef TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS
> > +#define TARGET_SHRINK_WRAP_SET_HANDLED_COMPONENTS ix86_set_handled_components
> > +
> >  struct gcc_target targetm = TARGET_INITIALIZER;
> >
> >  #include "gt-i386.h"
> > diff --git a/gcc/config/i386/i386.h b/gcc/config/i386/i386.h
> > index d32d9ad997e..5d4fd333169 100644
> > --- a/gcc/config/i386/i386.h
> > +++ b/gcc/config/i386/i386.h
> > @@ -2821,6 +2821,7 @@ struct GTY(()) machine_function {
> >    /* Cached initial frame layout for the current function.  */
> >    struct ix86_frame frame;

> A short comment would be nice here...

Done.

> > +  bool reg_is_wrapped_separately[FIRST_PSEUDO_REGISTER];

> ... and some vertical space here.

Done.

> >    /* For -fsplit-stack support: A stack local which holds a pointer to
> >       the stack arguments for a function with a variable number of
> >       arguments.  This is set at the start of the function and is used
> > diff --git a/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > new file mode 100644
> > index 00000000000..31bd984df02
> > --- /dev/null
> > +++ b/gcc/testsuite/g++.target/i386/shrink_wrap_separate.C
> > @@ -0,0 +1,24 @@
> > +/* { dg-do compile { target { ! ia32 } } } */

> Does new functionality also handle 32-bit targets?  If this is the case,
> then please use two scan-rtl-dumps, one for ia32 and the other for ! ia32.
> Don't disable the test just because of a different string in the dump.

You are right, I will fix this.

Thanks,
Lili.
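P.S.  Just to sketch the ia32 / !ia32 split in directive form (the scan
patterns below are placeholders of my own, and the final test may instead
scan the pro_and_epilogue RTL dump as you suggest rather than the assembly):

    /* { dg-do compile } */
    /* { dg-options "-O2" } */
    ...
    /* { dg-final { scan-assembler "leaq\[ \t\]+-\[0-9\]+\\(%rsp\\)" { target { ! ia32 } } } } */
    /* { dg-final { scan-assembler "leal\[ \t\]+-\[0-9\]+\\(%esp\\)" { target ia32 } } } */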