I noticed that when there are registers to save (which can vary with the ABI), shrink-wrapping arranges for a quicker early return than when there are no registers to save but there are still some dull argument copies to make for the main body of the function, even though those copies are not needed on the early-return path. Most of the logic to do shrink-wrapping in the absence of register saves is already there, and the generated code indeed looks better when it is used that way. However, I couldn't find a difference in the execution time of the benchmarks I was looking at, presumably because the function didn't actually return early (it does things with an array of N elements where N might be zero... but it isn't for the actual data).
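To illustrate the shape of code I mean (a made-up sketch, not one of the benchmarks I measured; the struct and function names are invented for this mail), consider a function where the first basic block sets up loop state that the early-return path never uses:

/* Hypothetical example -- struct ctx, sum_scaled and its arguments are made
   up for illustration; they are not from the benchmarks mentioned above.  */

struct ctx
{
  const double *data;
  double scale;
  double bias;
};

double
sum_scaled (struct ctx c, const double *fallback, long n)
{
  /* Setup in the first basic block: dead on the early-return path.  */
  double acc = c.bias;
  const double *p = c.data ? c.data : fallback;

  if (n == 0)
    /* Early return: none of the setup above is needed here, so it would
       be nice if it were sunk below this branch.  */
    return 0.0;

  for (long i = 0; i < n; i++)
    acc += c.scale * p[i];
  return acc;
}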
Does anyone have a benchmark or computing load where the early return is beneficial? Or, conversely, harmful?
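In case it helps to narrow that down, here is the kind of toy load I would expect to show a difference. It is entirely made up (it just reuses the hypothetical sum_scaled sketch above and makes the n == 0 path hot), so treat it as a starting point rather than a real benchmark; building with TAKE_EARLY_RETURN defined to 0 makes the early return never trigger, which is where the "keep enough instructions in front of the branch" concern from the comment in the patch would show up, if anywhere.

/* Toy harness, not a real benchmark.  struct ctx and sum_scaled are the
   hypothetical sketch from earlier in this mail; compile and link the
   two files together.  */
#include <stdio.h>
#include <stdlib.h>

struct ctx
{
  const double *data;
  double scale;
  double bias;
};
extern double sum_scaled (struct ctx c, const double *fallback, long n);

#ifndef TAKE_EARLY_RETURN
#define TAKE_EARLY_RETURN 1	/* 1: n == 0 dominates; 0: never taken.  */
#endif

int
main (void)
{
  enum { LEN = 256 };
  const long iters = 100000000;
  static double buf[LEN];
  struct ctx c = { buf, 1.000001, 0.5 };
  double total = 0.0;

  for (long i = 0; i < LEN; i++)
    buf[i] = (double) i;

  for (long i = 0; i < iters; i++)
    {
      /* Mix in some non-zero lengths so the call cannot be folded away;
	 with TAKE_EARLY_RETURN the early return is taken 63 times in 64.  */
      long n = TAKE_EARLY_RETURN ? (i % 64 ? 0 : LEN) : LEN;
      total += sum_scaled (c, buf, n);
    }

  printf ("%f\n", total);	/* keep the result live  */
  return EXIT_SUCCESS;
}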
2022-03-14  Joern Rennecke  <joern.renne...@embecosm.com>

	* common.opt (fearly-return): New option.
	* shrink-wrap.cc (try_early_return): New function.
	(try_shrink_wrapping): Call try_early_return.

diff --git a/gcc/common.opt b/gcc/common.opt
index 8b6513de47c..901287fcad6 100644
--- a/gcc/common.opt
+++ b/gcc/common.opt
@@ -3607,4 +3607,8 @@ fipa-ra
 Common Var(flag_ipa_ra) Optimization
 Use caller save register across calls if possible.
 
+fearly-return
+Common Var(flag_early_return) Optimization Init(1)
+Extend shrink-wrapping to prologue-free functions.
+
 ; This comment is to ensure we retain the blank line above.
diff --git a/gcc/shrink-wrap.cc b/gcc/shrink-wrap.cc
index 30166bd20eb..31ab0ecff10 100644
--- a/gcc/shrink-wrap.cc
+++ b/gcc/shrink-wrap.cc
@@ -586,6 +586,42 @@ handle_simple_exit (edge e)
 	     INSN_UID (ret), e->src->index);
 }
 
+/* Even if there is no prologue, we might have a number of argument
+   copy and initialization statements in the first basic block that
+   might be unnecessary if we return early.  */
+/* ??? This might be overly aggressive for super-scalar processors without
+   speculative execution in that we might want to keep enough instructions
+   in front of the branch to fill all issue slots.
+
+   If the branch depends on a register copied from another register
+   immediately before, later passes already take care of propagating the
+   copy into the branch.  */
+void
+try_early_return (edge *entry_edge)
+{
+  basic_block entry = (*entry_edge)->dest;
+  if (EDGE_COUNT (entry->succs) != 2 || !single_pred_p (entry))
+    return;
+  edge e;
+  edge_iterator ei;
+  const int max_depth = 20;
+
+  FOR_EACH_EDGE (e, ei, entry->succs)
+    {
+      basic_block dst = e->dest;
+      for (int i = max_depth; --i; dst = single_succ (dst))
+	{
+	  if (dst == EXIT_BLOCK_PTR_FOR_FN (cfun))
+	    {
+	      prepare_shrink_wrap (entry);
+	      return;
+	    }
+	  if (!single_succ_p (dst))
+	    break;
+	}
+    }
+}
+
 /* Try to perform a kind of shrink-wrapping, making sure the
    prologue/epilogue is emitted only around those parts of the
    function that require it.
@@ -666,7 +702,11 @@ try_shrink_wrapping (edge *entry_edge, rtx_insn *prologue_seq)
 	  break;
 	}
   if (empty_prologue)
-    return;
+    {
+      if (flag_early_return)
+	try_early_return (entry_edge);
+      return;
+    }
 
   /* Move some code down to expose more shrink-wrapping opportunities.  */