So, I have a machine with several styles of branches, among them a normal
one and a short version. The short version is cheaper (sometimes). The
regular one costs 1 cycle when predicted, 7 when mispredicted, and the cost
of a misprediction can be substantially higher depending upon what is in
the cache. The short branch version only applies to forward branches of 1-n
instructions; its cost can be more expensive than a regular branch but is
usually lower. Another benefit of the short version is that it doesn't
pollute the branch cache, thus improving the prediction rates of other
branches. To obtain the costs of each style, I need to run some math taking
into consideration the expected misprediction rates, the execution counts
(predicted or from gcov), and the distance skipped over. I also determine
whether the short branch style applies; if it doesn't, I just say the
regular form is cheaper.
We can see evidence of this incompleteness in the current code:

  if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
      /* Don't reorder blocks when optimizing for size because extra
	 jump insns may be created; also barrier may create extra padding.

	 More correctly we should have a block reordering mode that tried
	 to minimize the combined size of all the jumps.  This would more
	 or less automatically remove extra jumps, but would also try to
	 use more short jumps instead of long jumps.  */
      && optimize_function_for_speed_p (cfun))
    {
      reorder_basic_blocks ();
Essentially, I think the optimization could help other ports if they have a
well balanced implementation, if they are limited by branch prediction, or
if the branch cost is non-zero. Chips that have lots of resources for
prediction and/or can follow multiple paths aren't likely to see any
benefit.

So, here is the patch in progress. It should be close, but the outstanding
questions are: are there any other ports that would benefit from it? If
not, I'd probably say it's not suitable for the tree in this form. And are
there other, better ways of doing this?
If you like the concrete, consider this:

	jump equal, L5		predicted equal
	nop
L5:	nop2
	ret

versus:

	jump not equal, L6	predicted equal
L7:	nop2
	ret
L6:	nop
	b L7
So, if unconditional branches cost 10 cycles and conditional branches 12
cycles, and T is the number of times the condition is true and F the number
of times it is false, the branch costs would be (T+F)*12 in the first case
and (T+F)*12 + F*10 in the second case. The latter is _always_ more
expensive if F is non-zero, and that remains true for any such non-zero
costs, ignoring cache issues. Arguably, we could examine the cost of a
branch and always perform this optimization whenever that cost is high
enough, even without a short branch form. That isn't included in the patch.
A better patch would have the backend figure out the misprediction rates,
do the math, and decide based purely on the combined costs of all the
branches.
Thoughts?
Index: bb-reorder.c
===================================================================
--- bb-reorder.c (revision 1310)
+++ bb-reorder.c (revision 1311)
@@ -192,6 +192,48 @@ static void fix_edges_for_rarely_execute
static void fix_crossing_conditional_branches (void);
static void fix_crossing_unconditional_branches (void);
+/* Return true iff we should prefer the less frequent code
+ of basic block BB to be placed immediately after A. This
+ is useful when the target has a lower cost branch instruction
+ that can avoid the misprediction costs of a normal branch. */
+static bool
+short_branch_cheaper (const_basic_block bb ATTRIBUTE_UNUSED)
+{
+#ifdef TARGET_SHORT_BRANCH_CHEAPER
+ basic_block pbb, nbb;
+ edge e;
+ edge_iterator ei;
+ bool found = false;
+
+ /* When we have:
+ A
+ |\
+ | C
+ |/
+ B
+ we check the port to see if TARGET_SHORT_BRANCH_CHEAPER (A) is
+ true. It can use the frequencies, the misprediction costs and
+ branch distance to decide. We only handle the simple case of a
+ single predecessor of C (bb) and a single successor of C (bb). */
+ if (!single_pred_p (bb))
+ return false;
+ pbb = single_pred (bb);
+ if (!single_succ_p (bb))
+ return false;
+ nbb = single_succ (bb);
+ FOR_EACH_EDGE (e, ei, nbb->preds)
+ if (e->src == pbb)
+ found = true;
+ if (!found)
+ return false;
+
+ if (TARGET_SHORT_BRANCH_CHEAPER (BB_END (pbb)))
+ return true;
+#endif
+
+ return false;
+}
+
/* Check to see if bb should be pushed into the next round of trace
collections or not. Reasons for pushing the block forward are 1).
If the block is cold, we are doing partitioning, and there will be
@@ -214,7 +256,8 @@ push_to_next_round_p (const_basic_block
|| probably_never_executed_bb_p (bb));
if (there_exists_another_round
- && block_not_hot_enough)
+ && block_not_hot_enough
+ && !short_branch_cheaper (bb))
return true;
else
return false;
@@ -699,7 +742,8 @@ find_traces_1_round (int branch_th, int
& EDGE_CAN_FALLTHRU)
&& !(single_succ_edge (e->dest)->flags & EDGE_COMPLEX)
&& single_succ (e->dest) == best_edge->dest
- && 2 * e->dest->frequency >= EDGE_FREQUENCY (best_edge))
+ && (2 * e->dest->frequency >= EDGE_FREQUENCY (best_edge)
+ || short_branch_cheaper (e->dest)))
{
best_edge = e;
if (dump_file)