So, I have a machine with several styles of branches, among them a normal
one and a short version. The short version is cheaper (sometimes). The
regular one costs 1 cycle when predicted, 7 when mispredicted, and the cost
of a misprediction can be substantially higher depending upon what is in
the cache. The short branch version only applies to forward branches of 1-n
instructions; its cost can be more expensive than a regular branch but is
usually lower. Another benefit of the short version is that it doesn't
pollute the branch cache, thus improving the prediction rates of other
branches. To obtain the costs of each style, I need to run some math taking
into consideration the expected misprediction rates, the execution counts
(predicted or from gcov), and the distance skipped over. I also determine
whether the short branch style applies; if it doesn't, I just say the
regular form is cheaper.
We can see evidence of this incompleteness in the current code:

  if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
      /* Don't reorder blocks when optimizing for size because extra
	 jump insns may be created; also barrier may create extra padding.

	 More correctly we should have a block reordering mode that tried
	 to minimize the combined size of all the jumps.  This would more
	 or less automatically remove extra jumps, but would also try to
	 use more short jumps instead of long jumps.  */
      && optimize_function_for_speed_p (cfun))
    {
      reorder_basic_blocks ();
Essentially, I think the optimization could help other ports if they have a
well balanced implementation, if they are limited by branch prediction, or
if the branch cost is non-zero. Chips that have lots of resources for
prediction and/or can follow multiple paths aren't likely to see any
benefit.

So, here is the patch in progress. It should be close, but the outstanding
questions are: are there any other ports that would benefit from it? If
not, I'd probably say it's not suitable for the tree in this form. And are
there other, better ways of doing this?
If you like the concrete, consider this:

	jump equal, L5		predicted equal
	nop
L5:	nop2
	ret

versus:

	jump not equal, L6	predicted equal
L7:	nop2
	ret
L6:	nop
	b L7
So, if unconditional branches cost 10 cycles and conditional branches 12
cycles, and T is the number of times the condition is true and F the number
of times it is false, the branch costs would be (T+F)*12 in the first case
and (T+F)*12 + F*10 in the second case. The latter is _always_ more
expensive if F is non-zero, and that remains true for any such non-zero
costs, ignoring cache issues. Arguably, we could examine the cost of a
branch and always perform this optimization whenever that cost is high
enough, even without a short branch form. That isn't included in the patch.
A better patch would have the backend figure out the misprediction rates,
do the math, and decide based purely on the combined costs of all the
branches.
Thoughts?
Index: bb-reorder.c
===================================================================
--- bb-reorder.c (revision 1310)
+++ bb-reorder.c (revision 1311)
@@ -192,6 +192,48 @@ static void fix_edges_for_rarely_execute
static void fix_crossing_conditional_branches (void);
static void fix_crossing_unconditional_branches (void);
+/* Return true iff we should prefer the less frequent code
+ of basic block BB to be placed immediately after A. This
+ is useful when the target has a lower cost branch instruction
+ that can avoid the misprediction costs of a normal branch. */
+static bool
+short_branch_cheaper (const_basic_block bb ATTRIBUTE_UNUSED)
+{
+#ifdef TARGET_SHORT_BRANCH_CHEAPER
+ basic_block pbb, nbb;
+ edge e;
+ edge_iterator ei;
+ bool found = false;
+
+ /* When we have:
+ A
+ |\
+ | C
+ |/
+ B
+ we check the port to see if TARGET_SHORT_BRANCH_CHEAPER (A) is
+ true. It can use the frequencies, the misprediction costs and
+ branch distance to decide. We only handle the simple case of a
+ single predecessor of C (bb) and a single successor of C (bb). */
+ if (!single_pred_p (bb))
+ return false;
+ pbb = single_pred (bb);
+ if (!single_succ_p (bb))
+ return false;
+ nbb = single_succ (bb);
+ FOR_EACH_EDGE (e, ei, nbb->preds)
+ if (e->src == pbb)
+ found = true;
+ if (!found)
+ return false;
+
+ if (TARGET_SHORT_BRANCH_CHEAPER (BB_END (pbb)))
+ return true;
+#endif
+
+ return false;
+}
+
/* Check to see if bb should be pushed into the next round of trace
collections or not. Reasons for pushing the block forward are 1).
If the block is cold, we are doing partitioning, and there will be
@@ -214,7 +256,8 @@ push_to_next_round_p (const_basic_block
|| probably_never_executed_bb_p (bb));
if (there_exists_another_round
- && block_not_hot_enough)
+ && block_not_hot_enough
+ && !short_branch_cheaper (bb))
return true;
else
return false;
@@ -699,7 +742,8 @@ find_traces_1_round (int branch_th, int
& EDGE_CAN_FALLTHRU)
&& !(single_succ_edge (e->dest)->flags & EDGE_COMPLEX)
&& single_succ (e->dest) == best_edge->dest
- && 2 * e->dest->frequency >= EDGE_FREQUENCY (best_edge))
+ && (2 * e->dest->frequency >= EDGE_FREQUENCY (best_edge)
+ || short_branch_cheaper (e->dest)))
{
best_edge = e;
if (dump_file)