So, I have a machine that has many styles of branches; among them, a normal one and a short version. The short version is sometimes cheaper. The regular one costs 1 cycle when predicted and 7 when mispredicted, and the cost of a misprediction can be substantially higher depending on what is in the cache. The short branch form only applies to forward jumps over 1-n instructions; its cost can be more expensive than a regular branch, but it is usually faster. Another benefit of the short version is that it doesn't pollute the branch cache, thus improving the prediction rates of other branches. To obtain the cost of each style, I need to run some math taking into consideration the expected misprediction rates, the execution counts (predicted or from gcov), and the distance skipped over. I also determine whether the short branch style applies; if it doesn't, I just say the regular form is cheaper.
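To sketch the kind of cost math I mean: the cycle counts and the skip limit below are hypothetical placeholders standing in for the port's real numbers, not values from the patch.

```c
/* Hypothetical per-port cost parameters; a real port would supply
   its own numbers, these are only for illustration.  */
#define PREDICTED_COST        1  /* regular branch, predicted      */
#define MISPREDICTED_COST     7  /* regular branch, mispredicted   */
#define SHORT_BRANCH_COST     3  /* assumed cost of the short form */
#define SHORT_BRANCH_MAX_SKIP 8  /* short form skips 1..n insns    */

/* Expected cost of a regular branch executed COUNT times with a
   misprediction rate MISS_RATE in [0.0, 1.0].  */
static double
regular_branch_cost (long count, double miss_rate)
{
  return count * ((1.0 - miss_rate) * PREDICTED_COST
		  + miss_rate * MISPREDICTED_COST);
}

/* Expected cost of the short form, or -1.0 when it does not apply
   because the skipped distance is outside the allowed range.  */
static double
short_branch_cost (long count, int insns_skipped)
{
  if (insns_skipped < 1 || insns_skipped > SHORT_BRANCH_MAX_SKIP)
    return -1.0;		/* must fall back to the regular form */
  return (double) count * SHORT_BRANCH_COST;
}
```

The real decision would also fold in the gcov or predicted execution counts mentioned above; this only shows the shape of the comparison.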
We can see evidence of the lack of completeness in the current code:

  if ((flag_reorder_blocks || flag_reorder_blocks_and_partition)
      /* Don't reorder blocks when optimizing for size because extra jump
	 insns may be created; also barrier may create extra padding.

	 More correctly we should have a block reordering mode that tried to
	 minimize the combined size of all the jumps.  This would more or less
	 automatically remove extra jumps, but would also try to use more
	 short jumps instead of long jumps.  */
      && optimize_function_for_speed_p (cfun))
    {
      reorder_basic_blocks ();

Essentially, I think the optimization could help other ports if they have a well-balanced implementation, if they are limited by branch prediction, or if the branch cost is non-zero. Chips that have lots of resources for prediction and/or can follow multiple paths aren't likely to see any benefit.

So, here is the patch in progress. It should be close, but the outstanding questions are: are there any other ports that benefit from it? If not, I'd probably say it's not suitable for the tree in this form. And are there other, better ways of doing this?

If you like the concrete, consider this:

	jump equal, L5		predicted equal
	nop
  L5:	nop2
	ret

versus:

	jump not equal, L6	predicted equal
  L7:	nop2
	ret
  L6:	nop
	b L7

So, if unconditional branches were 10 cycles and conditional branches 12 cycles, with T the number of times executed when the condition is true and F the number of times false, the branch costs would be (T+F)*12 in the first case and (T+F)*12 + F*10 in the second case. The latter is _always_ more expensive if F is non-zero. Further, this is true for all such non-zero costs, as long as there is a single F, ignoring cache issues. Arguably, we could examine the cost of a branch and always perform this optimization if the cost is high enough, even without a short branch form. That isn't included in the patch.
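The arithmetic above can be checked mechanically. A small standalone sketch, using the same assumed 10- and 12-cycle numbers from the example (not measured values):

```c
/* Cost comparison from the example above, with the assumed numbers:
   unconditional branch = 10 cycles, conditional branch = 12 cycles.
   T and F are the execution counts of the true and false outcomes.  */

/* First layout: the cold code is placed out of line, so only the
   conditional branch is ever executed.  */
static long
layout_a_cost (long t, long f)
{
  return (t + f) * 12;
}

/* Second layout: the cold path executes an extra unconditional
   branch back, paid once per false outcome.  */
static long
layout_b_cost (long t, long f)
{
  return (t + f) * 12 + f * 10;
}
```

Since the two expressions differ only by the non-negative term F*10, the second layout can never be cheaper, matching the claim in the text.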
A better patch would be to have the backend figure out the misprediction rates, do the math, and decide based on the costs of all the branches alone. Thoughts?
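To make that concrete, here is a minimal standalone sketch of the kind of decision such a backend hook could make. The function name, the cycle counts, and the static-prediction model (assume the less frequent direction is the mispredicted one) are all assumptions for illustration, not part of the patch:

```c
#include <stdbool.h>

/* Hypothetical decision routine of the sort a port hook could use:
   estimate the misprediction rate from the taken/not-taken counts
   and pick whichever branch form has the lower expected cost.  The
   cycle numbers here are made up.  */
static bool
short_branch_is_cheaper (long taken, long not_taken, int insns_skipped)
{
  long total = taken + not_taken;
  double miss_rate, regular_cost, short_cost;

  /* The short form only applies to small forward skips.  */
  if (total == 0 || insns_skipped < 1 || insns_skipped > 8)
    return false;

  /* Crude static model: the less frequent direction is the one
     that gets mispredicted.  */
  miss_rate = (double) (taken < not_taken ? taken : not_taken) / total;

  regular_cost = total * ((1.0 - miss_rate) * 1 + miss_rate * 7);
  short_cost = total * 3.0;	/* assumed fixed 3-cycle short form */

  return short_cost < regular_cost;
}
```

A real hook would also account for the cache effects and branch-cache pollution discussed earlier, which this sketch ignores.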
Index: bb-reorder.c
===================================================================
--- bb-reorder.c	(revision 1310)
+++ bb-reorder.c	(revision 1311)
@@ -192,6 +192,48 @@ static void fix_edges_for_rarely_execute
 static void fix_crossing_conditional_branches (void);
 static void fix_crossing_unconditional_branches (void);
 
+/* Return true iff we should prefer the less frequent code
+   of basic block BB to be placed immediately after A.  This
+   is useful when the target has a lower cost branch instruction
+   that can avoid the misprediction costs of a normal branch.  */
+static bool
+short_branch_cheaper (const_basic_block bb ATTRIBUTE_UNUSED)
+{
+#ifdef TARGET_SHORT_BRANCH_CHEAPER
+  basic_block pbb, nbb;
+  edge e;
+  edge_iterator ei;
+  bool found = false;
+
+  /* When we have:
+
+	A
+	|\
+	| C
+	|/
+	B
+
+     we check the port to see if TARGET_SHORT_BRANCH_CHEAPER (A) is
+     true.  It can use the frequencies, the misprediction costs and
+     branch distance to decide.  We only handle the simple case of a
+     single predecessor to C (bb), a single successor to C (bb).  */
+  if (!single_pred_p (bb))
+    return false;
+  pbb = single_pred (bb);
+  if (!single_succ_p (bb))
+    return false;
+  nbb = single_succ (bb);
+  FOR_EACH_EDGE (e, ei, nbb->preds)
+    if (e->src == pbb)
+      found = true;
+  if (!found)
+    return false;
+
+  if (TARGET_SHORT_BRANCH_CHEAPER (BB_END (pbb)))
+    return true;
+#endif
+
+  return false;
+}
+
 /* Check to see if bb should be pushed into the next round of trace
    collections or not.  Reasons for pushing the block forward are 1).
   If the block is cold, we are doing partitioning, and there will be
@@ -214,7 +256,8 @@ push_to_next_round_p (const_basic_block
 		  || probably_never_executed_bb_p (bb));
 
   if (there_exists_another_round
-      && block_not_hot_enough)
+      && block_not_hot_enough
+      && !short_branch_cheaper (bb))
     return true;
   else
     return false;
@@ -699,7 +742,8 @@ find_traces_1_round (int branch_th, int
 		      & EDGE_CAN_FALLTHRU)
 		  && !(single_succ_edge (e->dest)->flags & EDGE_COMPLEX)
 		  && single_succ (e->dest) == best_edge->dest
-		  && 2 * e->dest->frequency >= EDGE_FREQUENCY (best_edge))
+		  && (2 * e->dest->frequency >= EDGE_FREQUENCY (best_edge)
+		      || short_branch_cheaper (e->dest)))
 		{
 		  best_edge = e;
 		  if (dump_file)