https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98713

--- Comment #7 from Roger Sayle <roger at nextmovesoftware dot com> ---
I agree in the general case, a conditional jump (that depends only on the
condition flags) potentially has a shorter dependence chain than a cmov (which
depends on the condition flags and two registers).  But in this case, the
condition codes can't be determined any faster than the register operands.

I believe the branch prediction argument is a red herring.  Yes, with clever
hardware and luck, the CPU can predict which instruction will be executed next
after a conditional jump, but it can always know which instruction is executed
next after a cmov.  A cmov (and multiple cmovs) can be scheduled and executed
out-of-order without speculation.  Hence branch prediction is only a factor
when dependency chain lengths are an issue/unequal (or the cmov is slower than
a correctly predicted branch).

An understanding of the data distribution is also irrelevant if the best/fastest
(correctly predicted) branch implementation is no better/faster than the cmov.

But I can also imagine microarchitectures where predicted conditional jumps are
free (requiring zero cycles) and where the condition code "test" is eliminated
having been set/forwarded from an earlier instruction, in which case a
zero-latency abs is about as good as you can get.  Are we assuming a target
with "predicted_branch_cost < conditional_move_cost"? I wouldn't be surprised
if GCC internally assumes these are both always COSTS_N_INSNS(1).

If conditional_move_cost <= predicted_branch_cost <= mispredicted_branch_cost
then the cmov should always be preferred (independent of branch probabilities or
__builtin_expect hints).  If predicted_branch_cost <= mispredicted_branch_cost
<= conditional_move_cost, the branch should always be preferred, and the cmov
shouldn't be part of the ISA.  The interesting domain of trade-offs is
where/when predicted_branch_cost < conditional_move_cost <=
mispredicted_branch_cost (which I'm not yet convinced is the case here).
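The three orderings above can be sketched as a toy cost model.  This is only
an illustration: the struct, the function names and the linear
expected-cost formula are assumptions made for the example, not GCC internals
or measured numbers.

```c
#include <stdbool.h>

/* Hypothetical per-target costs, in arbitrary units (not GCC's rtx costs). */
struct costs {
    double predicted_branch;     /* branch that was predicted correctly */
    double mispredicted_branch;  /* branch that was mispredicted */
    double cmov;                 /* conditional move */
};

/* Assumed model: the branch cost is a weighted average of the predicted and
   mispredicted costs, with mispredict_rate in [0, 1]. */
static double expected_branch_cost(const struct costs *c, double mispredict_rate)
{
    return (1.0 - mispredict_rate) * c->predicted_branch
         + mispredict_rate * c->mispredicted_branch;
}

/* True when the cmov is no more expensive than the expected branch cost. */
static bool prefer_cmov(const struct costs *c, double mispredict_rate)
{
    return c->cmov <= expected_branch_cost(c, mispredict_rate);
}
```

Under this model, if cmov <= predicted_branch the answer is "cmov" for every
misprediction rate, and only when predicted_branch < cmov does the rate (and
hence the data distribution) change the outcome.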

Do we have any numbers that show the branch is better (for this case) on real
hardware, that can't be explained by other factors?  For example, on ABS where
the inputs are always positive.
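
A minimal sketch of the two source-level forms one might time for such a
comparison.  The function names are made up for this example, and the compiler
is free to emit a branch or a cmov for either form, so the generated assembly
has to be checked before drawing conclusions from any timings.

```c
#include <stddef.h>
#include <stdint.h>

/* Written with an explicit conditional; unsigned arithmetic avoids
   undefined behaviour for INT32_MIN. */
static uint32_t abs_branchy(int32_t x)
{
    uint32_t u = (uint32_t)x;
    if (x < 0)
        u = 0u - u;
    return u;
}

/* Written as a select between two already-computed values, the shape that
   typically maps onto a conditional move. */
static uint32_t abs_select(int32_t x)
{
    uint32_t u = (uint32_t)x;
    uint32_t neg = 0u - u;
    return x < 0 ? neg : u;
}

/* Sum |v[i]| over an array, so that different input distributions
   (all positive, alternating sign, random) can be fed to either variant. */
static uint64_t sum_abs(uint32_t (*f)(int32_t), const int32_t *v, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += f(v[i]);
    return total;
}
```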
