https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89120
--- Comment #2 from Antony Polukhin <antoshkka at gmail dot com> --- Long story short: I've found no way to improve the standard library code to always work faster. I'm in favor of closing this ticket as invalid/wont fix. Long story: I've tried to add a specialization of minmax_element algorithm for std::less comparators and arithmetic types. That specialization was doing more comparisons but in a more predictable way. On big datasets the performance increased, but decreased on small datasets. Then I've tried another approach. If the comparison of __first with __next is barely predictable, then just avoid branching on it. Portable solution: bool __b = __comp(__next, __first); _ForwardIterator __pots[3] = {__first, __next, __first}; _ForwardIterator __pot_min = *(__pots + __b); _ForwardIterator __pot_max = *(__pots + __b + 1); Special case for random access iterators: bool __b = __comp(__next, __first); _ForwardIterator __pot_min = __first, __pot_max = __next; __pot_min += b; __pot_max -= b; Unfortunately both those approaches add some overhead for small datasets. Another disadvantage, is that those approaches produce orthogonal results on different compilers: GCC-9 performance gets better on big datasets ----------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------- naive_minmax/2 3 ns 3 ns 247522237 naive_minmax/8 7 ns 7 ns 103044422 naive_minmax/262144 1715635 ns 1710406 ns 407 naive_minmax/1048576 6970755 ns 6947034 ns 101 branchless_minmax/2 8 ns 8 ns 81324904 branchless_minmax/8 30 ns 30 ns 23494608 branchless_minmax/262144 457287 ns 456412 ns 1529 branchless_minmax/1048576 4267914 ns 4219969 ns 363 Clang-9 performance degrades on big datasets ----------------------------------------------------------------- Benchmark Time CPU Iterations ----------------------------------------------------------------- naive_minmax/2 2 ns 2 ns 380928404 naive_minmax/8 7 ns 7 ns 92642970 naive_minmax/262144 262921 ns 262288 ns 2630 naive_minmax/1048576 1149407 ns 1147626 ns 618 branchless_minmax/2 2 ns 2 ns 307146020 branchless_minmax/8 10 ns 10 ns 74417142 branchless_minmax/262144 425880 ns 425241 ns 1637 branchless_minmax/1048576 1747785 ns 1745725 ns 397 Final attempt. Different compilers optimize the algorithm differently. Clang shows good performance on big datasets with >4k elements, GCC - on medium sized datasets with 128-1k elements. Maybe providing more info on probabilities could help both compilers to produce better code. But looks like heuristics already deduce the probabilities to be close to 0.5, __builtin_expect_with_probability(__b, true, 0.5) changed nothing in the assembly https://godbolt.org/z/PqWoaKfhW