https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427
--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
OK, so it turns out the identified commit only allows us to shoot
ourselves in the foot - and there is one branch too few, not too many.
The hottest loop, which consumes most of the time, is:
Percent Instructions
------------------------------------------------
0.03 │ fb0:┌─+add -0x8(%r9,%rcx,4),%eax
5.03 │ │ mov %eax,-0x4(%r13,%rcx,4)
2.48 │ │ mov -0x8(%r8,%rcx,4),%esi
0.02 │ │ add -0x8(%rdx,%rcx,4),%esi
0.06 │ │ cmp %eax,%esi
4.49 │ │ cmovge %esi,%eax
17.17 │ │ mov %ecx,%esi
0.03 │ │ cmp $0xc521974f,%eax
3.50 │ │ cmovl %ebx,%eax <----------- this used to be a branch
21.84 │ │ mov %eax,-0x4(%r13,%rcx,4)
3.88 │ │ add $0x1,%rcx
0.00 │ │ cmp %rdi,%rcx
0.04 │ └──jne fb0
where the marked conditional move was a branch one revision before,
because after fwprop3 the IL looked like:
<bb 16> [local count: 955630217]:
# cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] cstore_249(15)>
[fast_algorithms.c:142:49] MEM <int> [(void *)_72] = cstore_281;
[fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
[fast_algorithms.c:143:10] if (_78 < -987654321)
goto <bb 18>; [50.00%]
else
goto <bb 17>; [50.00%]
<bb 17> [local count: 477815109]:
<bb 18> [local count: 955630217]:
# cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
[fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;
The aforementioned revision turned this into more optimized code, in
which the condition tests the stored value directly instead of
reloading it from memory:
<bb 16> [local count: 955630217]:
# cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] _73(15)>
[fast_algorithms.c:143:10] if (cstore_281 < -987654321)
goto <bb 18>; [50.00%]
else
goto <bb 17>; [50.00%]
<bb 17> [local count: 477815109]:
<bb 18> [local count: 955630217]:
# cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
[fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;
which phiopt3 then changed to:
cstore_248 = MAX_EXPR <cstore_249, -987654321>;
[fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_248;
and the expander apparently always expands MAX_EXPR into a conditional
move if it can(?).
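To make the three shapes in the dumps above easier to compare, here is
a rough C sketch. It is not the benchmark source; the function names,
the slot/sc parameters and the SCORE_FLOOR macro are made up for
illustration, and only the constant is taken from the dumps. All three
functions compute the same result. Whether a given compiler emits a
branch or a conditional move for any of them depends on its version
and flags.

```c
/* Hypothetical name; the value is the constant seen in the dumps. */
#define SCORE_FLOOR (-987654321)

/* Shape before the revision: store, reload through the pointer,
   then clamp with a branch on the reloaded value. */
void clamp_reload(int *slot, int sc)
{
    *slot = sc;
    if (*slot < SCORE_FLOOR)
        *slot = SCORE_FLOOR;
}

/* Shape after the revision: the comparison uses the value that was
   just stored, so no reload is needed and a single store remains. */
void clamp_forwarded(int *slot, int sc)
{
    if (sc < SCORE_FLOOR)
        sc = SCORE_FLOOR;
    *slot = sc;
}

/* Shape after phiopt3: the branch and PHI collapse into a maximum,
   which corresponds to MAX_EXPR <sc, SCORE_FLOOR>. */
void clamp_max(int *slot, int sc)
{
    *slot = sc > SCORE_FLOOR ? sc : SCORE_FLOOR;
}
```

The last form has no control dependence left at the source level, which
is what makes it a natural candidate for a conditional move.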
When I hacked phiopt not to do the transformation for - ehm - any
GIMPLE_COND statement originating from source line 143, I recovered
the original run-time of the benchmark, on both AMD and Intel.