[Bug tree-optimization/94427] 456.hmmer is 8-17% slower when compiled at -Ofast than with GCC 9

jamborm at gcc dot gnu.org Tue, 31 Mar 2020 16:12:40 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94427


--- Comment #1 from Martin Jambor <jamborm at gcc dot gnu.org> ---
OK, so it turns out the identified commit only allows us to shoot
ourselves in the foot - and there is one branch too few, not too many.

The hottest loop, consuming most of the time, is:

Percent         Instructions
------------------------------------------------
  0.03 │ fb0:┌─+add     -0x8(%r9,%rcx,4),%eax
  5.03 │     │  mov     %eax,-0x4(%r13,%rcx,4)
  2.48 │     │  mov     -0x8(%r8,%rcx,4),%esi
  0.02 │     │  add     -0x8(%rdx,%rcx,4),%esi
  0.06 │     │  cmp     %eax,%esi
  4.49 │     │  cmovge  %esi,%eax
 17.17 │     │  mov     %ecx,%esi
  0.03 │     │  cmp     $0xc521974f,%eax
  3.50 │     │  cmovl   %ebx,%eax   <----------- this used to be a branch
 21.84 │     │  mov     %eax,-0x4(%r13,%rcx,4)
  3.88 │     │  add     $0x1,%rcx
  0.00 │     │  cmp     %rdi,%rcx
  0.04 │     └──jne     fb0

where the marked conditional move was a branch one revision before,
because, after fwprop3 the IL looked like:

  <bb 16> [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] cstore_249(15)>
  [fast_algorithms.c:142:49] MEM <int> [(void *)_72] = cstore_281;
  [fast_algorithms.c:143:13] _78 = [fast_algorithms.c:143:13] *_72;
  [fast_algorithms.c:143:10] if (_78 < -987654321)
    goto <bb 18>; [50.00%]
  else
    goto <bb 17>; [50.00%]

  <bb 17> [local count: 477815109]:

  <bb 18> [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;

The aforementioned revision turned this into more optimized code:

  <bb 16> [local count: 955630217]:
  # cstore_281 = PHI <[fast_algorithms.c:142:53] sc_223(14),
[fast_algorithms.c:142:53] _73(15)>
  [fast_algorithms.c:143:10] if (cstore_281 < -987654321)
    goto <bb 18>; [50.00%]
  else
    goto <bb 17>; [50.00%]

  <bb 17> [local count: 477815109]:

  <bb 18> [local count: 955630217]:
  # cstore_250 = PHI <[fast_algorithms.c:143:33] -987654321(16),
[fast_algorithms.c:143:33] cstore_281(17)>
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_250;

which phiopt3 then changed to:

  cstore_248 = MAX_EXPR <cstore_249, -987654321>;
  [fast_algorithms.c:143:29] MEM <int> [(void *)_72] = cstore_248;

and the expander apparently always expands MAX_EXPR into a conditional
move if it can(?).
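
For illustration only, here is a minimal C sketch of the two equivalent
shapes of the clamp discussed above.  It is not the hmmer source: the
names NEG_LIMIT, clamp_branch and clamp_max are hypothetical, and only
the constant -987654321 is taken from the dumps.  Whether a given
compiler emits a branch or a conditional move for either function is up
to that compiler and its options; the sketch only shows that both forms
compute the same value, and that the immediate 0xc521974f in the
listing is the same constant as a 32-bit pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: stand-in name for the lower bound seen in the IL. */
#define NEG_LIMIT (-987654321)

/* The immediate in "cmp $0xc521974f,%eax" is NEG_LIMIT viewed as an
   unsigned 32-bit value (conversion is modulo 2^32). */
_Static_assert((uint32_t)NEG_LIMIT == 0xc521974fu,
               "0xc521974f is -987654321 in 32-bit two's complement");

/* Shape of the code before phiopt3: compare, then conditionally
   overwrite the value with the lower bound. */
int clamp_branch(int v)
{
    if (v < NEG_LIMIT)
        v = NEG_LIMIT;
    return v;
}

/* Shape of the code after phiopt3: the same result written as a
   maximum, i.e. MAX_EXPR <v, NEG_LIMIT>. */
int clamp_max(int v)
{
    return v > NEG_LIMIT ? v : NEG_LIMIT;
}
```

Both functions return -987654321 for any input below that bound and
return the input unchanged otherwise, so the transformation is a pure
change of form; the slowdown comes from how that form is expanded.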

When I hacked phiopt not to do the transformation for - ehm - any
GIMPLE_COND statement originating from source line 143, I recovered
the original run-time of the benchmark.  On both AMD and Intel.
