https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |law at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This changed (looking at the #c6 testcase) with r242550, and the change was
intentional: tree if-conversion, except when it enables successful
vectorization, is usually harmful and only rarely useful, but this is one of
those rare cases.  You can get back the previous behavior with
-O3 -ftree-loop-if-convert.

The reason why the RTL CE stuff doesn't do anything here is that we end up with
multiple statements in the if block, so instead of:
  if (_20 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  _18 = _25 ^ 2567483615;

  <bb 6>:
  # _200 = PHI <_25(4), _18(5)>
  MEM[base: _3, offset: 0B] = _200;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
    goto <bb 7>;
  else
    goto <bb 3>;

we get:
  if (_20 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  _18 = _25 ^ 2567483615;
  MEM[base: _3, offset: 0B] = _18;
  ivtmp.11_42 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_42)
    goto <bb 7>;
  else
    goto <bb 3>;

  <bb 6>:
  MEM[base: _3, offset: 0B] = _25;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
    goto <bb 7>;
  else
    goto <bb 3>;

This is created by dom; I wonder what the benefit is in this case.  Even if we
don't improve it in ifcvt.c, if we can make bb 5 with just the xor fall
through into bb 6, i.e. the conditional branch just jumps over the single
insn, then that looks more beneficial than duplicating more stmts.  Jeff?
Though, it seems that multiple passes are keen on doing this kind of stuff, so
in order to avoid that I have to use:
-O3 -fno-tree-dominator-opts -fno-tree-vrp -fno-split-paths
With that we get:
        andl    $1, %edx
        je      .L2
        xorq    %r8, %rax
.L2:
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rdi, %rsi
        jne     .L3
which is IMHO better, but still not the cmov.
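
For reference, here is a minimal C sketch of the three shapes being compared.
The function names are hypothetical and only model the single conditional xor
from the dumps above, not the whole loop.  Whether a compiler emits a branch
or a cmov for any of them depends on version, flags and target, so this only
illustrates that the forms compute the same value.

```c
#include <assert.h>
#include <stdint.h>

/* 0x9908b0df, the xor constant that appears in the GIMPLE dump.  */
#define XOR_CONST 2567483615ULL

/* Branchy shape: the conditional jump skips over the single xor.  */
static uint64_t step_branch(uint64_t x, uint64_t y)
{
  if (y & 1)
    x ^= XOR_CONST;
  return x;
}

/* Select shape: compute the xored value unconditionally, then pick one.
   This is the data flow a cmov sequence implements.  */
static uint64_t step_select(uint64_t x, uint64_t y)
{
  uint64_t xored = x ^ XOR_CONST;
  return (y & 1) ? xored : x;
}

/* Mask shape: -(y & 1) is either 0 or all ones in unsigned arithmetic.  */
static uint64_t step_mask(uint64_t x, uint64_t y)
{
  return x ^ (-(y & 1) & XOR_CONST);
}

/* Check that all three shapes agree on a few inputs.  */
static void check_shapes_agree(void)
{
  for (uint64_t x = 0; x < 64; ++x)
    for (uint64_t y = 0; y < 8; ++y)
      {
        assert(step_branch(x, y) == step_select(x, y));
        assert(step_branch(x, y) == step_mask(x, y));
      }
}
```

Comparing the -O3 assembly for these functions with and without the options
above is one way to see which shape a given GCC version picks.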

The conditional block contains in that case:
(insn 20 19 21 5 (set (reg:DI 105)
        (const_int 2567483615 [0x9908b0df])) 85 {*movdi_internal}
     (nil))
(insn 21 20 22 5 (parallel [
            (set (reg:DI 93 [ _25 ])
                (xor:DI (reg:DI 93 [ _25 ])
                    (reg:DI 105)))
            (clobber (reg:CC 17 flags))
        ]) 443 {*xordi_1}
     (expr_list:REG_DEAD (reg:DI 105)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
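which is too much for ifcvt: there are two insns, because the constant does
not fit into an immediate and has to be loaded into a pseudo first.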
If I change the #c6 testcase to:
void foo(unsigned long *M)
{
  for (unsigned long k = 0; k < 227; ++k)
    {
      unsigned long y =
        ((M[k] & 0xffffffff80000000) | (M[k + 1] & 0x7fffffff));
      M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 567483615 : 0));
    }
}
so that the constant fits into an x86_64 sign-extended 32-bit immediate, then
we have just:
(insn 20 19 21 5 (parallel [
            (set (reg:DI 93 [ _25 ])
                (xor:DI (reg:DI 93 [ _25 ])
                    (const_int 567483615 [0x21d31cdf])))
            (clobber (reg:CC 17 flags))
        ]) 443 {*xordi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
in the conditional block and ifcvt.c can deal with that and we get:
        movq    %rax, %rsi
        xorq    $567483615, %rsi
        andl    $1, %edx
        cmovne  %rsi, %rax
(of course, disabling jump threading and path splitting is still needed for
this).  So, if we can do something about those, perhaps we could extend ifcvt
so that it can also deal with a set of a pseudo to a constant that is needed
by the following instruction, and take that into account in the costs.
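
The difference between the two constants can be checked with a small C
sketch.  fits_simm32 is a hypothetical helper written for this illustration,
not a GCC function; it assumes the x86_64 rule that 64-bit ALU instructions
such as xorq take a 32-bit immediate that is sign-extended to 64 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true if VALUE, viewed as a 64-bit pattern, equals
   the sign extension of its low 32 bits, i.e. the upper 33 bits are all
   zeros or all ones.  */
static bool fits_simm32(uint64_t value)
{
  return value <= 0x7fffffffULL || value >= 0xffffffff80000000ULL;
}

/* The two xor constants discussed in this comment.  */
static void check_thread_constants(void)
{
  /* 0x9908b0df has bit 31 set but zero upper bits: needs a separate load.  */
  assert(!fits_simm32(2567483615ULL));
  /* 0x21d31cdf is below 2^31: usable directly as an immediate.  */
  assert(fits_simm32(567483615ULL));
}
```

With the original constant the check fails, which matches the extra
movdi_internal insn in the first RTL dump.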
