https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80520
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jakub at gcc dot gnu.org,
                   |                            |law at gcc dot gnu.org

--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This changed (looking at the #c6 testcase) with r242550, and the change was
intentional: tree if-conversion, except when it enables successful
vectorization, is usually harmful and only rarely useful, but this is one of
those rare cases.
You can get back the previous behavior with -O3 -ftree-loop-if-convert.
The reason why the RTL CE stuff doesn't do anything here is that we end up with
multiple statements in the if block, so instead of:
  if (_20 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  _18 = _25 ^ 2567483615;

  <bb 6>:
  # _200 = PHI <_25(4), _18(4)>
  MEM[base: _3, offset: 0B] = _200;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
    goto <bb 7>;
  else
    goto <bb 3>;
we get:
  if (_20 != 0)
    goto <bb 5>;
  else
    goto <bb 6>;

  <bb 5>:
  _18 = _25 ^ 2567483615;
  MEM[base: _3, offset: 0B] = _18;
  ivtmp.11_42 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_42)
    goto <bb 7>;
  else
    goto <bb 3>;

  <bb 6>:
  MEM[base: _3, offset: 0B] = _25;
  ivtmp.11_29 = ivtmp.11_30 + 8;
  if (_6 == ivtmp.11_29)
    goto <bb 7>;
  else
    goto <bb 3>;
This is created by dom; I wonder what the benefit is in this case.  Even if we
don't improve it in ifcvt.c, if we can make bb 5 with just the xor fall through
into bb 6, i.e. the conditional branch just jumps over the single insn, then
that looks more beneficial than duplicating more stmts.  Jeff?
Though, it seems that multiple passes are keen on doing this kind of stuff, so
in order to avoid that I have to use:
-O3 -fno-tree-dominator-opts -fno-tree-vrp -fno-split-paths
With that we get:
        andl    $1, %edx
        je      .L2
        xorq    %r8, %rax
.L2:
        movq    %rax, (%rdi)
        addq    $8, %rdi
        cmpq    %rdi, %rsi
        jne     .L3
which is IMHO better, but still not the cmov.
The conditional block contains in that case:
(insn 20 19 21 5 (set (reg:DI 105)
        (const_int 2567483615 [0x9908b0df])) 85 {*movdi_internal}
     (nil))
(insn 21 20 22 5 (parallel [
            (set (reg:DI 93 [ _25 ])
                (xor:DI (reg:DI 93 [ _25 ])
                    (reg:DI 105)))
            (clobber (reg:CC 17 flags))
        ]) 443 {*xordi_1}
     (expr_list:REG_DEAD (reg:DI 105)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
which is too much for ifcvt.  If I change the #c6 testcase to:
void foo(unsigned long *M)
{
  for (unsigned long k = 0; k < 227; ++k)
    {
      unsigned long y = ((M[k] & 0xffffffff80000000) | (M[k + 1] & 0x7fffffff));
      M[k] = (M[k + 397] ^ (y >> 1) ^ ((y & 1) ? 567483615 : 0));
    }
}
so that the constant fits into an x86_64 signed 32-bit immediate, then we have
just:
(insn 20 19 21 5 (parallel [
            (set (reg:DI 93 [ _25 ])
                (xor:DI (reg:DI 93 [ _25 ])
                    (const_int 567483615 [0x21d31cdf])))
            (clobber (reg:CC 17 flags))
        ]) 443 {*xordi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
in the conditional block and ifcvt.c can deal with that and we get:
        movq    %rax, %rsi
        xorq    $567483615, %rsi
        andl    $1, %edx
        cmovne  %rsi, %rax
(of course disabling the jump threading and path splitting is still needed for
this).
So, if we can do something about those, perhaps we could extend ifcvt so that
it could deal with a set of a pseudo to a constant needed for the following
instruction too and take it into account in the costs.