https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78200
Bug ID: 78200
Summary: [7 regression]: 429.mcf of cpu2006 regresses in GCC trunk for avx2 target.
Product: gcc
Version: tree-ssa
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: venkataramanan.kumar at amd dot com
Target Milestone: ---
I noticed a 5% regression in 429.mcf from CPU2006 on x86_64 AVX2 (bdver4) with GCC
trunk: gcc version 7.0.0 20161028 (experimental) (GCC).
Flags used: -O3 -mavx2 -mprefer-avx128
The regression is not seen with GCC 6.1, or with GCC trunk at -O3 -mavx -mprefer-avx128.
The assembly differs in the hot function primal_bea_mpp in pbeampp.c:
-O3 -mavx -mprefer-avx128            | -O3 -mavx2 -mprefer-avx128
.L98:                                | .L98:
------------------------------------ | jle .L97        <== order of comparison is different.
cmpl $2, %r9d                        | cmpl $2, %r9d
jne .L97                             | jne .L97
testq %rdi, %rdi                     | ------------------------------------
jle .L97                             | ------------------------------------
.L99:                                | .L99:
addq $1, %r13                        | addq $1, %r13
movq %rdi, %r12                      | movq %rdi, %r12
movq perm(,%r13,8), %r9              | movq perm(,%r13,8), %r9
sarq $63, %r12                       | sarq $63, %r12
movq %rdi, 8(%r9)                    | movq %rdi, 8(%r9)
[12 lines folded: xorq %r12, %rdi .] | [12 lines folded: xorq %r12, %rdi .]
jle .L97                             | jle .L97
movq 8(%rax), %r14                   | movq 8(%rax), %r14
movq (%rax), %rdi                    | movq (%rax), %rdi
subq (%r14), %rdi                    | subq (%r14), %rdi
movq 16(%rax), %r14                  | movq 16(%rax), %r14
addq (%r14), %rdi                    | addq (%r14), %rdi
jns .L98                             | cmpq $0, %rdi
------------------------------------ | jge .L98
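Read at the source level, both listings branch on the same two conditions: a value compared with 2, and a value tested for being positive. They differ in which test comes first, and the -mavx2 version also needs an explicit "cmpq $0, %rdi" where the -mavx version uses "jns". The C sketch below is a hypothetical reduction, not code from pbeampp.c: ident is assumed to stand for the value in %r9d and red_cost for the value in %rdi (presumably _512 and red_cost_503 in the GIMPLE dumps), and all function names are made up. Since neither comparison has side effects, both orders always give the same answer; the two builds differ in code layout, not in semantics.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the test at .L98; not mcf's actual source.
   'ident' models the value compared with 2 (assumed to be in %r9d),
   'red_cost' models the value tested against 0 (assumed to be in %rdi). */

/* -mavx order: cmpl $2 / jne first, then testq / jle. */
static bool ident_first(int ident, long red_cost)
{
    if (ident != 2)
        return false;
    if (red_cost <= 0)
        return false;
    return true;
}

/* -mavx2 order: jle (red_cost <= 0) first, then cmpl $2 / jne. */
static bool cost_first(int ident, long red_cost)
{
    if (red_cost <= 0)
        return false;
    if (ident != 2)
        return false;
    return true;
}

/* Both orders compute ident == 2 && red_cost > 0; check them on a small grid. */
static void check_orders_agree(void)
{
    static const long costs[] = { -5, -1, 0, 1, 5 };
    for (int ident = 0; ident <= 3; ident++)
        for (unsigned i = 0; i < sizeof costs / sizeof costs[0]; i++)
            assert(ident_first(ident, costs[i]) == cost_first(ident, costs[i]));
}
```

Calling check_orders_agree() exercises both orderings; a mismatch would trip the assert, which cannot happen because both functions return true exactly when ident == 2 and red_cost > 0.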
The optimized GIMPLE dump shows:
GCC trunk -O3 -mavx -mprefer-avx128
;; basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;; Invalid sum of incoming frequencies 1216, should be 1067
;; prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;; pred: 18 [64.0%] (FALSE_VALUE,EXECUTABLE)
# RANGE [0, 1]
_496 = _512 == 2;
# RANGE [0, 1]
_495 = red_cost_503 > 0;
# RANGE [0, 1]
_494 = _495 & _496;
if (_494 != 0)
goto <bb 21>;
else
goto <bb 22>;
GCC trunk -O3 -mavx2 -mprefer-avx128
;; basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;; Invalid sum of incoming frequencies 1216, should be 1067
;; prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;; pred: 18 [64.0%] (FALSE_VALUE,EXECUTABLE)
# RANGE [0, 1]
_496 = _512 == 2;
# RANGE [0, 1]
_495 = red_cost_503 > 0;
# RANGE [0, 1]
_494 = _495 & _496; <== operation order is different on AVX2.
if (_494 != 0)
goto <bb 21>;
else
goto <bb 22>;
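Both optimized dumps have already replaced the short-circuit test with a branch-free form: each comparison becomes a 0/1 value (hence "# RANGE [0, 1]"), the two are combined with a bitwise "&", and a single branch tests the result. Here is a hypothetical C rendering of basic block 20; local names are copied from the SSA names, the assumption that _512 is the value compared with 2 is ours, and the function names are invented for illustration. Swapping the operands of "&" cannot change the result, so a difference in operand order alone is semantically neutral; its only possible effect is on the code later passes generate.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical C rendering of basic block 20 from the optimized dump.
   Parameter and local names mirror the SSA names; they are not mcf's names. */
static bool bb20_branch_taken(int _512, long red_cost_503)
{
    bool _496 = _512 == 2;          /* # RANGE [0, 1] */
    bool _495 = red_cost_503 > 0;   /* # RANGE [0, 1] */
    bool _494 = _495 & _496;        /* bitwise AND, no short-circuit */
    return _494 != 0;               /* true: goto <bb 21>; false: goto <bb 22> */
}

/* The same block with the operands of & swapped (_496 & _495). */
static bool bb20_branch_taken_swapped(int _512, long red_cost_503)
{
    bool _496 = _512 == 2;
    bool _495 = red_cost_503 > 0;
    return (_496 & _495) != 0;
}

/* & on 0/1 values is commutative, so both versions must always agree. */
static void check_operand_order_is_neutral(void)
{
    static const long costs[] = { -3, 0, 1, 42 };
    for (int v = 0; v <= 3; v++)
        for (unsigned i = 0; i < sizeof costs / sizeof costs[0]; i++)
            assert(bb20_branch_taken(v, costs[i]) ==
                   bb20_branch_taken_swapped(v, costs[i]));
}
```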
The operation order is changed in pbeampp.c.171t.reassoc2:
;; basic block 20, loop depth 2, count 0, freq 1067, maybe hot
;; Invalid sum of incoming frequencies 1216, should be 1067
;; prev block 19, next block 21, flags: (NEW, REACHABLE, VISITED)
;; pred: 18 [64.0%] (FALSE_VALUE,EXECUTABLE)
_496 = _512 == 2;
_495 = red_cost_503 > 0;
_494 = _495 & _496;
if (_494 != 0)
goto <bb 21>;
else
goto <bb 22>;
Looking further back, I found that tree if-conversion generates non-canonical
GIMPLE:
pbeampp.c.155t.ifcvt
;; basic block 27, loop depth 2, count 0, freq 1067, maybe hot
;; Invalid sum of incoming frequencies 1216, should be 1067
;; prev block 26, next block 28, flags: (NEW, REACHABLE, VISITED)
;; pred: 25 [64.0%] (FALSE_VALUE,EXECUTABLE)
_496 = _512 == 2;
_495 = red_cost_503 > 0;
_494 = _496 & _495; <== comparison order is the same, but the LHS of "&" has the
greater SSA version number.
if (_494 != 0)
goto <bb 28>;
else
goto <bb 29>;
pbeampp.c.154t.ch_vect
;; basic block 23, loop depth 2, count 0, freq 1067, maybe hot
;; Invalid sum of incoming frequencies 1216, should be 1067
;; prev block 22, next block 24, flags: (NEW, REACHABLE, VISITED)
;; pred: 21 [64.0%] (FALSE_VALUE,EXECUTABLE)
_340 = _23 == 2;
_341 = red_cost_86 > 0;
_338 = _340 & _341; <== comparison order is the same here.
if (_338 != 0)
goto <bb 24>;
else
goto <bb 25>;
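The "non-canonical" remark refers to the convention of ordering the SSA-name operands of a commutative operation by version number, lowest first: ch_vect's _340 & _341 follows it, ifcvt's _496 & _495 does not. The sketch below is a simplified, hypothetical model of that rule; the struct and function names are made up. GCC's own check (tree_swap_operands_p) also handles constants and other operand kinds, so treat this as an approximation.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model: an SSA name is represented only by its version number
   (the 496 in _496). */
struct ssa_operand {
    unsigned version;
};

/* A binary commutative statement such as _494 = _496 & _495. */
struct commutative_stmt {
    struct ssa_operand op0;
    struct ssa_operand op1;
};

/* Canonical form in this model: the lower SSA version is the first operand. */
static bool is_canonical(const struct commutative_stmt *s)
{
    return s->op0.version <= s->op1.version;
}

/* Swap the operands when they are out of order. Only valid for commutative
   codes such as BIT_AND_EXPR, where a & b == b & a. */
static void canonicalize_commutative(struct commutative_stmt *s)
{
    if (!is_canonical(s)) {
        struct ssa_operand tmp = s->op0;
        s->op0 = s->op1;
        s->op1 = tmp;
    }
    assert(is_canonical(s));
}
```

Applied to ifcvt's _494 = _496 & _495, the model swaps the operands into _495 & _496, the order seen after reassoc2, while ch_vect's _338 = _340 & _341 is left unchanged.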
Compiling pbeampp.c with -O3 -mavx2 -mprefer-avx128 -fno-tree-loop-if-conversion,
and the rest of the benchmark with -O3 -mavx2 -mprefer-avx128, brings the score
back to the level of -O3 -mavx, or of GCC 6.1 at -O3 -mavx2.