https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104125

--- Comment #4 from Martin Jambor <jamborm at gcc dot gnu.org> ---
Despite spending much more time on this than I wanted, I was not able
to find out anything really interesting.

The function that slowed down significantly is feval (FWIW, perf
annotation points to a conditional jump, depending on a comparison of
0x78(%rsp) against zero, as a new costly instruction).

I have gone back to the commit that introduced the regression and
added a debug counter to switch between the old and new behavior.  The
single change responsible for the entire slowdown happened in the evrp
pass when working on function positional_eval:

@@ -1946,7 +1948,7 @@
   _11 = _9 & _10;
   _95 = PopCount (_11);
   _96 = _95 * 15;
-  _104 = -_96;
+  _104 = _95 * -15;
   _13 = pawntt_84(D)->b_super_strong_square;
   _14 = s_85(D)->BitBoard[4];
   _15 = _13 & _14;

Neither _95 nor _96 has any further uses, and either way, a simple
search through the dumps suggests that even in the "fast" case the
expression is folded to a multiplication by -15 later anyway.

But from here the investigation is difficult: this change alters SSA
numbering in later passes, and the diffs are huge.  Moreover, it also
causes a change in inlining order (as reported by
-fopt-info-optimized):

--- opt-fast    2022-02-01 17:17:50.928639947 +0100
+++ opt-slow    2022-02-01 17:18:07.284728740 +0100
@@ -4,4 +4,4 @@
 neval.cpp:1086:26: optimized:  Inlined trapped_eval.constprop/209 into void
feval(state_t*, int, t_eval_comps*)/163 which now has time 172.599138 and size
156, net change of -25.
-neval.cpp:1067:22: optimized:  Inlined void kingpressure_eval(state_t*,
attackinfo_t*, t_eval_comps*)/162 into void feval(state_t*, int,
t_eval_comps*)/163 which now has time 216.190938 and size 314, net change of
-31.
-neval.cpp:1081:20: optimized:  Inlined void positional_eval(state_t*,
pawntt_t*, t_eval_comps*)/157 into void feval(state_t*, int, t_eval_comps*)/163
which now has time 314.215938 and size 433, net change of -21.
+neval.cpp:1081:20: optimized:  Inlined void positional_eval(state_t*,
pawntt_t*, t_eval_comps*)/157 into void feval(state_t*, int, t_eval_comps*)/163
which now has time 269.624138 and size 274, net change of -21.
+neval.cpp:1067:22: optimized:  Inlined void kingpressure_eval(state_t*,
attackinfo_t*, t_eval_comps*)/162 into void feval(state_t*, int,
t_eval_comps*)/163 which now has time 313.215938 and size 432, net change of
-31.
 neval.cpp:394:22: optimized: basic block part vectorized using 32 byte vectors

On the assembly level, register allocation, spilling and scheduling
are clearly somewhat different, again creating so many differences
that I cannot tell what is going on from a simple diff.
