https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121332
Bug ID: 121332
Summary: [16 Regression] 8-16% slowdown of 519.lbm_r on AMD Zen 2 since r16-2601-ge8a51144c02e1c
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: pheeck at gcc dot gnu.org
CC: rguenth at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux
Target: x86_64-linux

As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=286.477.0

there was an 8% (16% on another machine I measured separately) exec time slowdown of the 519.lbm_r SPEC 2017 benchmark when run with -Ofast -march=native -flto -fprofile-use on an AMD Zen2 machine.

I bisected it to r16-2601-ge8a51144c02e1c.

e8a51144c02e1cf210db5763e435802ac6fa6ad9 is the first bad commit
commit e8a51144c02e1cf210db5763e435802ac6fa6ad9
Author: Richard Biener <rguent...@suse.de>
Date:   Tue Jul 29 10:05:32 2025 +0200

    tree-optimization/120687 - avoid disturbing reduction chains in reassoc

    Reassoc carefully ranks operands to form reduction chains for
    vectorization so we are careful to not apply any width related
    changes in the early pass.  Unfortunately we are not careful
    enough.  The following gates fma related re-ordering and also
    the >= 3 ops tail "optimization" which is the culprit here.

    This does not fix the reported inefficient vectorization when
    using signed integer reductions yet.

            PR tree-optimization/120687
            * tree-ssa-reassoc.cc (reassociate_bb): Do not disturb
            the sorted operand order in the early pass.
            * tree-vect-slp.cc (vect_analyze_slp): Dump when a detected
            reduction chain fails SLP discovery.

            * gcc.dg/vect/pr120687-1.c: New testcase.
            * gcc.dg/vect/pr120687-2.c: Likewise.
 gcc/testsuite/gcc.dg/vect/pr120687-1.c | 16 ++++++++++++++++
 gcc/testsuite/gcc.dg/vect/pr120687-2.c | 17 +++++++++++++++++
 gcc/tree-ssa-reassoc.cc                | 10 ++++++----
 gcc/tree-vect-slp.cc                   |  3 +++
 4 files changed, 42 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr120687-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr120687-2.c
bisect found first bad commit

This is a ~4% regression against GCC 15. See the comparison here:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1077.477.0&plot.1=1207.477.0&plot.2=286.477.0&

Btw, r16-2601 also introduces an 18-30% speedup with -Ofast -march=native -flto (that is, with the same flags but without PGO).

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.477.0

I wondered whether the commit helps avoid the spill from pr120941. That's not the case, though: I still see the spill in the binary.

Anyway, the commit seems like a net gain performance-wise. I'm just reporting that there may be some room to improve the PGO case.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)