[Bug tree-optimization/121332] New: [16 Regression] 8-16% slowdown of 519.lbm_r on AMD Zen 2 since r16-2601-ge8a51144c02e1c

pheeck at gcc dot gnu.org via Gcc-bugs Thu, 31 Jul 2025 03:06:29 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121332


            Bug ID: 121332
           Summary: [16 Regression] 8-16% slowdown of 519.lbm_r on AMD Zen
                    2 since r16-2601-ge8a51144c02e1c
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: pheeck at gcc dot gnu.org
                CC: rguenth at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

As seen here

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=286.477.0

the execution time of the 519.lbm_r SPEC 2017 benchmark regressed by 8%
(16% on another machine I measured separately) when built with
-Ofast -march=native -flto -fprofile-use and run on an AMD Zen 2 machine.
I bisected the slowdown to r16-2601-ge8a51144c02e1c.

e8a51144c02e1cf210db5763e435802ac6fa6ad9 is the first bad commit
commit e8a51144c02e1cf210db5763e435802ac6fa6ad9
Author: Richard Biener <rguent...@suse.de>
Date:   Tue Jul 29 10:05:32 2025 +0200

    tree-optimization/120687 - avoid disturbing reduction chains in reassoc

    Reassoc carefully ranks operands to form reduction chains for
    vectorization so we are careful to not apply any width related
    changes in the early pass.  Unfortunately we are not careful
    enough.  The following gates fma related re-ordering and also
    the >= 3 ops tail "optimization" which is the culprit here.

    This does not fix the reported inefficient vectorization when
    using signed integer reductions yet.

            PR tree-optimization/120687
            * tree-ssa-reassoc.cc (reassociate_bb): Do not disturb
            the sorted operand order in the early pass.
            * tree-vect-slp.cc (vect_analyze_slp): Dump when a detected
            reduction chain fails SLP discovery.

            * gcc.dg/vect/pr120687-1.c: New testcase.
            * gcc.dg/vect/pr120687-2.c: Likewise.

 gcc/testsuite/gcc.dg/vect/pr120687-1.c | 16 ++++++++++++++++
 gcc/testsuite/gcc.dg/vect/pr120687-2.c | 17 +++++++++++++++++
 gcc/tree-ssa-reassoc.cc                | 10 ++++++----
 gcc/tree-vect-slp.cc                   |  3 +++
 4 files changed, 42 insertions(+), 4 deletions(-)
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr120687-1.c
 create mode 100644 gcc/testsuite/gcc.dg/vect/pr120687-2.c
bisect found first bad commit
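
For readers unfamiliar with the term used in the commit message, the
difference between a linear reduction chain and a width-oriented
(balanced) association can be sketched as below.  This is only an
illustration under my own assumptions: the function names are
hypothetical, the code is not taken from 519.lbm_r or from the pr120687
testcases, and it is not claimed to reproduce the slowdown.

```c
#include <stddef.h>

/* Linear form: every addition feeds the next one, so the accumulator
   `sum` flows through a single chain of statements per iteration.
   This is the shape the commit message calls a reduction chain.  */
unsigned sum4_chain(const unsigned *a, const unsigned *b,
                    const unsigned *c, const unsigned *d, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++) {
        sum = sum + a[i];
        sum = sum + b[i];
        sum = sum + c[i];
        sum = sum + d[i];
    }
    return sum;
}

/* Balanced form: the four loads are paired up first, which shortens
   the dependency chain for scalar code but no longer presents the
   accumulator as one linear chain.  Unsigned arithmetic is used so
   that both associations are guaranteed to give the same result.  */
unsigned sum4_balanced(const unsigned *a, const unsigned *b,
                       const unsigned *c, const unsigned *d, size_t n)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (a[i] + b[i]) + (c[i] + d[i]);
    return sum;
}
```

Both functions compute the same value; they differ only in how the
additions are associated, which is the property the reassoc pass
changes.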


This is a ~4% regression against GCC 15. See the comparison
here:

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=1077.477.0&plot.1=1207.477.0&plot.2=286.477.0&;


Btw, r16-2601 also introduces an 18-30% speedup with -Ofast -march=native -flto
(that is, the same options without PGO).

https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=287.477.0

I wondered whether it helps avoid the spill from pr120941.  However,
that's not the case: I still see the spill in the binary.

Anyway, the commit seems like a net gain performance-wise.  I'm just reporting
that there may be some room to improve the PGO case.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
