https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119069

            Bug ID: 119069
           Summary: 519.lbm_r runs 60% slower with -Ofast -flto
                    -march=znver5 on an AMD Zen5 machine than when
                    compiled with GCC 14 (or with -march=znver4)
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org, rguenth at gcc dot gnu.org,
                    venkataramanan.kumar at amd dot com,
                    vivekanand.devworks at gmail dot com
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux-gnu
            Target: x86_64-linux-gnu

When evaluating the performance of GCC 15 in development (revision
r15-7587-g9335ff73a509a1), we noticed that the binary it produces for
the 519.lbm_r benchmark from SPEC FPrate 2017 with -Ofast -flto
-march=native runs 60% slower than the binary GCC 14 produces with
the same options.  It also runs 60% slower than when compiled with
-march=znver4.
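
For context: as the profiles below show, practically all of the
runtime is in main, into which LTO inlines the benchmark's
stream-collide kernel.  The following is only a rough sketch of the
general shape of such a kernel, with placeholder names and constants
rather than the actual SPEC source, just to illustrate the kind of
code involved (many double loads per cell, a burst of FP arithmetic,
strided stores, and a per-cell obstacle test):

/* Illustrative sketch only -- not the SPEC CPU 2017 source.  The real
   kernel works on a 3D D3Q19 lattice; Q, the grid layout and the
   "equilibrium" computation here are placeholders.  */
#define Q        19                       /* lattice directions (placeholder) */
#define N_CELLS  (100 * 100 * 130)

static void
stream_collide_sketch (const double *restrict src, double *restrict dst,
                       const unsigned char *restrict obstacle, double omega)
{
  for (long c = 0; c < N_CELLS; c++)
    {
      double f[Q], rho = 0.0;

      for (int i = 0; i < Q; i++)        /* gather one cell's distributions */
        {
          f[i] = src[c * Q + i];
          rho += f[i];
        }

      for (int i = 0; i < Q; i++)
        {
          double feq = rho / Q;          /* stand-in for the real equilibrium */
          dst[c * Q + i] = obstacle[c]
                           ? f[i]                          /* bounce-back    */
                           : f[i] - omega * (f[i] - feq);  /* relax + stream */
        }
    }
}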

Mindlessly bisecting the slow-down led me to r15-4787-gacba8b3d8dec01
(Kugan Vivekanandarajah: [PATCH] Fix SLP when ifcvt versioned loop is
not vectorized), but there may be nothing wrong with that particular
revision.

The output of -fopt-info-vec at this revision and at the one just
before it is the same:

lbm.c:58:21: optimized: basic block part vectorized using 64 byte vectors
lbm.c:451:23: optimized: basic block part vectorized using 64 byte vectors
lbm.c:469:23: optimized: basic block part vectorized using 16 byte vectors
lbm.c:549:23: optimized: basic block part vectorized using 64 byte vectors
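
These messages come from GCC's basic-block (SLP) vectorizer, which
packs adjacent scalar operations in straight-line code into vector
operations; with -march=znver5 it can use 512-bit (64-byte) vectors.
As a rough illustration (not code taken from lbm.c), a group of
contiguous double operations like the following is the kind of
pattern that can become a single 64-byte vector multiply and store:

/* Rough illustration of straight-line code the basic-block (SLP)
   vectorizer can handle -- not taken from lbm.c.  Eight adjacent
   double stores like these can become one 64-byte (zmm) store when
   compiling with -Ofast -march=znver5.  */
void fill_cell (double *dst, const double *src, double w)
{
  dst[0] = src[0] * w;
  dst[1] = src[1] * w;
  dst[2] = src[2] * w;
  dst[3] = src[3] * w;
  dst[4] = src[4] * w;
  dst[5] = src[5] * w;
  dst[6] = src[6] * w;
  dst[7] = src[7] * w;
}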


The (fast) revision preceding the one I bisected to gives the
following perf stat and perf report:

 Performance counter stats for 'taskset -c 0 specinvoke':

          72577.07 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
              2821      page-faults:u                    #   38.869 /sec
      300151137925      cycles:u                         #    4.136 GHz
        5398418998      stalled-cycles-frontend:u        #    1.80% frontend cycles idle
     1010879145312      instructions:u                   #    3.37  insn per cycle
                                                         #    0.01  stalled cycles per insn
       12020385210      branches:u                       #  165.622 M/sec
          10175147      branch-misses:u                  #    0.08% of all branches

      72.599511146 seconds time elapsed

      72.547000000 seconds user
       0.027927000 seconds sys


# Total Lost Samples: 0
#
# Samples: 294K of event 'cycles:Pu'
# Event count (approx.): 300657980464
#
# Overhead       Samples  Command          Shared Object                Symbol
# ........  ............  ...............  ...........................  ....................................
#
    99.44%        292503  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] main
     0.53%          1557  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_showGridStatistics
     0.01%            22  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f4
     0.01%            85  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeGrid
     0.01%            21  lbm_r_peak.mine  libc.so.6                    [.] _IO_getc
     0.00%             8  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f0
     0.00%             7  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeSpecialCellsForLDC
     0.00%             4  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbcd89bf5
     0.00%             5  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_loadObstacleFile


When using the first slow revision I got the following; the
instruction and branch counts are essentially unchanged, but there
are noticeably more mispredicted branches (0.36% vs. 0.08%) and the
IPC drops from 3.37 to 2.12:

Performance counter stats for 'taskset -c 0 specinvoke':

         114086.14 msec task-clock:u                     #    1.000 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
              3227      page-faults:u                    #   28.286 /sec
      471846517904      cycles:u                         #    4.136 GHz
        5174868157      stalled-cycles-frontend:u        #    1.10% frontend cycles idle
     1000081416714      instructions:u                   #    2.12  insn per cycle
                                                         #    0.01  stalled cycles per insn
       12020398015      branches:u                       #  105.362 M/sec
          43419038      branch-misses:u                  #    0.36% of all branches

     114.119386152 seconds time elapsed

     114.053784000 seconds user
       0.029592000 seconds sys

# Total Lost Samples: 0
#
# Samples: 462K of event 'cycles:Pu'
# Event count (approx.): 472701306944
#
# Overhead   Samples  Command          Shared Object                Symbol
# ........  ........  ...............  ...........................  ....................................
#
    99.64%    460585  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] main
     0.33%      1546  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_showGridStatistics
     0.01%        37  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f0
     0.01%        29  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f4
     0.01%       103  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeGrid
     0.00%        19  lbm_r_peak.mine  libc.so.6                    [.] _IO_getc
     0.00%         8  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbcd89bd4
     0.00%         7  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeSpecialCellsForLDC


According to Richi, this does not reproduce on Zen4 with -march=znver4
-mtune=znver5.

According to Honza, -fno-schedule-insns2 makes the regression go
away, so he suspects some bad luck with the micro-op cache.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
