https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119069
Bug ID:     119069
Summary:    519.lbm_r runs 60% slower with -Ofast -flto -march=znver5 on an
            AMD Zen5 machine than when compiled with GCC 14 (or with
            -march=znver4)
Product:    gcc
Version:    15.0
Status:     UNCONFIRMED
Severity:   normal
Priority:   P3
Component:  tree-optimization
Assignee:   unassigned at gcc dot gnu.org
Reporter:   jamborm at gcc dot gnu.org
CC:         hubicka at gcc dot gnu.org, rguenth at gcc dot gnu.org,
            venkataramanan.kumar at amd dot com,
            vivekanand.devworks at gmail dot com
Blocks:     26163
Target Milestone: ---
Host:       x86_64-linux-gnu
Target:     x86_64-linux-gnu

When evaluating the performance of GCC 15 in development (revision
r15-7587-g9335ff73a509a1), we noticed that the binary it produced for the
519.lbm_r SPEC FPrate 2017 benchmark, compiled with -Ofast -flto
-march=native, runs 60% slower than when the benchmark is compiled with the
same options with GCC 14. It also runs 60% slower than when using
-march=znver4.

Mindlessly bisecting the slow-down led me to r15-4787-gacba8b3d8dec01 (Kugan
Vivekanandarajah: [PATCH] Fix SLP when ifcvt versioned loop is not
vectorized), but there may be nothing wrong with that particular revision.
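For reference, the configurations being compared can be sketched as below. This is a minimal sketch only: the actual build goes through the SPEC CPU 2017 harness rather than a direct gcc invocation, and the variable names and the commented compile line are mine, not from the report.

```shell
# Hedged sketch of the two configurations compared in this report.
FAST_FLAGS="-Ofast -flto -march=znver4"   # no regression
SLOW_FLAGS="-Ofast -flto -march=znver5"   # 60% slower on the Zen5 host
                                          # (-march=native resolves to znver5 there)

# Hypothetical direct compile of the benchmark sources; the real SPEC
# harness invocation and file list differ:
# gcc $SLOW_FLAGS lbm.c main.c -lm -o lbm_r_peak

echo "slow config: $SLOW_FLAGS"
```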
The output of -fopt-info-vec at this revision and at the one just before it
is the same:

lbm.c:58:21: optimized: basic block part vectorized using 64 byte vectors
lbm.c:451:23: optimized: basic block part vectorized using 64 byte vectors
lbm.c:469:23: optimized: basic block part vectorized using 16 byte vectors
lbm.c:549:23: optimized: basic block part vectorized using 64 byte vectors

The (fast) revision preceding the one I bisected to gives the following perf
stat and perf report:

 Performance counter stats for 'taskset -c 0 specinvoke':

       72577.07 msec task-clock:u              #    1.000 CPUs utilized
              0      context-switches:u        #    0.000 /sec
              0      cpu-migrations:u          #    0.000 /sec
           2821      page-faults:u             #   38.869 /sec
   300151137925      cycles:u                  #    4.136 GHz
     5398418998      stalled-cycles-frontend:u #    1.80% frontend cycles idle
  1010879145312      instructions:u            #    3.37  insn per cycle
                                               #    0.01  stalled cycles per insn
    12020385210      branches:u                #  165.622 M/sec
       10175147      branch-misses:u           #    0.08% of all branches

   72.599511146 seconds time elapsed
   72.547000000 seconds user
    0.027927000 seconds sys

# Total Lost Samples: 0
#
# Samples: 294K of event 'cycles:Pu'
# Event count (approx.): 300657980464
#
# Overhead  Samples  Command          Shared Object                Symbol
# ........  .......  ...............  ...........................  ...........
   99.44%    292503  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] main
    0.53%      1557  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_showGridStatistics
    0.01%        22  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f4
    0.01%        85  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeGrid
    0.01%        21  lbm_r_peak.mine  libc.so.6                    [.] _IO_getc
    0.00%         8  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f0
    0.00%         7  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeSpecialCellsForLDC
    0.00%         4  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbcd89bf5
    0.00%         5  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_loadObstacleFile

When using the first slow one I got the following (with noticeably more
mispredicted branches):

 Performance counter stats for 'taskset -c 0 specinvoke':

      114086.14 msec task-clock:u              #    1.000 CPUs utilized
              0      context-switches:u        #    0.000 /sec
              0      cpu-migrations:u          #    0.000 /sec
           3227      page-faults:u             #   28.286 /sec
   471846517904      cycles:u                  #    4.136 GHz
     5174868157      stalled-cycles-frontend:u #    1.10% frontend cycles idle
  1000081416714      instructions:u            #    2.12  insn per cycle
                                               #    0.01  stalled cycles per insn
    12020398015      branches:u                #  105.362 M/sec
       43419038      branch-misses:u           #    0.36% of all branches

  114.119386152 seconds time elapsed
  114.053784000 seconds user
    0.029592000 seconds sys

# Total Lost Samples: 0
#
# Samples: 462K of event 'cycles:Pu'
# Event count (approx.): 472701306944
#
# Overhead  Samples  Command          Shared Object                Symbol
# ........  .......  ...............  ...........................  ...........
   99.64%    460585  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] main
    0.33%      1546  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_showGridStatistics
    0.01%        37  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f0
    0.01%        29  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbce015f4
    0.01%       103  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeGrid
    0.00%        19  lbm_r_peak.mine  libc.so.6                    [.] _IO_getc
    0.00%         8  lbm_r_peak.mine  [unknown]                    [k] 0xffffffffbcd89bd4
    0.00%         7  lbm_r_peak.mine  lbm_r_peak.mine-lto-nat-m64  [.] LBM_initializeSpecialCellsForLDC

According to Richi, this does not reproduce on Zen4 with -march=znver4
-mtune=znver5. According to Honza, -fno-schedule-insns2 makes the regression
go away, so he said he suspects some bad luck with the micro-op cache.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
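As a quick cross-check of the counters above, the derived percentages that perf prints can be recomputed from the raw counts. This is a small awk sketch over numbers copied verbatim from the two perf stat runs in this report; it is illustration only, not part of the original measurement.

```shell
# Recompute perf's derived metrics from the raw counters quoted above.
fast_miss=$(awk 'BEGIN { printf "%.2f", 10175147      / 12020385210  * 100 }')
slow_miss=$(awk 'BEGIN { printf "%.2f", 43419038      / 12020398015  * 100 }')
fast_ipc=$(awk  'BEGIN { printf "%.2f", 1010879145312 / 300151137925 }')
slow_ipc=$(awk  'BEGIN { printf "%.2f", 1000081416714 / 471846517904 }')

echo "branch-miss rate: fast ${fast_miss}% vs slow ${slow_miss}%"
echo "insn per cycle:   fast ${fast_ipc} vs slow ${slow_ipc}"
```

The recomputed values (0.08% vs 0.36% branch misses, 3.37 vs 2.12 insn per cycle) match perf's own derived columns: the branch count is essentially identical in both runs, but the slow binary mispredicts roughly 4x as often.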