https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
Bug ID: 119298
Summary: 538.imagick_r is faster when compiled with GCC 14.2 and -Ofast -flto -march=native than with master on Zen5
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jamborm at gcc dot gnu.org
CC: hubicka at gcc dot gnu.org, rguenth at gcc dot gnu.org
Blocks: 26163
Target Milestone: ---
Host: x86_64-linux-gnu
Target: x86_64-linux-gnu

The SPEC FPrate 2017 538.imagick_r benchmark is faster when compiled with
GCC 14.2 and -Ofast -flto -march=native than with trunk/master on Zen 5 CPUs.

The regression was introduced in r15-3441-g4292297a0f938f (Jan Hubicka:
Zen5 tuning part 5: update instruction latencies in x86-tune-costs).

The culprit is the modification of the "cost of ADDSS/SD SUBSS/SD insns":
bumping it back to COSTS_N_INSNS(3) (instead of COSTS_N_INSNS(2)) makes the
regression go away. Nevertheless, Honza claims the updated cost should be
correct.

Perf stat of the slow run:

      116866.57 msec task-clock:u               #    1.000 CPUs utilized
              0      context-switches:u         #    0.000 /sec
              0      cpu-migrations:u           #    0.000 /sec
           8347      page-faults:u              #   71.423 /sec
   484499860679      cycles:u                   #    4.146 GHz
    21879349058      stalled-cycles-frontend:u  #    4.52% frontend cycles idle
  2030074730877      instructions:u             #    4.19  insn per cycle
                                                #    0.01  stalled cycles per insn
   224436542157      branches:u                 #    1.920 G/sec
     1716173329      branch-misses:u            #    0.76% of all branches

  116.881252465 seconds time elapsed

  116.808499000 seconds user
    0.057350000 seconds sys

Perf report of the slow run (annotated assembly attached):

# Samples: 470K of event 'cycles:Pu'
# Event count (approx.): 484158470552
#
# Overhead  Samples  Command          Shared Object                    Symbol
# ........  .......  ...............  ...............................  .............................
#
    44.71%   210348  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.76%   135308  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus
    25.50%   120106  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply

Perf stat of the fast run (with just the one cost reverted):

 Performance counter stats for 'taskset -c 0 specinvoke':

      108805.48 msec task-clock:u               #    1.000 CPUs utilized
              0      context-switches:u         #    0.000 /sec
              0      cpu-migrations:u           #    0.000 /sec
           8312      page-faults:u              #   76.393 /sec
   450981792793      cycles:u                   #    4.145 GHz
    22610930072      stalled-cycles-frontend:u  #    5.01% frontend cycles idle
  1933965750890      instructions:u             #    4.29  insn per cycle
                                                #    0.01  stalled cycles per insn
   224433996552      branches:u                 #    2.063 G/sec
     1721069495      branch-misses:u            #    0.77% of all branches

  108.819368844 seconds time elapsed

  108.763582000 seconds user
    0.041314000 seconds sys

Perf report of the fast run (annotated assembly attached):

# Samples: 427K of event 'cycles:Pu'
# Event count (approx.): 439380128661
#
# Overhead  Samples  Command          Shared Object                    Symbol
# ........  .......  ...............  ...............................  .............................
#
    44.53%   190164  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MeanShiftImage
    28.13%   120243  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] MorphologyApply
    26.20%   111906  imagick_r_peak.  imagick_r_peak.mine-lto-nat-m64  [.] GetVirtualPixelsFromNexus

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
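
To put a number on the regression, the two perf stat outputs in this report can be compared counter by counter. The following is only a minimal sketch: the counter values are copied from the report, while the dictionary names and the pct_increase helper are made up for illustration and are not part of any tool.

```python
# Counter values copied verbatim from the two perf stat outputs in this report.
slow = {
    "cycles": 484499860679,
    "instructions": 2030074730877,
    "branches": 224436542157,
    "seconds": 116.881252465,
}
fast = {
    "cycles": 450981792793,
    "instructions": 1933965750890,
    "branches": 224433996552,
    "seconds": 108.819368844,
}


def pct_increase(slow_value, fast_value):
    """Percentage by which the slow run exceeds the fast run (hypothetical helper)."""
    return (slow_value - fast_value) / fast_value * 100.0


# Relative increase of every counter in the slow run over the fast run.
deltas = {name: pct_increase(slow[name], fast[name]) for name in slow}

# Instructions per cycle, recomputed from the raw counters.
ipc = {
    "slow": slow["instructions"] / slow["cycles"],
    "fast": fast["instructions"] / fast["cycles"],
}

for name, value in deltas.items():
    print(f"{name:>12}: {value:+.3f}%")
for run, value in ipc.items():
    print(f"IPC ({run}): {value:.2f}")
```

By these figures the slow run needs roughly 7.4% more cycles and about 5% more instructions, while the branch count is essentially unchanged. That would be consistent with the difference lying in straight-line code rather than in control flow, although the annotated assembly is what would confirm it.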