https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94369
            Bug ID: 94369
           Summary: 505.mcf_r is 6-7% slower at -Ofast -march=native with
                    PGO+LTO than with just LTO
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: gcov-profile
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: marxin at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

SPEC 2017 INTrate benchmark 505.mcf_r, when compiled with options
-Ofast -march=native -mtune=native, is 6-7% slower when compiled with
both PGO and LTO than when built with just LTO. I have observed this
on both AMD Zen2 (7%) and Intel Cascade Lake (6%) server CPUs. The
train run cannot be very bad because without LTO, PGO improves
run-time by 15% on both systems. This is with master revision
26b3e568a60.

Profiling results (from an AMD CPU):

LTO:

Overhead    Samples  Shared Object    Symbol
........  .........  ...............  ........................
  39.53%     518450  mcf_r_peak.mine  spec_qsort.constprop.0
  22.13%     289745  mcf_r_peak.mine  master.constprop.0
  19.00%     248641  mcf_r_peak.mine  replace_weaker_arc
   9.37%     122669  mcf_r_peak.mine  main
   8.60%     112601  mcf_r_peak.mine  spec_qsort.constprop.1

PGO+LTO:

Overhead    Samples  Shared Object    Symbol
........  .........  ...............  .......................................
  40.13%     562770  mcf_r_peak.mine  spec_qsort.constprop.0
  21.68%     303543  mcf_r_peak.mine  master.constprop.0
  18.24%     255236  mcf_r_peak.mine  replace_weaker_arc
  10.32%     144433  mcf_r_peak.mine  main
   8.07%     112775  mcf_r_peak.mine  arc_compare

Perhaps I should note that we have patched qsort in the benchmark to
work with strict aliasing even with LTO. But the performance gap is
there also with -fno-strict-aliasing.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)