https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90364
            Bug ID: 90364
           Summary: 521.wrf_r is 9.5 % slower with PGO on Zen CPUs at -Ofast and
                    native march/mtune
           Product: gcc
           Version: 9.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: gcov-profile
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jamborm at gcc dot gnu.org
                CC: hubicka at gcc dot gnu.org, marxin at gcc dot gnu.org
            Blocks: 26163
  Target Milestone: ---
              Host: x86_64-linux
            Target: x86_64-linux

In my measurements using trunk r270639, profile guided optimization
(PGO) regresses the run time of 521.wrf_r from SPEC FPrate 2017 by
9.5% (and even LTO+PGO is 7% slower than when using neither) when
compiling with -Ofast -march=native -mtune=native.

My observations are consistent with data from LNT:
https://lnt.opensuse.org/db_default/v4/SPEC/graph?plot.0=33.548.0&plot.1=15.548.0&plot.2=12.548.0&plot.3=17.548.0&

Perf stat and report for the two runs are:

Non-PGO (fast):

     304790.490558      task-clock:u (msec)        #    0.994 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
            292908      page-faults:u              #    0.961 K/sec
      962209421444      cycles:u                   #
       24018297656      stalled-cycles-frontend:u  #    2.50% frontend cycles idle     (83.35%)
      142992971234      stalled-cycles-backend:u   #   14.86% backend cycles idle      (83.33%)
     1792410646274      instructions:u             #    1.86  insn per cycle
                                                   #    0.08  stalled cycles per insn  (83.34%)
      185705451528      branches:u                 #  609.289 M/sec                    (83.34%)
        2087790818      branch-misses:u            #    1.12% of all branches          (83.35%)

     306.542849367 seconds time elapsed

# Samples: 1M of event 'cycles'
# Event count (approx.): 964214205064
#
# Overhead    Samples  Shared Object    Symbol
# ........  .........  ...............  ..............................................................
#
     7.02%      85562  libm-2.29.so     __logf_fma
     5.99%      72982  libm-2.29.so     __powf_fma
     5.44%      66794  wrf_r_peak.std   __module_advect_em_MOD_advect_scalar_pd
     5.21%      63576  libm-2.29.so     __atanf
     4.30%      52426  libmvec-2.29.so  _ZGVbN4v_expf_sse4
     4.04%      49253  wrf_r_peak.std   __module_mp_wsm5_MOD_wsm52d
     3.93%      47888  wrf_r_peak.std   __module_mp_wsm5_MOD_nislfv_rain_plm
     2.97%      36505  wrf_r_peak.std   __module_small_step_em_MOD_advance_uv
     2.67%      32786  wrf_r_peak.std   __module_small_step_em_MOD_advance_mu_t
     2.63%      32334  wrf_r_peak.std   __module_small_step_em_MOD_advance_w
     2.52%      30796  wrf_r_peak.std   __module_mp_wsm5_MOD_slope_wsm5
     2.52%      30948  wrf_r_peak.std   __module_advect_em_MOD_advect_scalar
     2.34%      28718  libc-2.29.so     __memset_avx2_unaligned_erms
     2.32%      28336  wrf_r_peak.std   __module_bl_ysu_MOD_ysu2d
     2.18%      26624  wrf_r_peak.std   psim_unstable
     2.09%      25667  libmvec-2.29.so  _ZGVbN4vv_powf_sse4
     2.08%      25418  libmvec-2.29.so  _ZGVbN4v_logf_sse4
     1.87%      22858  wrf_r_peak.std   psih_unstable
     1.65%      20244  wrf_r_peak.std   __module_big_step_utilities_em_MOD_phy_prep
     1.56%      19006  wrf_r_peak.std   __module_ra_rrtm_MOD_rtrn
     1.40%      17198  wrf_r_peak.std   __module_bc_MOD_set_physical_bc3d
     1.25%      15339  wrf_r_peak.std   __module_big_step_utilities_em_MOD_horizontal_diffusion
     1.22%      15029  libc-2.29.so     __memmove_avx_unaligned_erms
     1.22%      14833  libm-2.29.so     __expf_fma
     1.15%      14101  wrf_r_peak.std   __module_small_step_em_MOD_calc_p_rho
     1.08%      13312  wrf_r_peak.std   __module_big_step_utilities_em_MOD_horizontal_pressure_gradient
     1.00%      12345  wrf_r_peak.std   __module_big_step_utilities_em_MOD_rhs_ph

PGO (slow):

     325215.123075      task-clock:u (msec)        #    0.993 CPUs utilized
                 0      context-switches:u         #    0.000 K/sec
                 0      cpu-migrations:u           #    0.000 K/sec
            302283      page-faults:u              #    0.929 K/sec
     1026804177693      cycles:u                   #    3.157 GHz                      (83.33%)
       29812608056      stalled-cycles-frontend:u  #    2.90% frontend cycles idle     (83.35%)
      126544641902      stalled-cycles-backend:u   #   12.32% backend cycles idle      (83.34%)
     1968104678527      instructions:u             #    1.92  insn per cycle
                                                   #    0.06  stalled cycles per insn  (83.35%)
      199828338783      branches:u                 #  614.450 M/sec                    (83.34%)
        2418851470      branch-misses:u            #    1.21% of all branches          (83.35%)

     327.574599867 seconds time elapsed

# Samples: 1M of event 'cycles'
# Event count (approx.): 1029158853895
#
# Overhead    Samples  Shared Object   Symbol
# ........  .........  ..............  .......................................................
#
     9.94%     129149  libm-2.29.so    __powf_fma
     6.77%      87916  libm-2.29.so    __logf_fma
     6.22%      80774  wrf_r_peak.pgo  __module_mp_wsm5_MOD_nislfv_rain_plm
     5.50%      71494  wrf_r_peak.pgo  __module_mp_wsm5_MOD_wsm52d
     5.16%      67454  wrf_r_peak.pgo  __module_advect_em_MOD_advect_scalar_pd
     4.87%      63208  libm-2.29.so    __atanf
     4.13%      53689  libm-2.29.so    __expf_fma
     3.99%      51813  wrf_r_peak.pgo  __module_bl_ysu_MOD_ysu2d
     2.76%      36137  wrf_r_peak.pgo  __module_small_step_em_MOD_advance_uv
     2.51%      32915  wrf_r_peak.pgo  __module_small_step_em_MOD_advance_w
     2.30%      30061  wrf_r_peak.pgo  __module_advect_em_MOD_advect_scalar
     2.05%      26646  wrf_r_peak.pgo  __module_ra_rrtm_MOD_rtrn
     1.99%      26017  wrf_r_peak.pgo  __module_small_step_em_MOD_advance_mu_t
     1.93%      25130  wrf_r_peak.pgo  psim_unstable
     1.91%      24995  libc-2.29.so    __memset_avx2_unaligned_erms
     1.69%      21998  wrf_r_peak.pgo  psih_unstable
     1.41%      18434  wrf_r_peak.pgo  __module_bc_MOD_set_physical_bc3d
     1.25%      16375  wrf_r_peak.pgo  __module_big_step_utilities_em_MOD_phy_prep
     1.18%      15384  wrf_r_peak.pgo  __module_big_step_utilities_em_MOD_horizontal_diffusion
     1.04%      13570  wrf_r_peak.pgo  __module_small_step_em_MOD_calc_p_rho

Note that calls to libmvec are gone with PGO. However, they could
only be generated because the system I used had the necessary Fortran
include file, which IIUC the LNT worker did not have until last week
and yet the regression can be seen in earlier data too.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26163
[Bug 26163] [meta-bug] missed optimization in SPEC (2k17, 2k and 2k6 and 95)
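
For readers who want to compare the two perf stat runs quoted in this report counter by counter, the following is a minimal sketch, not part of the original report. The counter values are copied from the perf stat output; the dictionary names and the helpers `percent_change` and `compare` are hypothetical and exist only for this illustration. It covers only this single pair of perf runs, so its elapsed-time delta need not match the 9.5% figure from the reporter's own measurements.

```python
# Counter values copied from the two perf stat runs quoted in the report.
# The names below are illustrative only; they do not come from perf or GCC.
non_pgo = {
    "elapsed_s": 306.542849367,
    "cycles": 962209421444,
    "instructions": 1792410646274,
    "branches": 185705451528,
    "branch_misses": 2087790818,
}
pgo = {
    "elapsed_s": 327.574599867,
    "cycles": 1026804177693,
    "instructions": 1968104678527,
    "branches": 199828338783,
    "branch_misses": 2418851470,
}


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


def compare(base, other):
    """Percent change of every counter in `base` relative to `other`."""
    return {name: percent_change(base[name], other[name]) for name in base}


if __name__ == "__main__":
    # Positive values mean the PGO build needed more of that resource.
    for name, delta in compare(non_pgo, pgo).items():
        print(f"{name:>14}: {delta:+.2f}%")
```

Running the script prints one signed percentage per counter, which makes it easy to see whether the PGO build executes more instructions or merely runs them less efficiently.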