https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119147
--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---

Re-benchmarked current trunk with -flto -Ofast -march=native (base) and
-flto -Ofast -march=native + PGO (peak) on znver3:

                       Estimated                       Estimated
                 Base     Base        Base        Peak     Peak        Peak
Benchmarks       Copies  Run Time     Rate        Copies  Run Time     Rate
--------------- -------  ---------  ---------    -------  ---------  ---------
525.x264_r            1       87.1       20.1  *       1        101       17.3

The -flto -Ofast profile is:

   7.67%  x264_r_base.tru  [.] x264_pixel_satd_8x4.lto_priv.0
   4.80%  x264_r_base.tru  [.] get_ref.lto_priv.0
   4.08%  x264_r_base.tru  [.] mc_chroma.lto_priv.0
   1.58%  x264_r_base.tru  [.] x264_me_search_ref
   1.41%  x264_r_base.tru  [.] pixel_hadamard_ac
   1.31%  x264_r_base.tru  [.] x264_pixel_satd_4x4.lto_priv.0
   1.17%  x264_r_base.tru  [.] sub4x4_dct.lto_priv.0
   1.11%  x264_r_base.tru  [.] refine_subpel.lto_priv.0
   1.10%  x264_r_base.tru  [.] quant_4x4.lto_priv.0
   0.98%  x264_r_base.tru  [.] quant_trellis_cabac.lto_priv.0
   0.77%  x264_r_base.tru  [.] hpel_filter.lto_priv.0
   0.68%  x264_r_base.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0
   0.56%  x264_r_base.tru  [.] frame_init_lowres_core.lto_priv.0
   0.55%  x264_r_base.tru  [.] x264_pixel_sad_x4_16x16.lto_priv.0
   0.54%  x264_r_base.tru  [.] x264_pixel_sad_16x16.lto_priv.0

While with PGO:

   5.04%  x264_r_peak.tru  [.] refine_subpel.lto_priv.0
   4.42%  x264_r_peak.tru  [.] x264_pixel_satd_8x8.constprop.1
   3.66%  x264_r_peak.tru  [.] mc_chroma.constprop.1
   3.45%  x264_r_peak.tru  [.] x264_pixel_satd_16x16.lto_priv.0
   2.78%  x264_r_peak.tru  [.] x264_me_search_ref
   2.13%  x264_r_peak.tru  [.] x264_mb_analyse_intra.lto_priv.0
   2.06%  x264_r_peak.tru  [.] x264_macroblock_encode
   1.43%  x264_r_peak.tru  [.] x264_slicetype_mb_cost
   1.38%  x264_r_peak.tru  [.] mc_chroma.lto_priv.0
   1.22%  x264_r_peak.tru  [.] x264_pixel_hadamard_ac_16x16.constprop.0
   0.99%  x264_r_peak.tru  [.] x264_mb_encode_8x8_chroma
   0.96%  x264_r_peak.tru  [.] quant_trellis_cabac.lto_priv.0
   0.92%  x264_r_peak.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0
   0.77%  x264_r_peak.tru  [.] hpel_filter.lto_priv.0
   0.77%  x264_r_peak.tru  [.] x264_mb_mc_0xywh
   0.73%  x264_r_peak.tru  [.] x264_pixel_satd_4x4.constprop.1

We speculatively inline get_ref into refine_subpel (it is called indirectly,
but the pointer is always the same). Similarly, we constant-propagate the
stride into mc_chroma. This seems good, but the total time spent in the
mc_chroma clones grows (3.66% + 1.38% of the peak profile vs. 4.08% of the
base profile). Inlining decisions on pixel_satd differ but seem fine.

The next problem is that the vectorizer turns itself off when the estimated
trip count is low. The following hack:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 9413dcef702..8882a5dea11 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2483,14 +2483,16 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
       if (estimated_niter == -1)
         estimated_niter = likely_max_stmt_executions_int (loop);
     }
-  if (estimated_niter != -1
+  if (estimated_niter != -1 && 0
       && ((unsigned HOST_WIDE_INT) estimated_niter
           < MAX (th, (unsigned) min_profitable_estimate)))
     {
       if (dump_enabled_p ())
         dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                         "not vectorized: estimated iteration count too "
-                         "small.\n");
+                         "not vectorized: estimated iteration count %li smaller "
+                         "than threshold %li.\n",
+                         (long) estimated_niter,
+                         (long) MAX (th, (unsigned) min_profitable_estimate));
       if (dump_enabled_p ())
         dump_printf_loc (MSG_NOTE, vect_location,
                          "not vectorized: estimated iteration count smaller "

improves the PGO score to 18.1 (run time 96.6). This speeds up
mc_chroma.constprop.1 by about 50%.
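The condition that the hack disables can be sketched in isolation as follows. This is only an illustration in plain C: the function name rejects_low_trip_count and the fixed-width integer types are assumptions made here, while the real code in vect_analyze_loop_costing uses HOST_WIDE_INT and GCC's MAX macro. It shows that the loop is rejected only when an estimate exists (not -1) and is below the larger of th and min_profitable_estimate.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the check in vect_analyze_loop_costing.
   Returns true when the loop would be rejected because its estimated
   iteration count is below the profitability threshold.
   estimated_niter == -1 means that no estimate is available.  */
bool
rejects_low_trip_count (int64_t estimated_niter, unsigned th,
                        unsigned min_profitable_estimate)
{
  /* Same role as MAX (th, (unsigned) min_profitable_estimate).  */
  unsigned threshold = th > min_profitable_estimate
                       ? th : min_profitable_estimate;

  return estimated_niter != -1
         && (uint64_t) estimated_niter < threshold;
}
```

Adding "&& 0" to the first operand, as the hack does, makes this condition always false, so the loop proceeds to vectorization regardless of the estimate.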

Unvectorized:

       │     for( int x = 0; x < i_width; x++ )
       │         dst[x] = ( cA*src[x]  + cB*src[x+1] + cC*srcp[x]
  0.00 │a0:┌─ movzbl (%rcx,%rax,1),%edx
  1.69 │   │  movzbl 0x1(%rcx,%rax,1),%r14d
  0.15 │   │  imul   %ebx,%edx
  2.57 │   │  imul   %r10d,%r14d
  1.95 │   │  add    %r14d,%edx
 24.93 │   │  movzbl (%rsi,%rax,1),%r14d
  0.65 │   │  imul   %r9d,%r14d
  0.12 │   │  add    %r14d,%edx
  7.48 │   │  movzbl 0x1(%rsi,%rax,1),%r14d
  1.60 │   │  imul   %r11d,%r14d
  0.03 │   │  lea    0x20(%rdx,%r14,1),%edx
 16.81 │   │  sar    $0x6,%edx
 34.78 │   │  mov    %dl,(%rdi,%rax,1)
       │   │for( int x = 0; x < i_width; x++ )
  0.01 │   │  inc    %rax
  0.01 │   ├──cmp    %rax,%r8
  0.02 │   └──jne    a0

But we still don't get the same speed as without PGO...
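
For reference, the scalar loop in the annotated listing can be reconstructed as the C sketch below. The source line shown by perf is truncated, so the cD*srcp[x+1] term and the "+ 32" rounding are inferred here from the assembly (four imul instructions, lea 0x20(...), sar $0x6) and are not quoted from the x264 source. The function name chroma_row and its signature are hypothetical.

```c
#include <stdint.h>

/* One row of the weighted four-tap blend performed by the loop in
   the listing.  src and srcp point at two source rows; each must have
   at least i_width + 1 readable bytes.  The cD term and the rounding
   constant 32 are inferred from the assembly, not from the source.  */
void
chroma_row (uint8_t *dst, const uint8_t *src, const uint8_t *srcp,
            int i_width, int cA, int cB, int cC, int cD)
{
  for (int x = 0; x < i_width; x++)
    dst[x] = (cA * src[x] + cB * src[x + 1]
              + cC * srcp[x] + cD * srcp[x + 1] + 32) >> 6;
}
```

Each iteration does four byte loads, four multiplies and one byte store, with no dependency between iterations, which is why the loop is a natural vectorization candidate once the trip-count check no longer rejects it.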