https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119147

--- Comment #4 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
Re-benchmarked current trunk with -flto -Ofast -march=native (base) and -flto
-Ofast -march=native + PGO (peak) on znver3:
                       Estimated                       Estimated
                 Base     Base        Base        Peak     Peak        Peak
Benchmarks       Copies  Run Time     Rate        Copies  Run Time     Rate
--------------- -------  ---------  ---------    -------  ---------  ---------
525.x264_r            1       87.1       20.1  *       1        101       17.3  

-flto -Ofast profile is:
   7.67%  x264_r_base.tru  [.] x264_pixel_satd_8x4.lto_priv.0         ◆
   4.80%  x264_r_base.tru  [.] get_ref.lto_priv.0                     ▒
   4.08%  x264_r_base.tru  [.] mc_chroma.lto_priv.0                   ▒
   1.58%  x264_r_base.tru  [.] x264_me_search_ref                     ▒
   1.41%  x264_r_base.tru  [.] pixel_hadamard_ac                      ▒
   1.31%  x264_r_base.tru  [.] x264_pixel_satd_4x4.lto_priv.0         ▒
   1.17%  x264_r_base.tru  [.] sub4x4_dct.lto_priv.0                  ▒
   1.11%  x264_r_base.tru  [.] refine_subpel.lto_priv.0               ▒
   1.10%  x264_r_base.tru  [.] quant_4x4.lto_priv.0                   ▒
   0.98%  x264_r_base.tru  [.] quant_trellis_cabac.lto_priv.0         ▒
   0.77%  x264_r_base.tru  [.] hpel_filter.lto_priv.0                 ▒
   0.68%  x264_r_base.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0       ▒
   0.56%  x264_r_base.tru  [.] frame_init_lowres_core.lto_priv.0      ▒
   0.55%  x264_r_base.tru  [.] x264_pixel_sad_x4_16x16.lto_priv.0     ▒
   0.54%  x264_r_base.tru  [.] x264_pixel_sad_16x16.lto_priv.0        ▒

While with PGO it is:
   5.04%  x264_r_peak.tru  [.] refine_subpel.lto_priv.0                    ◆
   4.42%  x264_r_peak.tru  [.] x264_pixel_satd_8x8.constprop.1             ▒
   3.66%  x264_r_peak.tru  [.] mc_chroma.constprop.1                       ▒
   3.45%  x264_r_peak.tru  [.] x264_pixel_satd_16x16.lto_priv.0            ▒
   2.78%  x264_r_peak.tru  [.] x264_me_search_ref                          ▒
   2.13%  x264_r_peak.tru  [.] x264_mb_analyse_intra.lto_priv.0            ▒
   2.06%  x264_r_peak.tru  [.] x264_macroblock_encode                      ▒
   1.43%  x264_r_peak.tru  [.] x264_slicetype_mb_cost                      ▒
   1.38%  x264_r_peak.tru  [.] mc_chroma.lto_priv.0                        ▒
   1.22%  x264_r_peak.tru  [.] x264_pixel_hadamard_ac_16x16.constprop.0    ▒
   0.99%  x264_r_peak.tru  [.] x264_mb_encode_8x8_chroma                   ▒
   0.96%  x264_r_peak.tru  [.] quant_trellis_cabac.lto_priv.0              ▒
   0.92%  x264_r_peak.tru  [.] x264_pixel_sad_x4_8x8.lto_priv.0            ▒
   0.77%  x264_r_peak.tru  [.] hpel_filter.lto_priv.0                      ▒
   0.77%  x264_r_peak.tru  [.] x264_mb_mc_0xywh                            ▒
   0.73%  x264_r_peak.tru  [.] x264_pixel_satd_4x4.constprop.1             ▒

We speculatively inline get_ref into refine_subpel (it is called indirectly,
but the pointer always has the same target).  Similarly we constant propagate
stride into mc_chroma. This seems good, but the total time spent in the
mc_chroma clones goes up. Inlining decisions for pixel_satd differ but seem
fine.
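
For readers unfamiliar with speculative inlining of an indirect call, here is a minimal sketch of the shape of the transformation. The names get_ref_like, other_ref, call_indirect and call_speculated are hypothetical stand-ins, not the x264 functions or their signatures, and the guard is written by hand here while the compiler derives it from the profile.

```c
#include <stdint.h>

/* Hypothetical function-pointer type; not the x264 signature. */
typedef int (*ref_fn)(const uint8_t *src, int stride);

/* Stand-in for the target the profile sees on every call. */
static int get_ref_like(const uint8_t *src, int stride)
{
    return src[0] + src[stride];
}

/* Some other possible target, never seen in the profile. */
static int other_ref(const uint8_t *src, int stride)
{
    return src[0] - src[stride];
}

/* What the source code says: a plain indirect call. */
int call_indirect(ref_fn fn, const uint8_t *src, int stride)
{
    return fn(src, stride);
}

/* Roughly what speculation produces: compare against the profiled
   target, run its inlined body on the hot path, and keep the
   indirect call as a fallback so behaviour is unchanged.  */
int call_speculated(ref_fn fn, const uint8_t *src, int stride)
{
    if (fn == get_ref_like)
        return src[0] + src[stride];   /* inlined body of get_ref_like */
    return fn(src, stride);            /* cold fallback */
}
```

Both variants return the same value for any target; only the hot path differs.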

The next problem is that the vectorizer turns itself off when the estimated
trip count is low.  The following hack:

diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
index 9413dcef702..8882a5dea11 100644
--- a/gcc/tree-vect-loop.cc
+++ b/gcc/tree-vect-loop.cc
@@ -2483,14 +2483,16 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
       if (estimated_niter == -1)
        estimated_niter = likely_max_stmt_executions_int (loop);
     }
-  if (estimated_niter != -1
+  if (estimated_niter != -1 && 0
       && ((unsigned HOST_WIDE_INT) estimated_niter
          < MAX (th, (unsigned) min_profitable_estimate)))
     {
       if (dump_enabled_p ())
        dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
-                        "not vectorized: estimated iteration count too "
-                        "small.\n");
+                        "not vectorized: estimated iteration count %li smaller "
+                        "than threshold %li.\n",
+                        (long) estimated_niter,
+                        (long) MAX (th, (unsigned) min_profitable_estimate));
       if (dump_enabled_p ())
        dump_printf_loc (MSG_NOTE, vect_location,
                         "not vectorized: estimated iteration count smaller "

improves the PGO rate to 18.1 (run time 96.6).
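
To make the disabled condition easier to follow, here is a simplified standalone model of the test shown in the diff. The function name is hypothetical and plain C types replace HOST_WIDE_INT; the parameter names mirror the variables in the hunk. It is only a sketch of that one condition, not of the rest of vect_analyze_loop_costing.

```c
#include <stdbool.h>

/* Hypothetical model of the check guarded by "&& 0" in the hack.
   Returns true when the loop would be rejected with
   "estimated iteration count too small".  */
bool rejected_as_too_few_iterations(long estimated_niter,
                                    unsigned th,
                                    unsigned min_profitable_estimate)
{
    /* MAX (th, min_profitable_estimate) from the diff.  */
    unsigned long limit = th > min_profitable_estimate
                          ? th : min_profitable_estimate;

    /* -1 means no estimate is available, so the check does not fire.  */
    if (estimated_niter == -1)
        return false;

    return (unsigned long) estimated_niter < limit;
}
```

With PGO the estimate comes from the profile, so a loop with a short measured trip count falls below the limit and stays scalar unless the check is bypassed.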

This speeds up mc_chroma.constprop.1 by about 50%. Unvectorized:

        │    for( int x = 0; x < i_width; x++ )              ▒
        │    dst[x] = ( cA*src[x]  + cB*src[x+1] + cC*srcp[x]▒
   0.00 │a0:┌─ movzbl (%rcx,%rax,1),%edx                     ▒
   1.69 │   │  movzbl 0x1(%rcx,%rax,1),%r14d                 ▒
   0.15 │   │  imul   %ebx,%edx                              ▒
   2.57 │   │  imul   %r10d,%r14d                            ▒
   1.95 │   │  add    %r14d,%edx                             ▒
  24.93 │   │  movzbl (%rsi,%rax,1),%r14d                    ▒
   0.65 │   │  imul   %r9d,%r14d                             ▒
   0.12 │   │  add    %r14d,%edx                             ▒
   7.48 │   │  movzbl 0x1(%rsi,%rax,1),%r14d                 ▒
   1.60 │   │  imul   %r11d,%r14d                            ▒
   0.03 │   │  lea    0x20(%rdx,%r14,1),%edx                 ▒
  16.81 │   │  sar    $0x6,%edx                              ▒
  34.78 │   │  mov    %dl,(%rdi,%rax,1)                      ▒
        │   │for( int x = 0; x < i_width; x++ )              ▒
   0.01 │   │  inc    %rax                                   ▒
   0.01 │   ├──cmp    %rax,%r8                               ▒
   0.02 │   └──jne    a0                                     ▒
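
For reference, the scalar loop above corresponds to roughly the following C. The source line in the listing is cut off, so the tail of the expression (the cD term, the +32 rounding and the shift by 6) is reconstructed from the lea 0x20 and sar $0x6 instructions; the function name chroma_row and its parameter list are assumptions made for this sketch.

```c
#include <stdint.h>

/* One row of the weighted 2x2 chroma interpolation.  src and srcp are
   two adjacent source rows; each must have i_width + 1 readable bytes.
   Reconstructed from the annotated assembly, not copied from x264.  */
void chroma_row(uint8_t *dst, const uint8_t *src, const uint8_t *srcp,
                int i_width, int cA, int cB, int cC, int cD)
{
    for (int x = 0; x < i_width; x++)
        dst[x] = (cA * src[x]  + cB * src[x + 1]
                + cC * srcp[x] + cD * srcp[x + 1] + 32) >> 6;
}
```

Each iteration does four byte loads, four multiplies and one byte store, which matches the movzbl/imul/mov sequence in the listing.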

But even with this hack we still don't reach the speed of the build without
PGO (run time 96.6 vs. 87.1)...
