On Tue, Nov 12, 2019 at 4:19 PM Richard Biener
<richard.guent...@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > Hi:
> >   This patch sets X86_TUNE_AVX128_OPTIMAL as the default for all AVX
> > targets, because we found there is still a performance gap between
> > 128-bit and 256-bit auto-vectorization even with the epilogue
> > vectorized.
> >   The performance impact of making avx128_optimal the default on
> > SPEC2017 with options `-march=native -funroll-loops -Ofast -flto` on
> > CLX is as follows:
> >
> >     INT rate
> >     500.perlbench_r        -0.32%
> >     502.gcc_r              -1.32%
> >     505.mcf_r              -0.12%
> >     520.omnetpp_r          -0.34%
> >     523.xalancbmk_r        -0.65%
> >     525.x264_r              2.23%
> >     531.deepsjeng_r         0.81%
> >     541.leela_r            -0.02%
> >     548.exchange2_r        10.89%  ----------> big improvement
> >     557.xz_r                0.38%
> >     geomean for intrate     1.10%
> >
> >     FP rate
> >     503.bwaves_r            1.41%
> >     507.cactuBSSN_r        -0.14%
> >     508.namd_r              1.54%
> >     510.parest_r           -0.87%
> >     511.povray_r            0.28%
> >     519.lbm_r               0.32%
> >     521.wrf_r              -0.54%
> >     526.blender_r           0.59%
> >     527.cam4_r             -2.70%
> >     538.imagick_r           3.92%
> >     544.nab_r               0.59%
> >     549.fotonik3d_r        -5.44%  -------------> regression
> >     554.roms_r             -2.34%
> >     geomean for fprate     -0.28%
> >
> > The 10% improvement of 548.exchange2_r is because there is a 9-level
> > nested loop, and the trip count of the innermost level is small (enough
> > for 128-bit vectorization, but not for 256-bit vectorization).
> > Since the trip count cannot be determined statically, the vectorizer
> > chooses 256-bit vectorization, whose main loop is never executed. The
> > vectorized epilogue introduces some extra instructions; normally it
> > recovers some performance, but because this is a 9-level nested loop,
> > the cost of the extra instructions outweighs the gain.
> >
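To make the trip-count argument above concrete, here is a minimal sketch (not GCC code) that models how the iterations of a loop are divided between the main vector loop, a vectorized epilogue, and the scalar remainder. The names `split_iterations` and `struct split` are hypothetical, and the element width (32-bit, so VF 8 for 256-bit and VF 4 for 128-bit) and the trip count of 4 are assumptions chosen only for illustration.

```c
#include <stddef.h>

/* Illustrative model only: where do the n iterations of a loop run?
   vf_main is the vectorization factor of the main vector loop,
   vf_epilogue the factor of the vectorized epilogue (0 = no
   vectorized epilogue).  */
struct split
{
  size_t main_vec;      /* iterations done by the main vector loop */
  size_t epilogue_vec;  /* iterations done by the vectorized epilogue */
  size_t scalar;        /* iterations left for the scalar loop */
};

static struct split
split_iterations (size_t n, size_t vf_main, size_t vf_epilogue)
{
  struct split s;
  size_t rest;

  s.main_vec = n - n % vf_main;
  rest = n - s.main_vec;
  s.epilogue_vec = vf_epilogue ? rest - rest % vf_epilogue : 0;
  s.scalar = rest - s.epilogue_vec;
  return s;
}
```

Under these assumptions, `split_iterations (4, 8, 4)` puts zero iterations in the 256-bit main loop and all four in the 128-bit epilogue, so the checks and branches around the main loop are pure overhead, repeated at every level of the nest. `split_iterations (4, 4, 0)` runs all four iterations in the 128-bit main loop directly.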
> > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > vectorization is better than 128-bit vectorization there. Generally,
> > enabling 256-bit or 512-bit vectorization reduces instruction
> > clockticks but also reduces frequency. When the frequency reduction
> > is smaller than the clocktick reduction, the wider vector width is
> > better than the narrower one; otherwise the opposite holds. The
> > regression of 549.fotonik3d_r is due to this, and similarly for
> > 554.roms_r and 527.cam4_r; for these 3 benchmarks, 512-bit
> > vectorization is best.
> >
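The clocktick/frequency trade-off above can be written down as a small sketch, assuming run time is simply clockticks divided by frequency. The function names are hypothetical, and the ratios in the note below are made-up inputs, not measurements from these benchmarks.

```c
/* Illustrative model only: run time = clockticks / frequency.
   Both arguments are ratios of the wide-vector build to the
   narrow-vector build, e.g. 0.80 means "20% fewer clockticks"
   or "20% lower frequency".  */
static double
relative_time (double clocktick_ratio, double frequency_ratio)
{
  return clocktick_ratio / frequency_ratio;
}

/* Nonzero when the wider vector width gives a shorter run time.  */
static int
wider_is_faster (double clocktick_ratio, double frequency_ratio)
{
  return relative_time (clocktick_ratio, frequency_ratio) < 1.0;
}
```

For example, with a hypothetical 20% clocktick reduction and a 10% frequency reduction, `wider_is_faster (0.80, 0.90)` is true. With only a 5% clocktick reduction at the same frequency cost, `wider_is_faster (0.95, 0.90)` is false.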
> > Bootstrap and regression tests on i386 are ok.
> > Ok for trunk?
>
> I don't think 128_optimal does what you think it does.  If you want to
> prefer 128bit AVX adjust the preference, but 128_optimal describes
> a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
But it will set target_prefer_avx128 by default.
------------------------
  /* Enable 128-bit AVX instruction generation
     for the auto-vectorizer.  */
  if (TARGET_AVX128_OPTIMAL
      && (opts_set->x_prefer_vector_width_type == PVW_NONE))
    opts->x_prefer_vector_width_type = PVW_AVX128;
-------------------------
And it may be too confusing to add another tuning flag.
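As a rough standalone model of the snippet quoted above (not the actual i386 option-handling code), the interaction between the tuning flag and an explicit user preference could be sketched like this. `effective_width` and its parameters are hypothetical stand-ins for `TARGET_AVX128_OPTIMAL`, `opts_set->x_prefer_vector_width_type` and `opts->x_prefer_vector_width_type`.

```c
/* Mirrors the PVW_* values used in the quoted snippet.  */
enum prefer_vector_width
{
  PVW_NONE,
  PVW_AVX128,
  PVW_AVX256,
  PVW_AVX512
};

/* Illustrative model only.
   avx128_optimal: whether the tuning flag is on for the target.
   user_set:       what the user explicitly requested
                   (PVW_NONE = nothing requested).
   current:        the value currently stored in the options.  */
static enum prefer_vector_width
effective_width (int avx128_optimal,
                 enum prefer_vector_width user_set,
                 enum prefer_vector_width current)
{
  /* The tuning flag only supplies a default; an explicit user
     choice is left untouched.  */
  if (avx128_optimal && user_set == PVW_NONE)
    return PVW_AVX128;
  return current;
}
```

In this model the flag only changes the default: an explicit `-mprefer-vector-width=256` still wins, which is why it behaves like a preference in practice.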
> and is _not_ intended for "tuning".
>
> Richard.
>
> > Changelog
> >     gcc/
> >             * config/i386/i386-option.c (m_CORE_AVX): New macro.
> >             * config/i386/x86-tune.def: Enable 128_optimal for avx and
> >             replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
> >             * testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
> >             * testsuite/gcc.target/i386/pr84413-2.c: Ditto.
> >             * testsuite/gcc.target/i386/pr84413-3.c: Ditto.
> >             * testsuite/gcc.target/i386/pr70021.c: Ditto.
> >             * testsuite/gcc.target/i386/pr90579.c: New test.
> >
> >
> > --
> > BR,
> > Hongtao



-- 
BR,
Hongtao
