On Tue, Nov 12, 2019 at 4:41 PM Richard Biener <richard.guent...@gmail.com> wrote:
>
> On Tue, Nov 12, 2019 at 9:29 AM Hongtao Liu <crazy...@gmail.com> wrote:
> >
> > On Tue, Nov 12, 2019 at 4:19 PM Richard Biener
> > <richard.guent...@gmail.com> wrote:
> > >
> > > On Tue, Nov 12, 2019 at 8:36 AM Hongtao Liu <crazy...@gmail.com> wrote:
> > > >
> > > > Hi:
> > > >   This patch is about to set X86_TUNE_AVX128_OPTIMAL as default for
> > > > all AVX targets because we found there's still a performance gap
> > > > between 128-bit auto-vectorization and 256-bit auto-vectorization
> > > > even with the epilog vectorized.
> > > >   The performance influence of setting avx128_optimal as default on
> > > > SPEC2017 with option "-march=native -funroll-loops -Ofast -flto" on
> > > > CLX is as below:
> > > >
> > > > INT rate
> > > > 500.perlbench_r      -0.32%
> > > > 502.gcc_r            -1.32%
> > > > 505.mcf_r            -0.12%
> > > > 520.omnetpp_r        -0.34%
> > > > 523.xalancbmk_r      -0.65%
> > > > 525.x264_r            2.23%
> > > > 531.deepsjeng_r       0.81%
> > > > 541.leela_r          -0.02%
> > > > 548.exchange2_r      10.89%  ----------> big improvement
> > > > 557.xz_r              0.38%
> > > > geomean for intrate   1.10%
> > > >
> > > > FP rate
> > > > 503.bwaves_r          1.41%
> > > > 507.cactuBSSN_r      -0.14%
> > > > 508.namd_r            1.54%
> > > > 510.parest_r         -0.87%
> > > > 511.povray_r          0.28%
> > > > 519.lbm_r             0.32%
> > > > 521.wrf_r            -0.54%
> > > > 526.blender_r         0.59%
> > > > 527.cam4_r           -2.70%
> > > > 538.imagick_r         3.92%
> > > > 544.nab_r             0.59%
> > > > 549.fotonik3d_r      -5.44%  -------------> regression
> > > > 554.roms_r           -2.34%
> > > > geomean for fprate   -0.28%
> > > >
> > > > The 10% improvement of 548.exchange2_r is because there is a 9-layer
> > > > nested loop, and the loop count for the innermost layer is small
> > > > (enough for 128-bit vectorization, but not for 256-bit vectorization).
> > > > Since the loop count is not statically analyzed out, the vectorizer
> > > > will choose 256-bit vectorization which would never be triggered.
> > > > The vectorization of the epilog will introduce some extra
> > > > instructions; normally it will bring back some performance, but since
> > > > it's a 9-layer nested loop, the cost of the extra instructions will
> > > > outweigh the gain.
> > > >
> > > > The 5.44% regression of 549.fotonik3d_r is because 256-bit
> > > > vectorization is better than 128-bit vectorization. Generally when
> > > > enabling 256-bit or 512-bit vectorization, there will be a reduction
> > > > in instruction clockticks but also a frequency reduction. When the
> > > > frequency reduction is less than the clocktick reduction, longer
> > > > vector width vectorization is better than the shorter one; otherwise
> > > > the opposite holds. The regression of 549.fotonik3d_r is due to this,
> > > > and similarly for 554.roms_r and 527.cam4_r; for those 3 benchmarks,
> > > > 512-bit vectorization is best.
> > > >
> > > > Bootstrap and regression test on i386 is ok.
> > > > Ok for trunk?
> > >
> > > I don't think 128_optimal does what you think it does.  If you want to
> > > prefer 128bit AVX adjust the preference, but 128_optimal describes
> > > a microarchitectural detail (AVX256 ops are split into two AVX128 ops)
> > But it will set target_prefer_avx128 by default.
> > ------------------------
> >   /* Enable 128-bit AVX instruction generation
> >      for the auto-vectorizer.  */
> >   if (TARGET_AVX128_OPTIMAL
> >       && (opts_set->x_prefer_vector_width_type == PVW_NONE))
> >     opts->x_prefer_vector_width_type = PVW_AVX128;
> > -------------------------
> > And it may be too confusing to add another tuning flag.
>
> Well, it's confusing to mix two things - defaulting the vector width
> preference and the architectural detail of Bulldozer and early Zen parts.
> So please split the tuning.  And then re-benchmark with _just_ changing
> the preference

Actually, the result is similar; I've tested both (the patch using
avx128_optimal, and trunk GCC with an additional -mprefer-vector-width=128).
And I will run a test to see the effect of FDO.

> but not enabling the architectural detail which isn't true for any Intel
> parts AFAIK.
>
> Richard.
>
> > > and is _not_ intended for "tuning".
> > >
> > > Richard.
> > >
> > > > Changelog
> > > > gcc/
> > > >         * config/i386/i386-option.c (m_CORE_AVX): New macro.
> > > >         * config/i386/x86-tune.def: Enable 128_optimal for avx and
> > > >         replace m_SANDYBRIDGE | m_CORE_AVX2 with m_CORE_AVX.
> > > >         * testsuite/gcc.target/i386/pr84413-1.c: Adjust testcase.
> > > >         * testsuite/gcc.target/i386/pr84413-2.c: Ditto.
> > > >         * testsuite/gcc.target/i386/pr84413-3.c: Ditto.
> > > >         * testsuite/gcc.target/i386/pr70021.c: Ditto.
> > > >         * testsuite/gcc.target/i386/pr90579.c: New test.
> > > >
> > > >
> > > > --
> > > > BR,
> > > > Hongtao
> >
> >
> >
> > --
> > BR,
> > Hongtao

--
BR,
Hongtao
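
To illustrate the short-trip-count point made for 548.exchange2_r above,
here is a minimal sketch in C. It is only a model, not code from GCC or
from the patch: the helper name split_loop, the struct loop_split, and the
sample trip count of 5 with 32-bit elements are all assumptions made for
illustration (the thread does not state the real inner-loop count). It
shows how many iterations a loop would spend in a vectorized main loop
versus the leftover tail for a given vector width.

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC: models how a loop of `trip_count`
   iterations splits between a vectorized main loop and a leftover tail. */
struct loop_split {
    unsigned vector_iters;  /* iterations executed by the vector main loop */
    unsigned tail_iters;    /* elements left for the epilogue/scalar tail */
};

static struct loop_split split_loop(unsigned trip_count,
                                    unsigned vector_bits,
                                    unsigned elem_bits)
{
    /* Guard against a zero element size or a vector narrower than one
       element, which would make the lane count meaningless. */
    assert(elem_bits > 0 && vector_bits >= elem_bits);

    /* Number of elements processed per vector iteration. */
    unsigned lanes = vector_bits / elem_bits;

    struct loop_split s = { trip_count / lanes, trip_count % lanes };
    return s;
}
```

Under these assumptions, split_loop(5, 256, 32) gives 0 vector iterations
and 5 tail elements, so the 256-bit main loop is never entered and only the
setup and epilogue overhead is paid. split_loop(5, 128, 32) gives 1 vector
iteration and 1 tail element, so the 128-bit main loop does useful work.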