On Wed, Sep 11, 2024 at 4:04 PM Richard Biener
<[email protected]> wrote:
>
> On Wed, Sep 11, 2024 at 4:17 AM liuhongt <[email protected]> wrote:
> >
> > GCC 12 enables vectorization at -O2 with the very cheap cost model, which
> > is restricted to a constant trip count. The vectorization capability is
> > very limited, with consideration of the code-size impact.
> >
> > The patch extends the very cheap cost model a little bit to support a
> > variable trip count, but still disables peeling for gaps/alignment,
> > runtime alias checks and epilogue vectorization, to limit code-size
> > growth.
> >
> > So there are at most 2 versions of the loop with O2 vectorization: one
> > vectorized main loop and one scalar/remainder loop.
> >
> > E.g.
> >
> > void
> > foo1 (int* __restrict a, int* b, int* c, int n)
> > {
> > for (int i = 0; i != n; i++)
> > a[i] = b[i] + c[i];
> > }
> >
> > with -O2 -march=x86-64-v3, will be vectorized to
> >
> > .L10:
> > vmovdqu (%r8,%rax), %ymm0
> > vpaddd (%rsi,%rax), %ymm0, %ymm0
> > vmovdqu %ymm0, (%rdi,%rax)
> > addq $32, %rax
> > cmpq %rdx, %rax
> > jne .L10
> > movl %ecx, %eax
> > andl $-8, %eax
> > cmpl %eax, %ecx
> > je .L21
> > vzeroupper
> > .L12:
> > movl (%r8,%rax,4), %edx
> > addl (%rsi,%rax,4), %edx
> > movl %edx, (%rdi,%rax,4)
> > addq $1, %rax
> > cmpl %eax, %ecx
> > jne .L12
> >
> > As measured with SPEC2017 on EMR, the patch (N-Iter) improves performance
> > by 4.11% with an extra 2.8% code size, while the cheap cost model improves
> > performance by 5.74% with an extra 8.88% code size. The details are below.
>
> I'm confused by this -- are the N-Iter numbers on top of the cheap cost
> model numbers?
No, it's N-Iter vs. base (the very cheap cost model), and cheap vs. base.
>
> > Performance measured with -march=x86-64-v3 -O2 on EMR
> >
> > N-Iter cheap cost model
> > 500.perlbench_r -0.12% -0.12%
> > 502.gcc_r 0.44% -0.11%
> > 505.mcf_r 0.17% 4.46%
> > 520.omnetpp_r 0.28% -0.27%
> > 523.xalancbmk_r 0.00% 5.93%
> > 525.x264_r -0.09% 23.53%
> > 531.deepsjeng_r 0.19% 0.00%
> > 541.leela_r 0.22% 0.00%
> > 548.exchange2_r -11.54% -22.34%
> > 557.xz_r 0.74% 0.49%
> > GEOMEAN INT -1.04% 0.60%
> >
> > 503.bwaves_r 3.13% 4.72%
> > 507.cactuBSSN_r 1.17% 0.29%
> > 508.namd_r 0.39% 6.87%
> > 510.parest_r 3.14% 8.52%
> > 511.povray_r 0.10% -0.20%
> > 519.lbm_r -0.68% 10.14%
> > 521.wrf_r 68.20% 76.73%
>
> So this seems to regress as well?
N-Iter improves performance less than the cheap cost model does; that's
expected, not a regression.
>
> > 526.blender_r 0.12% 0.12%
> > 527.cam4_r 19.67% 23.21%
> > 538.imagick_r 0.12% 0.24%
> > 544.nab_r 0.63% 0.53%
> > 549.fotonik3d_r 14.44% 9.43%
> > 554.roms_r 12.39% 0.00%
> > GEOMEAN FP 8.26% 9.41%
> > GEOMEAN ALL 4.11% 5.74%
> >
> > Code size impact
> > N-Iter cheap cost model
> > 500.perlbench_r 0.22% 1.03%
> > 502.gcc_r 0.25% 0.60%
> > 505.mcf_r 0.00% 32.07%
> > 520.omnetpp_r 0.09% 0.31%
> > 523.xalancbmk_r 0.08% 1.86%
> > 525.x264_r 0.75% 7.96%
> > 531.deepsjeng_r 0.72% 3.28%
> > 541.leela_r 0.18% 0.75%
> > 548.exchange2_r 8.29% 12.19%
> > 557.xz_r 0.40% 0.60%
> > GEOMEAN INT 1.07% 5.71%
> >
> > 503.bwaves_r 12.89% 21.59%
> > 507.cactuBSSN_r 0.90% 20.19%
> > 508.namd_r 0.77% 14.75%
> > 510.parest_r 0.91% 3.91%
> > 511.povray_r 0.45% 4.08%
> > 519.lbm_r 0.00% 0.00%
> > 521.wrf_r 5.97% 12.79%
> > 526.blender_r 0.49% 3.84%
> > 527.cam4_r 1.39% 3.28%
> > 538.imagick_r 1.86% 7.78%
> > 544.nab_r 0.41% 3.00%
> > 549.fotonik3d_r 25.50% 47.47%
> > 554.roms_r 5.17% 13.01%
> > GEOMEAN FP 4.14% 11.38%
> > GEOMEAN ALL 2.80% 8.88%
> >
> >
> > The only regression is in 548.exchange2_r: vectorizing the inner loop in
> > each layer of the 9-level loop nest increases register pressure and
> > causes more spills.
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > .....
> > - block(rnext:9, 9, i9) = block(rnext:9, 9, i9) + 10
> > ...
> > - block(rnext:9, 2, i2) = block(rnext:9, 2, i2) + 10
> > - block(rnext:9, 1, i1) = block(rnext:9, 1, i1) + 10
> >
> > aarch64 doesn't seem to have this issue because it has 32 GPRs, while
> > x86-64 only has 16. I have an extra patch for the x86 backend that
> > prevents loop vectorization in deeply nested loops, which brings the
> > performance back.
> >
> > For 503.bwaves_r/505.mcf_r/507.cactuBSSN_r/508.namd_r, the cheap cost
> > model increases code size a lot but doesn't improve performance at all;
> > N-Iter is much better there for code size.
> >
> >
> > Any comments?
> >
> >
> > gcc/ChangeLog:
> >
> > * tree-vect-loop.cc (vect_analyze_loop_costing): Enable
> > vectorization for LOOP_VINFO_PEELING_FOR_NITER in very cheap
> > cost model.
> > (vect_analyze_loop): Disable epilogue vectorization in very
> > cheap cost model.
> > ---
> > gcc/tree-vect-loop.cc | 6 +++---
> > 1 file changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/gcc/tree-vect-loop.cc b/gcc/tree-vect-loop.cc
> > index 242d5e2d916..06afd8cae79 100644
> > --- a/gcc/tree-vect-loop.cc
> > +++ b/gcc/tree-vect-loop.cc
> > @@ -2356,8 +2356,7 @@ vect_analyze_loop_costing (loop_vec_info loop_vinfo,
> > a copy of the scalar code (even if we might be able to vectorize it).  */
> > if (loop_cost_model (loop) == VECT_COST_MODEL_VERY_CHEAP
> > && (LOOP_VINFO_PEELING_FOR_ALIGNMENT (loop_vinfo)
> > - || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)
> > - || LOOP_VINFO_PEELING_FOR_NITER (loop_vinfo)))
> > + || LOOP_VINFO_PEELING_FOR_GAPS (loop_vinfo)))
>
> I notice that we should probably not call vect_enhance_data_refs_alignment
> because when alignment peeling is optional we should avoid it rather than
> disabling the vectorization completely.
>
> Also if you allow peeling for niter then there's no good reason to not
> allow peeling
> for gaps (or any other epilogue peeling).
Maybe, I just want to be conservative.
>
> The extra cost for niter peeling is a runtime check before the loop, which
> would also happen (plus keeping the scalar copy) when there's a runtime
> cost check. That also means versioning for alias/alignment could be
> allowed if it shares the scalar loop with the epilogue (I don't remember
> the constraints we set in place for the sharing).
Yes, but in current GCC the runtime alias check creates a separate scalar
loop: https://godbolt.org/z/9seoWePKK
So enabling the runtime alias check could increase code size a lot without
any performance improvement.
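For illustration, here is a minimal sketch of the kind of loop that needs
that runtime alias check -- essentially foo1 without __restrict; the
function name add_arrays is hypothetical, not from the patch:

```c
/* Hypothetical variant of foo1 without __restrict: the compiler cannot
   prove that a[] does not overlap b[] or c[], so vectorizing it safely
   requires a runtime alias check that versions the loop -- one vectorized
   copy plus a full scalar copy.  The very cheap cost model refuses that
   extra scalar copy to limit code size.  */
void
add_arrays (int *a, int *b, int *c, int n)
{
  for (int i = 0; i != n; i++)
    a[i] = b[i] + c[i];
}
```

A call like add_arrays (buf + 1, buf, buf, n) makes each store feed the
next iteration's loads, so a blindly vectorized loop would compute different
results than the scalar loop -- exactly the case the runtime check guards
against.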
>
> Richard.
>
> > {
> > if (dump_enabled_p ())
> > dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> > @@ -3638,7 +3637,8 @@ vect_analyze_loop (class loop *loop, vec_info_shared *shared)
> > /* No code motion support for multiple epilogues so for now
> > not supported when multiple exits. */
> > && !LOOP_VINFO_EARLY_BREAKS (first_loop_vinfo)
> > - && !loop->simduid);
> > + && !loop->simduid
> > + && loop_cost_model (loop) > VECT_COST_MODEL_VERY_CHEAP);
> > if (!vect_epilogues)
> > return first_loop_vinfo;
> >
> > --
> > 2.31.1
> >
--
BR,
Hongtao