On Wed, 20 Aug 2025, Richard Sandiford wrote:

> Tamar Christina <tamar.christ...@arm.com> writes:
> > To avoid double counting scalar instructions when doing inner-loop
> > costing the vectorizer uses vect_prologue as the kind instead of
> > vect_body.
> >
> > However doing this causes our throughput-based costing to think that
> > the scalar loop issues in 0 cycles (rounded up to 1.0).  The latency
> > costs, however, are fine:
> >
> > note: operating on full vectors.
> > note: Original vector body cost = 10
> > note: Vector loop iterates at most 25000 times
> > note: Scalar issue estimate:
> > note: load operations = 0
> > note: store operations = 0
> > note: general operations = 0
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 1.000000
> > note: estimated cycles per vector iteration (for VF 4) = 4.000000
> > note: Vector issue estimate:
> > note: load operations = 1
> > note: store operations = 0
> > note: general operations = 2
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 1.000000
> > note: Cost model analysis:
> >   Vector inside of loop cost: 10
> >   Vector prologue cost: 4
> >   Vector epilogue cost: 0
> >   Scalar iteration cost: 6
> >   Scalar outside cost: 0
> >   Vector outside cost: 4
> >   prologue iterations: 0
> >   epilogue iterations: 0
> >   Calculated minimum iters for profitability: 2
> > note: Runtime profitability threshold = 4
> > note: Static estimate profitability threshold = 4
> >
> > This patch changes it so that when doing inner-loop costing we
> > consider all statements for throughput costing.  Typically scalar
> > code shouldn't have an epilogue cost, but this also covers any future
> > tweaks to the vectorizer costing where it might be used as a similar
> > trick for costing.  After this patch the throughputs for scalar
> > inner-loop vect are now correct:
> >
> > note: operating on full vectors.
> > note: Original vector body cost = 500
> > note: Vector loop iterates at most 25000 times
> > note: Scalar issue estimate:
> > note: load operations = 50
> > note: store operations = 0
> > note: general operations = 50
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 16.666667
> > note: estimated cycles per vector iteration (for VF 4) = 66.666667
> > note: Vector issue estimate:
> > note: load operations = 1
> > note: store operations = 0
> > note: general operations = 2
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 1.000000
> >
> > The cost multiplier of 50 is due to a generic multiplier the
> > vectorizer applies to inner-loop costing.
>
> I don't really understand why the scalar costing code is written
> the way that it is.  It uses the count parameter to apply a factor of
> LOOP_VINFO_INNER_LOOP_COST_FACTOR for the inner loop, but then uses
> vect_prologue rather than vect_body to prevent adjust_cost_for_freq
> from applying the same factor:
>
>   /* FORNOW.  */
>   innerloop_iters = 1;
>   if (loop->inner)
>     innerloop_iters = LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo);
>
>   for (i = 0; i < nbbs; i++)
>     {
>       ...
>       basic_block bb = bbs[i];
>
>       if (bb->loop_father == loop->inner)
>         factor = innerloop_iters;
>       else
>         factor = 1;
>
>       ...
>       /* We are using vect_prologue here to avoid scaling twice
>          by the inner loop factor.  */
>       record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
>                         factor, kind, stmt_info, 0, vect_prologue);
>
> Why not just use vect_body rather than vect_prologue for the inner
> loop, and pass 1 instead of factor for all cases?  That seems like it
> gives the target more information and would avoid both the original
> problem and the *50 in the throughput calculations above.
>
> I fear that if we go with the patch as-is, the scalar throughput costs
> will be so inflated that we'll need to find excuses to inflate the
> vector code by the same amount.
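
For concreteness, a minimal sketch of that suggestion, written as a diff
against the snippet quoted above; the comment wording is illustrative,
and where the inner-loop weighting then ends up (the common
add_stmt_cost path or the target hook) is exactly the open question:

-      /* We are using vect_prologue here to avoid scaling twice
-         by the inner loop factor.  */
-      record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-                        factor, kind, stmt_info, 0, vect_prologue);
+      /* Record a plain vect_body cost with a count of 1; any inner-loop
+         weighting is left to the vect_body scaling applied when the
+         costs are summed via add_stmt_cost.  */
+      record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
+                        1, kind, stmt_info, 0, vect_body);

Under that scheme the innerloop_iters / factor computation above would
presumably no longer be needed for the scalar iteration cost.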
I guess this is a left-over hack for not wanting to adjust targets.
x86 does an unconditional

  /* Statements in an inner loop relative to the loop being vectorized
     are weighted more heavily.  The value here is arbitrary and could
     potentially be improved with analysis.  */
  retval = adjust_cost_for_freq (stmt_info, where, count * stmt_cost);

and adjust_cost_for_freq only applies to vect_body costs.  I suppose the
inner-loop adjustment should simply be removed from the scalar loop
costing above, vect_body used instead, and the scaling left to
add_stmt_cost?

Richard.

> Thanks,
> Richard
>
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     PR target/121290
> >     * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
> >     Include prologue costing when costing inner loop scalar stmts.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR target/121290
> >     * gcc.target/aarch64/pr121290.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index fb8311b655d7300ce22215789437da777b3779ed..1278d563d5efa08b6a9871f69b63f0786877bf99 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -18060,7 +18060,14 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >    fractional_cost stmt_cost
> >      = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
> >
> > -  bool in_inner_loop_p = (where == vect_body
> > +  /* When costing for scalars the vectorizer passes them as a vect_prologue
> > +     cost instead of vect_body to avoid double counting of scaled costs.
> > +     But because they are passed as vect_prologue our inner loop detection
> > +     thinks that all the loop statements are outside the inner loop.  As
> > +     such the issue rate information for inner loops becomes zero for
> > +     scalars.  Therefore accept any scalar statement, regardless of the
> > +     where, as part of the inner loop body.  */
> > +  bool in_inner_loop_p = ((where == vect_body || m_costing_for_scalar)
> >                           && stmt_info
> >                           && stmt_in_inner_loop_p (m_vinfo, stmt_info));
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> > new file
> > index 0000000000000000000000000000000000000000..c88cb7e6979ef43ebaf14c3d3f3c4c561bff3e76
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -std=c99" } */
> > +
> > +void
> > +f (int *restrict x, int *restrict y, int *restrict z, int n)
> > +{
> > +  for (int i = 0; i < 4; ++i)
> > +    {
> > +      int res = 0;
> > +      for (int j = 0; j < 100; ++j)
> > +        res += y[j] * z[i];
> > +      x[i] = res;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump-not "load operations = 0" "vect" } } */

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)