On Wed, 20 Aug 2025, Richard Sandiford wrote:

> Tamar Christina <tamar.christ...@arm.com> writes:
> > To avoid double counting scalar instructions when doing inner-loop
> > costing the vectorizer uses vect_prologue as the kind instead of
> > vect_body.
> >
> > However doing this causes our throughput-based costing to think that
> > the scalar loop issues in 0 cycles (rounded up to 1.0).  The latency
> > costs, however, are fine:
> >
> > note: operating on full vectors.
> > note: Original vector body cost = 10
> > note: Vector loop iterates at most 25000 times
> > note: Scalar issue estimate:
> > note: load operations = 0
> > note: store operations = 0
> > note: general operations = 0
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 1.000000
> > note: estimated cycles per vector iteration (for VF 4) = 4.000000
> > note: Vector issue estimate:
> > note: load operations = 1
> > note: store operations = 0
> > note: general operations = 2
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 1.000000
> > note: Cost model analysis:
> >   Vector inside of loop cost: 10
> >   Vector prologue cost: 4
> >   Vector epilogue cost: 0
> >   Scalar iteration cost: 6
> >   Scalar outside cost: 0
> >   Vector outside cost: 4
> >   prologue iterations: 0
> >   epilogue iterations: 0
> >   Calculated minimum iters for profitability: 2
> > note: Runtime profitability threshold = 4
> > note: Static estimate profitability threshold = 4
> >
> > This patch changes it so that when doing inner-loop costing we
> > consider all statements for throughput costing.  Typically scalar
> > code shouldn't have an epilogue cost, but this also covers any future
> > tweaks to the vectorizer costing where it might be used as a similar
> > trick for costing.  After this patch the throughputs for scalar
> > inner-loop vect are now correct:
> >
> > note: operating on full vectors.
> > note: Original vector body cost = 500
> > note: Vector loop iterates at most 25000 times
> > note: Scalar issue estimate:
> > note: load operations = 50
> > note: store operations = 0
> > note: general operations = 50
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 16.666667
> > note: estimated cycles per vector iteration (for VF 4) = 66.666667
> > note: Vector issue estimate:
> > note: load operations = 1
> > note: store operations = 0
> > note: general operations = 2
> > note: reduction latency = 0
> > note: estimated min cycles per iteration = 1.000000
> >
> > The cost multiplier of 50 is due to a generic multiplier the
> > vectorizer applies to inner-loop costing.
>
> I don't really understand why the scalar costing code is written
> the way that it is.  It uses the count parameter to apply a factor of
> LOOP_VINFO_INNER_LOOP_COST_FACTOR for the inner loop, but then uses
> vect_prologue rather than vect_body to prevent adjust_cost_for_freq
> from applying the same factor:
>
>   /* FORNOW.  */
>   innerloop_iters = 1;
>   if (loop->inner)
>     innerloop_iters = LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo);
>
>   for (i = 0; i < nbbs; i++)
>     {
>       ...
>       basic_block bb = bbs[i];
>
>       if (bb->loop_father == loop->inner)
>         factor = innerloop_iters;
>       else
>         factor = 1;
>
>       ...
>       /* We are using vect_prologue here to avoid scaling twice
>          by the inner loop factor.  */
>       record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
>                         factor, kind, stmt_info, 0, vect_prologue);
>
> Why not just use vect_body rather than vect_prologue for the inner
> loop, and pass 1 instead of factor for all cases?  That seems like it
> gives the target more information and would avoid both the original
> problem and the *50 in the throughput calculations above.
>
> I fear that if we go with the patch as-is, the scalar throughput costs
> will be so inflated that we'll need to find excuses to inflate the
> vector code by the same amount.
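
For concreteness, a minimal sketch of that suggestion, written as a diff
against the snippet quoted above; the comment wording is illustrative,
and where the inner-loop weighting then ends up (the common
add_stmt_cost path or the target hook) is exactly the open question:

-      /* We are using vect_prologue here to avoid scaling twice
-         by the inner loop factor.  */
-      record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
-                        factor, kind, stmt_info, 0, vect_prologue);
+      /* Record a plain vect_body cost with a count of 1; any inner-loop
+         weighting is left to the vect_body scaling applied when the
+         costs are summed via add_stmt_cost.  */
+      record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
+                        1, kind, stmt_info, 0, vect_body);

Under that scheme the innerloop_iters / factor computation above would
presumably no longer be needed for the scalar iteration cost.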
I guess this is a left-over hack for not wanting to adjust targets.
x86 does an unconditional

  /* Statements in an inner loop relative to the loop being vectorized
     are weighted more heavily.  The value here is arbitrary and could
     potentially be improved with analysis.  */
  retval = adjust_cost_for_freq (stmt_info, where, count * stmt_cost);

and adjust_cost_for_freq only applies to vect_body costs.  I suppose the
inner-loop adjustment should simply be removed from the scalar loop
costing above, vect_body used instead, and the scaling left to
add_stmt_cost?

Richard.

> Thanks,
> Richard
>
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     PR target/121290
> >     * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
> >     Include prologue costing when costing inner loop scalar stmts.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR target/121290
> >     * gcc.target/aarch64/pr121290.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index fb8311b655d7300ce22215789437da777b3779ed..1278d563d5efa08b6a9871f69b63f0786877bf99 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -18060,7 +18060,14 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
> >    fractional_cost stmt_cost
> >      = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
> >
> > -  bool in_inner_loop_p = (where == vect_body
> > +  /* When costing for scalars the vectorizer passes them as a vect_prologue
> > +     cost instead of vect_body to avoid double counting of scaled costs.
> > +     But because they are passed as vect_prologue our inner loop detection
> > +     thinks that all the loop statements are outside the inner loop.  As
> > +     such the issue rate information for inner loops becomes zero for
> > +     scalars.  Therefore accept any scalar statement, regardless of the
> > +     where, as part of the inner loop body.  */
> > +  bool in_inner_loop_p = ((where == vect_body || m_costing_for_scalar)
> >                           && stmt_info
> >                           && stmt_in_inner_loop_p (m_vinfo, stmt_info));
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> > new file
> > index 0000000000000000000000000000000000000000..c88cb7e6979ef43ebaf14c3d3f3c4c561bff3e76
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -std=c99" } */
> > +
> > +void
> > +f (int *restrict x, int *restrict y, int *restrict z, int n)
> > +{
> > +  for (int i = 0; i < 4; ++i)
> > +    {
> > +      int res = 0;
> > +      for (int j = 0; j < 100; ++j)
> > +        res += y[j] * z[i];
> > +      x[i] = res;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump-not "load operations = 0" "vect" } } */

-- 
Richard Biener <rguent...@suse.de>
SUSE Software Solutions Germany GmbH,
Frankenstrasse 146, 90461 Nuernberg, Germany;
GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)