RE: [PATCH 1/2]AArch64: Fix costing of scalar throughput based calculation for inner loops [PR121290]

Tamar Christina Wed, 20 Aug 2025 22:16:22 -0700

> -----Original Message-----
> From: Richard Sandiford <richard.sandif...@arm.com>
> Sent: Wednesday, August 20, 2025 11:13 PM
> To: Tamar Christina <tamar.christ...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
> <richard.earns...@arm.com>; ktkac...@gcc.gnu.org; Alex Coplan
> <alex.cop...@arm.com>; andrew.pin...@oss.qualcomm.com;
> rguent...@suse.de
> Subject: Re: [PATCH 1/2]AArch64: Fix costing of scalar throughput based
> calculation for inner loops [PR121290]
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
> > To avoid double counting scalar instructions when doing inner-loop costing 
> > the
> > vectorizer uses vect_prologue as the kind instead of vect_body.
> >
> > However doing this results in our throughput based costing to think the 
> > scalar
> > loop issues in 0 cycles (rounded up to 1.0).  The latency costs however are 
> > fine
> >
> > note:  operating on full vectors.
> > note:  Original vector body cost = 10
> > note:  Vector loop iterates at most 25000 times
> > note:  Scalar issue estimate:
> > note:    load operations = 0
> > note:    store operations = 0
> > note:    general operations = 0
> > note:    reduction latency = 0
> > note:    estimated min cycles per iteration = 1.000000
> > note:    estimated cycles per vector iteration (for VF 4) = 4.000000
> > note:  Vector issue estimate:
> > note:    load operations = 1
> > note:    store operations = 0
> > note:    general operations = 2
> > note:    reduction latency = 0
> > note:    estimated min cycles per iteration = 1.000000
> > note:  Cost model analysis:
> >   Vector inside of loop cost: 10
> >   Vector prologue cost: 4
> >   Vector epilogue cost: 0
> >   Scalar iteration cost: 6
> >   Scalar outside cost: 0
> >   Vector outside cost: 4
> >   prologue iterations: 0
> >   epilogue iterations: 0
> >   Calculated minimum iters for profitability: 2
> > note:    Runtime profitability threshold = 4
> > note:    Static estimate profitability threshold = 4
> >
> > This patch, changes it so that when doing inner-loop costing consider all
> > statements for throughput costing.  Typically scalar shouldn't have an 
> > epilogue
> > cost but this also covers any future tweaks to the vectorizer costing where 
> > it
> > might be used as a similar trick for costing.  After this patch the 
> > throughputs
> > for scalar innner-loop vect are now correct:
> >
> > note:  operating on full vectors.
> > note:  Original vector body cost = 500
> > note:  Vector loop iterates at most 25000 times
> > note:  Scalar issue estimate:
> > note:    load operations = 50
> > note:    store operations = 0
> > note:    general operations = 50
> > note:    reduction latency = 0
> > note:    estimated min cycles per iteration = 16.666667
> > note:    estimated cycles per vector iteration (for VF 4) = 66.666667
> > note:  Vector issue estimate:
> > note:    load operations = 1
> > note:    store operations = 0
> > note:    general operations = 2
> > note:    reduction latency = 0
> > note:    estimated min cycles per iteration = 1.000000
> >
> > The cost multiplier of 50 is due to a generic multiplier the vectorizer 
> > applies
> > to inner loop costing.
> 
> I don't really understand why the scalar costing code is written
> the way that it is.  It uses the count parameter to apply a factor of
> LOOP_VINFO_INNER_LOOP_COST_FACTOR for the inner loop, but then uses
> vect_prologue rather than vect_body to prevent adjust_cost_for_freq from
> applying the same factor:
> 
>   /* FORNOW.  */
>   innerloop_iters = 1;
>   if (loop->inner)
>     innerloop_iters = LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo);
> 
>   for (i = 0; i < nbbs; i++)
>     {
>       ...
>       basic_block bb = bbs[i];
> 
>       if (bb->loop_father == loop->inner)
>         factor = innerloop_iters;
>       else
>         factor = 1;
> 
>           ...
>         /* We are using vect_prologue here to avoid scaling twice
>            by the inner loop factor.  */
>         record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST
> (loop_vinfo),
>                           factor, kind, stmt_info, 0, vect_prologue);
> 
> Why not just use vect_body rather than vect_prologue for the inner loop,
> and pass 1 instead of factor for all cases?  That seems like it gives the
> target more information and would avoid both the original problem and
> the *50 in the throughput calculations above.
>


Sure, when I looked into the history the change was made for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84512

> I fear that if we go with the patch as-is, the scalar througput costs
> will be so inflated that we'll need to find excuses to inflate the
> vector code by the same amount.
> 

Sure, as I mentioned in the cover letter, today the costing is completely
out of wack, because the factor is being used for the latency calculations
already.

*_7 1 times scalar_load costs 4 in prologue
*_3 50 times scalar_load costs 200 in prologue
_4 * _8 50 times scalar_stmt costs 50 in prologue
_9 + res_30 50 times scalar_stmt costs 0 in prologue
res_45 1 times scalar_store costs 1 in prologue

with the results

  Vector inside of loop cost: 505
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 255
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1

The patch merely attempts to make the throughput code match the
latency restrictions.  It seems more wrong to me to not do it because
the throughput code says scalars are free and the latency says scalars
are bloody expensive.

So this is just fixing that disparity and using what the middle end says.

Thanks,
Tamar

> Thanks,
> Richard
> 
> >
> > Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
> >
> > Ok for master?
> >
> > Thanks,
> > Tamar
> >
> > gcc/ChangeLog:
> >
> >     PR target/121290
> >     * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
> >     Include prologue costing when costing inner loop scalar stmts.
> >
> > gcc/testsuite/ChangeLog:
> >
> >     PR target/121290
> >     * gcc.target/aarch64/pr121290.c: New test.
> >
> > ---
> >
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index
> fb8311b655d7300ce22215789437da777b3779ed..1278d563d5efa08b6a9871f
> 69b63f0786877bf99 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -18060,7 +18060,14 @@ aarch64_vector_costs::add_stmt_cost (int count,
> vect_cost_for_stmt kind,
> >    fractional_cost stmt_cost
> >      = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
> >
> > -  bool in_inner_loop_p = (where == vect_body
> > +  /* When costing for scalars the vectorizer passes them as vect_prologue 
> > cost
> > +     instead of vect_body to avoid double counting of scaled costs.  But 
> > because
> > +     they are passed as vect_prologue our inner loop detection thinks that 
> > all
> > +     the loop statements are outside the inner loop.  As such the issue 
> > rate
> > +     information for inner loops become zero for scalars.  Therefore 
> > accept any
> > +     scalar statement, relardless of the where as part of the innerloop
> > +     body.  */
> > +  bool in_inner_loop_p = ((where == vect_body || m_costing_for_scalar)
> >                       && stmt_info
> >                       && stmt_in_inner_loop_p (m_vinfo, stmt_info));
> >
> > diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c
> b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> > new file mode 100644
> > index
> 0000000000000000000000000000000000000000..c88cb7e6979ef43ebaf14c3
> d3f3c4c561bff3e76
> > --- /dev/null
> > +++ b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> > @@ -0,0 +1,17 @@
> > +/* { dg-do compile } */
> > +/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -
> std=c99" } */
> > +
> > +void
> > +f (int *restrict x, int *restrict y, int *restrict z, int n)
> > +{
> > +  for (int i = 0; i < 4; ++i)
> > +    {
> > +      int res = 0;
> > +      for (int j = 0; j < 100; ++j)
> > +        res += y[j] * z[i];
> > +      x[i] = res;
> > +    }
> > +}
> > +
> > +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> > +/* { dg-final { scan-tree-dump-not "load operations = 0" "vect" } } */

RE: [PATCH 1/2]AArch64: Fix costing of scalar throughput based calculation for inner loops [PR121290]

Reply via email to