Tamar Christina <tamar.christ...@arm.com> writes:
> To avoid double counting scalar instructions when doing inner-loop costing,
> the vectorizer uses vect_prologue as the cost location instead of vect_body.
>
> However, doing this causes our throughput-based costing to think the scalar
> loop issues in 0 cycles (rounded up to 1.0).  The latency costs, however,
> are fine:
>
> note:  operating on full vectors.
> note:  Original vector body cost = 10
> note:  Vector loop iterates at most 25000 times
> note:  Scalar issue estimate:
> note:    load operations = 0
> note:    store operations = 0
> note:    general operations = 0
> note:    reduction latency = 0
> note:    estimated min cycles per iteration = 1.000000
> note:    estimated cycles per vector iteration (for VF 4) = 4.000000
> note:  Vector issue estimate:
> note:    load operations = 1
> note:    store operations = 0
> note:    general operations = 2
> note:    reduction latency = 0
> note:    estimated min cycles per iteration = 1.000000
> note:  Cost model analysis:
>   Vector inside of loop cost: 10
>   Vector prologue cost: 4
>   Vector epilogue cost: 0
>   Scalar iteration cost: 6
>   Scalar outside cost: 0
>   Vector outside cost: 4
>   prologue iterations: 0
>   epilogue iterations: 0
>   Calculated minimum iters for profitability: 2
> note:    Runtime profitability threshold = 4
> note:    Static estimate profitability threshold = 4
>
> This patch changes it so that when doing inner-loop costing we consider all
> statements for throughput costing.  Typically scalar code shouldn't have an
> epilogue cost, but this also covers any future tweaks to the vectorizer
> costing where a similar trick might be used.  After this patch the
> throughputs for scalar inner-loop statements are now correct:
>
> note:  operating on full vectors.
> note:  Original vector body cost = 500
> note:  Vector loop iterates at most 25000 times
> note:  Scalar issue estimate:
> note:    load operations = 50
> note:    store operations = 0
> note:    general operations = 50
> note:    reduction latency = 0
> note:    estimated min cycles per iteration = 16.666667
> note:    estimated cycles per vector iteration (for VF 4) = 66.666667
> note:  Vector issue estimate:
> note:    load operations = 1
> note:    store operations = 0
> note:    general operations = 2
> note:    reduction latency = 0
> note:    estimated min cycles per iteration = 1.000000
>
> The cost multiplier of 50 is due to a generic multiplier the vectorizer
> applies to inner-loop costing.

I don't really understand why the scalar costing code is written
the way that it is.  It uses the count parameter to apply a factor of
LOOP_VINFO_INNER_LOOP_COST_FACTOR for the inner loop, but then uses
vect_prologue rather than vect_body to prevent adjust_cost_for_freq from
applying the same factor:

  /* FORNOW.  */
  innerloop_iters = 1;
  if (loop->inner)
    innerloop_iters = LOOP_VINFO_INNER_LOOP_COST_FACTOR (loop_vinfo);

  for (i = 0; i < nbbs; i++)
    {
      ...
      basic_block bb = bbs[i];

      if (bb->loop_father == loop->inner)
        factor = innerloop_iters;
      else
        factor = 1;

          ...
          /* We are using vect_prologue here to avoid scaling twice
             by the inner loop factor.  */
          record_stmt_cost (&LOOP_VINFO_SCALAR_ITERATION_COST (loop_vinfo),
                            factor, kind, stmt_info, 0, vect_prologue);

Why not just use vect_body rather than vect_prologue for the inner loop,
and pass 1 instead of factor for all cases?  That seems like it gives the
target more information and would avoid both the original problem and
the *50 in the throughput calculations above.

I fear that if we go with the patch as-is, the scalar throughput costs
will be so inflated that we'll need to find excuses to inflate the
vector costs by the same amount.

Thanks,
Richard

>
> Bootstrapped Regtested on aarch64-none-linux-gnu and no issues.
>
> Ok for master?
>
> Thanks,
> Tamar
>
> gcc/ChangeLog:
>
>       PR target/121290
>       * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
>       Include prologue costing when costing inner loop scalar stmts.
>
> gcc/testsuite/ChangeLog:
>
>       PR target/121290
>       * gcc.target/aarch64/pr121290.c: New test.
>
> ---
>
> diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> index fb8311b655d7300ce22215789437da777b3779ed..1278d563d5efa08b6a9871f69b63f0786877bf99 100644
> --- a/gcc/config/aarch64/aarch64.cc
> +++ b/gcc/config/aarch64/aarch64.cc
> @@ -18060,7 +18060,14 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>    fractional_cost stmt_cost
>      = aarch64_builtin_vectorization_cost (kind, vectype, misalign);
>  
> -  bool in_inner_loop_p = (where == vect_body
> +  /* When costing for scalars the vectorizer passes them as vect_prologue
> +     costs instead of vect_body to avoid double counting of scaled costs.
> +     But because they are passed as vect_prologue our inner loop detection
> +     thinks that all the loop statements are outside the inner loop.  As such
> +     the issue rate information for inner loops becomes zero for scalars.
> +     Therefore accept any scalar statement, regardless of the where, as part
> +     of the inner loop body.  */
> +  bool in_inner_loop_p = ((where == vect_body || m_costing_for_scalar)
>                         && stmt_info
>                         && stmt_in_inner_loop_p (m_vinfo, stmt_info));
>  
> diff --git a/gcc/testsuite/gcc.target/aarch64/pr121290.c b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> new file mode 100644
> index 0000000000000000000000000000000000000000..c88cb7e6979ef43ebaf14c3d3f3c4c561bff3e76
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/pr121290.c
> @@ -0,0 +1,17 @@
> +/* { dg-do compile } */
> +/* { dg-additional-options "-O3 -mcpu=neoverse-v2 -fdump-tree-vect-all -std=c99" } */
> +
> +void
> +f (int *restrict x, int *restrict y, int *restrict z, int n)
> +{
> +  for (int i = 0; i < 4; ++i)
> +    {
> +      int res = 0;
> +      for (int j = 0; j < 100; ++j)
> +        res += y[j] * z[i];
> +      x[i] = res;
> +    }
> +}
> +
> +/* { dg-final { scan-tree-dump "LOOP VECTORIZED" "vect" } } */
> +/* { dg-final { scan-tree-dump-not "load operations = 0" "vect" } } */
