On Fri, Apr 17, 2020 at 5:21 PM Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> This patch fixes a large lmbench performance regression with
> 128-bit SVE, compiled in length-agnostic mode.
>
> vect_better_loop_vinfo_p (new in GCC 10) tries to estimate whether
> a new loop_vinfo is cheaper than a previous one, with an in-built
> preference for the old one.  For variable VF it prefers the old
> loop_vinfo if it is cheaper for at least one VF.  However, we have
> no idea how likely that VF is in practice.
>
> Another extreme would be to do what most of the rest of the
> vectoriser does, and rely solely on the constant estimated VF.
> But as noted in the comment, this means that a one-unit cost
> difference would be enough to pick the new loop_vinfo,
> despite the target generally preferring the old loop_vinfo
> where possible.  The cost model just isn't accurate enough
> for that to produce good results as things stand: there might
> not be any practical benefit to the new loop_vinfo at the
> estimated VF, and it would be significantly worse for higher VFs.
>
> The patch instead goes for a hacky compromise: make sure that the new
> loop_vinfo is also no worse than the old loop_vinfo at double the
> estimated VF.  For all but trivial loops, this ensures that the
> new loop_vinfo is only chosen if it is better than the old one
> by a non-trivial amount at the estimated VF.  It also avoids
> putting too much faith in the VF estimate.
>
> I realise this isn't great, but it's supposed to be a conservative fix
> suitable for stage 4.  The only affected testcases are the ones for
> pr89007-*.c, where Advanced SIMD is indeed preferred for 128-bit SVE
> and is no worse for 256-bit SVE.
>
> Part of the problem here is that if the new loop_vinfo is better,
> we discard the old one and never consider using it even as an
> epilogue loop.  This means that if we choose Advanced SIMD over SVE,
> we're much more likely to have left-over scalar elements.
>
> Another is that the estimate provided by estimated_poly_value might have
> different probabilities attached.  E.g. when tuning for a particular core,
> the estimate is probably accurate, but when tuning for generic code,
> the estimate is more of a guess.  Relying solely on the estimate is
> probably correct for the former but not for the latter.
>
> Hopefully those are things that we could tackle in GCC 11.
>
> Tested on aarch64-linux-gnu, aarch64_be-elf and x86_64-linux-gnu.
> OK to install?

OK.

Richard.

> Richard
>
>
> 2020-04-17  Richard Sandiford  <richard.sandif...@arm.com>
>
> gcc/
>         * tree-vect-loop.c (vect_better_loop_vinfo_p): If old_loop_vinfo
>         has a variable VF, prefer new_loop_vinfo if it is cheaper for the
>         estimated VF and is no worse at double the estimated VF.
>
> gcc/testsuite/
>         * gcc.target/aarch64/sve/cost_model_8.c: New test.
>         * gcc.target/aarch64/sve/cost_model_9.c: Likewise.
>         * gcc.target/aarch64/sve/pr89007-1.c: Add -msve-vector-bits=512.
>         * gcc.target/aarch64/sve/pr89007-2.c: Likewise.
> ---
>  .../gcc.target/aarch64/sve/cost_model_8.c     | 12 +++++++
>  .../gcc.target/aarch64/sve/cost_model_9.c     | 13 ++++++++
>  .../gcc.target/aarch64/sve/pr89007-1.c        |  2 +-
>  .../gcc.target/aarch64/sve/pr89007-2.c        |  2 +-
>  gcc/tree-vect-loop.c                          | 31 ++++++++++++++++++-
>  5 files changed, 57 insertions(+), 3 deletions(-)
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
>  create mode 100644 gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
>
> diff --git a/gcc/tree-vect-loop.c b/gcc/tree-vect-loop.c
> index 265bcfdc5af..b6c3faeae51 100644
> --- a/gcc/tree-vect-loop.c
> +++ b/gcc/tree-vect-loop.c
> @@ -2414,7 +2414,36 @@ vect_better_loop_vinfo_p (loop_vec_info new_loop_vinfo,
>    poly_widest_int rel_old = (old_loop_vinfo->vec_inside_cost
>                              * poly_widest_int (new_vf));
>    if (maybe_lt (rel_old, rel_new))
> -    return false;
> +    {
> +      /* When old_loop_vinfo uses a variable vectorization factor,
> +        we know that it has a lower cost for at least one runtime VF.
> +        However, we don't know how likely that VF is.
> +
> +        One option would be to compare the costs for the estimated VFs.
> +        The problem is that that can put too much pressure on the cost
> +        model.  E.g. if the estimated VF is also the lowest possible VF,
> +        and if old_loop_vinfo is 1 unit worse than new_loop_vinfo
> +        for the estimated VF, we'd then choose new_loop_vinfo even
> +        though (a) new_loop_vinfo might not actually be better than
> +        old_loop_vinfo for that VF and (b) it would be significantly
> +        worse at larger VFs.
> +
> +        Here we go for a hacky compromise: pick new_loop_vinfo if it is
> +        no more expensive than old_loop_vinfo even after doubling the
> +        estimated old_loop_vinfo VF.  For all but trivial loops, this
> +        ensures that we only pick new_loop_vinfo if it is significantly
> +        better than old_loop_vinfo at the estimated VF.  */
> +      if (rel_new.is_constant ())
> +       return false;
> +
> +      HOST_WIDE_INT new_estimated_vf = estimated_poly_value (new_vf);
> +      HOST_WIDE_INT old_estimated_vf = estimated_poly_value (old_vf);
> +      widest_int estimated_rel_new = (new_loop_vinfo->vec_inside_cost
> +                                     * widest_int (old_estimated_vf));
> +      widest_int estimated_rel_old = (old_loop_vinfo->vec_inside_cost
> +                                     * widest_int (new_estimated_vf));
> +      return estimated_rel_new * 2 <= estimated_rel_old;
> +    }
>    if (known_lt (rel_new, rel_old))
>      return true;
>
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
> new file mode 100644
> index 00000000000..80c3a23e18a
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_8.c
> @@ -0,0 +1,12 @@
> +/* { dg-options "-O3 -msve-vector-bits=scalable" } */
> +
> +void
> +vset (int *restrict dst, int *restrict src, int count)
> +{
> +  for (int i = 0; i < count; ++i)
> +#pragma GCC unroll 4
> +    for (int j = 0; j < 4; ++j)
> +      *dst++ = 1;
> +}
> +
> +/* { dg-final { scan-assembler-times {\tst1w\tz} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
> new file mode 100644
> index 00000000000..e7a1bac3c83
> --- /dev/null
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/cost_model_9.c
> @@ -0,0 +1,13 @@
> +/* { dg-options "-O3 -msve-vector-bits=scalable" } */
> +
> +void
> +vset (int *restrict dst, int *restrict src, int count)
> +{
> +  for (int i = 0; i < count; ++i)
> +#pragma GCC unroll 8
> +    for (int j = 0; j < 8; ++j)
> +      *dst++ = 1;
> +}
> +
> +/* { dg-final { scan-assembler-not {\tst1w\tz} } } */
> +/* { dg-final { scan-assembler-times {\tstp\tq} 1 } } */
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> index af4aff4ec6d..ff9550c9109 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-1.c
> @@ -1,5 +1,5 @@
>  /* { dg-do assemble { target aarch64_asm_sve_ok } } */
> -/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
> +/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve 
> -msve-vector-bits=512 --save-temps" } */
>  /* { dg-final { check-function-bodies "**" "" } } */
>
>  #define N 1024
> diff --git a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c 
> b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> index 2ccdd0d353e..da345fe8bd6 100644
> --- a/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> +++ b/gcc/testsuite/gcc.target/aarch64/sve/pr89007-2.c
> @@ -1,5 +1,5 @@
>  /* { dg-do assemble { target aarch64_asm_sve_ok } } */
> -/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve --save-temps" } */
> +/* { dg-options "-O -ftree-vectorize -march=armv8.2-a+sve 
> -msve-vector-bits=512 --save-temps" } */
>  /* { dg-final { check-function-bodies "**" "" } } */
>
>  #define N 1024

Reply via email to