Tamar Christina <tamar.christ...@arm.com> writes:
> Hi Jennifer,
>
>> -----Original Message-----
>> From: Jennifer Schmitz <jschm...@nvidia.com>
>> Sent: Friday, August 23, 2024 1:07 PM
>> To: gcc-patches@gcc.gnu.org
>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov
>> <ktkac...@exchange.nvidia.com>
>> Subject: [RFC][PATCH] AArch64: Remove
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>
>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable
>> and use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the default.

Thanks for doing this.  This has been on my TODO list ever since the
tunable was added.

The history is that these "new" costs were originally added in stage 4
of GCC 11 for Neoverse V1.  Since the costs were added so late, it wasn't
appropriate to change the behaviour for any other core.  All the new code
was therefore gated on this option.

The new costs do two main things:

(1) use throughput-based calculations where available, including to choose
    between Advanced SIMD and SVE

(2) try to make the latency-based costs more precise, by looking more closely
    at the provided stmt_info

Old cost models won't be affected by (1) either way, since they don't
provide any throughput information.  But they should in principle benefit
from (2).  So...

>> To that end, the function aarch64_use_new_vector_costs_p and its uses were
>> removed. Additionally, guards were added to prevent null-pointer dereferences of
>> fields in cpu_vector_cost.
>>
>
> I'm not against this change, but it does mean that we now switch the old
> Adv. SIMD cost models to the new throughput-based cost models as well.  That
> means that -mcpu=generic and -mcpu=neoverse-n1 now behave differently, and I
> think some distros explicitly use these (I believe yocto, for instance, does).

...it shouldn't mean that we start using throughput-based models for
cortexa53 etc., since there's no associated issue info.
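
To make that gating concrete, a rough sketch (not the exact code) of the
throughput path's entry condition:

  /* Tuning targets without issue information leave issue_info null, so
     they only get the latency refinements in (2), never the throughput
     modelling in (1).  */
  const aarch64_vec_issue_info *issue_info
    = aarch64_tune_params.vec_costs->issue_info;
  if (issue_info)
    /* ...throughput-based modelling...  */;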

> Have we validated that the old generic cost model still behaves sensibly with 
> this change?
>
>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>> gather_load_extend_X.c, where X is 1 to 4, strided_load_2.c,
>> strided_store_2.c) fail because of small differences in codegen that make
>> some of the scan-assembler-times tests fail.
>>
>> Kyrill suggested adding a -fvect-cost-model=unlimited flag to these tests
>> and adding some
>
> I don't personally like unlimited here, as unlimited means just vectorize at
> any cost.  This means that costing between modes is also disabled.  A lot of
> these testcases are intended to test not just that we vectorize but that we
> vectorize with efficient code.
>
> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.

Yeah, I don't think -fvect-cost-model=unlimited would work for the
gather_load_extend_X.c tests, since we require the cost model to decide
when to use extending loads vs loading a full vector and unpacking.
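
(As an illustration, not the actual testcase, consider:

  #include <stdint.h>

  void
  f (int32_t *restrict dst, int8_t *restrict src, int n)
  {
    for (int i = 0; i < n; ++i)
      dst[i] += src[i];
  }

Here the cost model chooses between an extending load such as LD1SB into
.S elements and an LD1B of a full byte vector followed by
SUNPKLO/SUNPKHI; -fvect-cost-model=unlimited would take that comparison
out of the picture.)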

[...tries patch...]

It seems like three main things are contributing to the difference:

1. we now cost 0 for a scalar prologue extension from a loaded value
2. we now cost gathers & scatters based on gather_load_x32/x64_cost
3. we now apply a large one-off cost for gathers (Tamar's recent change)

(1) is expected.

(2) partly looks like a latent bug.  We're using the x32 cost for
VNx2QI->VNx2SI, even though those are really .B->.D loads.

@@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
     {
       unsigned int nunits = vect_nunits_for_cost (vectype);
-      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
+      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
        return { sve_costs->gather_load_x64_cost, nunits };
       return { sve_costs->gather_load_x32_cost, nunits };
     }

fixes that.
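
To spell out the arithmetic, taking the minimum 128-bit vector length as
the example:

  /* VNx2QI has 2 QI elements, one per 64-bit container, i.e. a .D
     access.  GET_MODE_UNIT_BITSIZE (VNx2QImode) == 8, so the old
     "== 64" test sent it to the x32 bucket.  GET_MODE_NUNITS
     (VNx2QImode) is known_eq to aarch64_sve_vg (two elements per
     128 bits), so the new test picks the x64 cost.  */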

(3) is interesting.  generic_vector_cost isn't used by default for any
SVE CPU, or any -march=...+sve.  So the question is: should we treat it
as "architectural intent"/somewhat idealised?  Or should we try to make
it produce good code for existing SVE cores, in which case it would
overlap quite a lot with generic_armv8_a and generic_armv9_a.

If the former, we could set gather_load_x32_init_cost and
gather_load_x64_init_cost to 0 for generic_sve_vector_cost
(and nothing else, so that generic_armv*_a are unaffected).
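
Concretely, that would just be the following in generic_sve_vector_cost's
initialiser (sketch only; surrounding fields elided):

  ...
  0, /* gather_load_x32_init_cost  */
  0, /* gather_load_x64_init_cost  */
  ...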

On the patch:

> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
>  {
>    const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>    const sve_vec_cost *sve_costs = nullptr;
> -  if (vec_flags & VEC_ANY_SVE)
> +  if (vec_costs->sve
> +      && vec_flags & VEC_ANY_SVE)
>      sve_costs = aarch64_tune_params.vec_costs->sve;

This doesn't seem necessary.  I agree we might as well reuse the
vec_costs local variable in the "if" body though.
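
I.e. something like:

  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
  const sve_vec_cost *sve_costs = nullptr;
  if (vec_flags & VEC_ANY_SVE)
    sve_costs = vec_costs->sve;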
 
> [...]
> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>                                   fractional_cost stmt_cost)
>  {
>    const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
> +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>    const sve_vec_cost *sve_costs = nullptr;
> -  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
> -    sve_costs = aarch64_tune_params.vec_costs->sve;
> +  if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
> +    sve_costs = vec_costs->sve;

Similarly here, this doesn't seem necessary.
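
(Same shape as above: keep the vec_costs variable but drop the null check:

  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
    sve_costs = vec_costs->sve;
)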
 
> [...]
> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>                                                       where, stmt_cost);
>
>        /* Check if we've seen an SVE gather/scatter operation and which size.  */
> +      const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>        if (kind == scalar_load
>         && aarch64_sve_mode_p (TYPE_MODE (vectype))
> -       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
> +       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
> +       && vec_costs->sve)
>       {
> -       const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
> +       const sve_vec_cost *sve_costs = vec_costs->sve;
>         if (sve_costs)
>           {
>             if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)

Here too.

Thanks,
Richard
