> On 28 Aug 2024, at 10:27, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Wednesday, August 28, 2024 8:55 AM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Jennifer Schmitz
>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
>> <ktkac...@exchange.nvidia.com>
>> Subject: Re: [RFC][PATCH] AArch64: Remove
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>> 
>> Hi all,
>> 
>> Thanks to Jennifer for proposing a patch and Tamar and Richard for digging 
>> into it.
>> 
>>> On 27 Aug 2024, at 13:16, Tamar Christina <tamar.christ...@arm.com> wrote:
>>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>> Sent: Tuesday, August 27, 2024 11:46 AM
>>>> To: Tamar Christina <tamar.christ...@arm.com>
>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo
>>>> Tkachov <ktkac...@exchange.nvidia.com>
>>>> Subject: Re: [RFC][PATCH] AArch64: Remove
>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>> 
>>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>>> Hi Jennifer,
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>> Sent: Friday, August 23, 2024 1:07 PM
>>>>>> To: gcc-patches@gcc.gnu.org
>>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov
>>>>>> <ktkac...@exchange.nvidia.com>
>>>>>> Subject: [RFC][PATCH] AArch64: Remove
>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>> 
>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable and
>>>>>> use_new_vector_costs entry in aarch64-tuning-flags.def and makes the
>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the default.
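
(For readers without the sources to hand: the def-file entry being removed looks roughly like this; I'm quoting it from memory, so treat it as approximate.)

  /* aarch64-tuning-flags.def (approximate) */
  AARCH64_EXTRA_TUNING_OPTION ("use_new_vector_costs", USE_NEW_VECTOR_COSTS)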
>>>> 
>>>> Thanks for doing this.  This has been on my TODO list ever since the
>>>> tunable was added.
>>>> 
>>>> The history is that these "new" costs were originally added in stage 4
>>>> of GCC 11 for Neoverse V1.  Since the costs were added so late, it wasn't
>>>> appropriate to change the behaviour for any other core.  All the new code
>>>> was therefore gated on this option.
>>>> 
>>>> The new costs do two main things:
>>>> 
>>>> (1) use throughput-based calculations where available, including to choose
>>>>   between Advanced SIMD and SVE
>>>> 
>>>> (2) try to make the latency-based costs more precise, by looking more closely
>>>>    at the provided stmt_info
>>>> 
>>>> Old cost models won't be affected by (1) either way, since they don't
>>>> provide any throughput information.  But they should in principle benefit
>>>> from (2).  So...
>>>> 
>>>>>> To that end, the function aarch64_use_new_vector_costs_p and its uses were
>>>>>> removed. Additionally, guards were added to prevent null-pointer dereferences
>>>>>> of fields in cpu_vector_cost.
>>>>>> 
>>>>> 
>>>>> I'm not against this change, but it does mean that we now switch old Adv. SIMD
>>>>> cost models as well to the new throughput-based cost models.  That means that
>>>>> -mcpu=generic now behaves differently, as does -mcpu=neoverse-n1, and I think
>>>>> some distros explicitly use these (I believe Yocto does, for instance).
>>>> 
>>>> ...it shouldn't mean that we start using throughput-based models for
>>>> cortexa53 etc., since there's no associated issue info.
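
(As an illustration of where (1) matters: a simple loop like the one below, built for a core whose tuning provides SVE issue information, is the kind of case where the throughput-based comparison between Advanced SIMD and SVE code generation comes into play.  The function and command line are examples of mine, not taken from any testcase.)

  void saxpy (float *restrict y, const float *restrict x, float a, int n)
  {
    for (int i = 0; i < n; i++)
      y[i] += a * x[i];
  }
  /* e.g. gcc -O3 -mcpu=neoverse-v1 -S saxpy.c */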
>>> 
>>> Yes, I was using "throughput-based model" as a name.  But as you indicated in (2),
>>> it does change the latency calculation.
>>> 
>>> My question was because of things in e.g. aarch64_adjust_stmt_cost and friends,
>>> e.g. aarch64_multiply_add_p changing the cost between FMA SIMD vs scalar.
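
(For concreteness, this is the shape of code aarch64_multiply_add_p is looking at; the loop is purely illustrative rather than from the testsuite: a multiply feeding an add, which the hook can treat as a single fused multiply-add when costing vector vs scalar.)

  void fma_loop (double *restrict a, const double *restrict b,
                 const double *restrict c, int n)
  {
    for (int i = 0; i < n; i++)
      a[i] += b[i] * c[i];   /* multiply + add, candidate for FMA costing */
  }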
>>> 
>>> So my question..
>>> 
>>>> 
>>>>> Have we validated that the old generic cost model still behaves sensibly with
>>>>> this change?
>>> 
>>> is still valid, I think: we *are* changing the cost for all models,
>>> and while they should indeed be more accurate, there could be knock-on effects.
>>> 
>> 
>> We can run SPEC on a Grace system with -mcpu=generic to see what the effect is,
>> but wider benchmarking would be more appropriate. Can you help with that, Tamar,
>> once we agree on the other implementation details in this patch?
>> 
> 
> Sure, that's not a problem.  Just ping me when you have a patch you want me to test :)
> 
>> 
>>> Thanks,
>>> Tamar
>>> 
>>>>> 
>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>>>>>> gather_load_extend_X.c, where X is 1 to 4, strided_load_2.c, strided_store_2.c)
>>>>>> fail because of small differences in codegen that make some of the
>>>>>> scan-assembler-times tests fail.
>>>>>> 
>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these tests and
>>>>>> add some
>>>>> 
>>>>> I don't personally like unlimited here, as unlimited means just vectorize at any
>>>>> cost.  This means that costing between modes is also disabled.  A lot of these
>>>>> testcases are intended to test not just that we vectorize but that we vectorize
>>>>> with efficient code.
>>>>> 
>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>>>> 
>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the
>>>> gather_load_extend_X.c tests, since we require the cost model to decide
>>>> when to use extending loads vs loading a full vector and unpacking.
>> 
>> I had suggested using -fvect-cost-model=unlimited here because I thought these
>> tests wanted to test the capability of GCC to detect and generate the particular
>> SVE feature (gather load extends) for all supported data types regardless of
>> profitability.
> 
> I think the problem is specific to the load_extend tests, because unlimited also
> disables SVE vs SVE comparisons with respect to changes in VF, not just Adv. SIMD
> vs SVE.
> That is, while it would still generate a gather, it may choose, instead of gathering
> from B -> D, to do B -> S and then extend the results, because this has a higher VF.
> 
> Without the cross-SVE-mode comparisons it wouldn't know that the extensions would
> actually slow it down.
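
(To make the shape of the code concrete, here is an illustrative loop of the kind those tests cover, not the actual testcase source: a byte gather whose result is widened to 64 bits.  Without the cross-mode comparison, the vectoriser could prefer a .B->.S gather plus separate extends, because of the higher VF, over the .B->.D extending gather the test expects.)

  void f (unsigned long *restrict dst, const unsigned char *restrict src,
          const int *restrict idx, int n)
  {
    for (int i = 0; i < n; i++)
      dst[i] = src[idx[i]];   /* gather of bytes, widened to 64-bit */
  }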
> 
>> If the tests are intended to also make the right profitability decisions for the
>> generic tuning model, then I agree using -fvect-cost-model=unlimited is not
>> appropriate here, though I do think that it’d be useful to fix the backend vector
>> cost model hooks to honor -fvect-cost-model=unlimited and not limit generation of
>> gather/scatter in that case. What it should do for the SVE vs Neon decisions is an
>> open question.
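
(If we go with Tamar's preference, the test-side change would be along these lines.  The directives are the standard DejaGNU/GCC-testsuite ones, but the particular scan pattern and count below are only a sketch, not the actual patch.)

  /* { dg-additional-options "-fvect-cost-model=dynamic" } */
  ...
  /* { dg-final { scan-assembler-times {\tld1b\tz[0-9]+\.d} 2 } } */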
>> 
>> 
>>>> 
>>>> [...tries patch...]
>>>> 
>>>> It seems like three main things are contributing to the difference:
>>>> 
>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value
>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>>>> 
>>>> (1) is expected.
>>>> 
>>>> (2) partly looks like a latent bug.  We're using the x32 cost for
>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>>>> 
>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>>>      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>    {
>>>>      unsigned int nunits = vect_nunits_for_cost (vectype);
>>>> -      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>> +      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
>>>>     return { sve_costs->gather_load_x64_cost, nunits };
>>>>      return { sve_costs->gather_load_x32_cost, nunits };
>>>>    }
>>>> 
>>>> fixes that.
>> 
>> Would you be willing to test and push that to trunk to get it out of the way?
>> 
>> 
>>>> 
>>>> (3) is interesting.  generic_vector_cost isn't used by default for any
>>>> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
>>>> as "architectural intent"/somewhat idealised?  Or should we try to make
>>>> it produce good code for existing SVE cores, in which case it would
>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>> 
>>>> 
>>>> If the former, we could set gather_load_x32_init_cost and
>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
>>>> (and nothing else, so that generic_armv*_a are unaffected).
>> 
>> I don’t have strong opinions on this point, but if Tamar is concerned about
>> deviating too much from the known-good Advanced SIMD generic tuning we have
>> now, then we should aim for minimal codegen changes there?
> 
> No, I'm fine with Richard's suggestion.  The only way you'd be able to get this
> model to fire is with -march=xxx+sve -mtune=generic, at which point I'm fine with
> assuming gathers have no initial overhead.

As an aside, Jennifer pointed out to me that running the aarch64-sve.exp tests
ends up using “-march=armv8.2-a+sve -mtune=generic -moverride=tune=none” for
the tests, but aarch64-sve.exp doesn't add that -mtune and -moverride. It seems
that aarch64-sve-acle-asm.exp does add these flags (to avoid the
CSE_SVE_VL_CONSTANTS logic IIRC), but it seems odd to use them in the top-level
aarch64-sve.exp and I'm not sure by what DejaGNU mechanism they end up
propagating there.
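
(For anyone wanting to reproduce this: the options that actually get used show up in the compile lines in gcc.log.  One common way to run just one of these tests, with the test name here only as an example:

  make check-gcc RUNTESTFLAGS="aarch64-sve.exp=gather_load_extend_1.c"
)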

Thanks,
Kyrill

> 
> Cheers,
> Tamar
> 
>> Thanks again,
>> Kyrill
>> 
>> 
>>>> 
>>>> On the patch:
>>>> 
>>>>> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info *vinfo, stmt_vec_info stmt_info,
>>>>> {
>>>>>  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>  const sve_vec_cost *sve_costs = nullptr;
>>>>> -  if (vec_flags & VEC_ANY_SVE)
>>>>> +  if (vec_costs->sve
>>>>> +      && vec_flags & VEC_ANY_SVE)
>>>>>    sve_costs = aarch64_tune_params.vec_costs->sve;
>>>> 
>>>> This doesn't seem necessary.  I agree we might as well reuse the
>>>> vec_costs local variable in the "if" body though.
>>>> 
>>>>> [...]
>>>>> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info *vinfo, vect_cost_for_stmt kind,
>>>>>                               fractional_cost stmt_cost)
>>>>> {
>>>>>  const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
>>>>> +  const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>  const sve_vec_cost *sve_costs = nullptr;
>>>>> -  if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
>>>>> -    sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>> +  if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
>>>>> +    sve_costs = vec_costs->sve;
>>>> 
>>>> Similarly here, this doesn't seem necessary.
>>>> 
>>>>> [...]
>>>>> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int count, vect_cost_for_stmt kind,
>>>>>                                                   where, stmt_cost);
>>>> 
>>>>>      /* Check if we've seen an SVE gather/scatter operation and which size.  */
>>>>> +      const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>      if (kind == scalar_load
>>>>>     && aarch64_sve_mode_p (TYPE_MODE (vectype))
>>>>> -     && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>> +     && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
>>>>> +     && vec_costs->sve)
>>>>>   {
>>>>> -     const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>> +     const sve_vec_cost *sve_costs = vec_costs->sve;
>>>>>     if (sve_costs)
>>>>>       {
>>>>>         if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>> 
>>>> Here too.
>>>> 
>>>> Thanks,
>>>> Richard
> 
