> On 28 Aug 2024, at 11:04, Richard Sandiford <richard.sandif...@arm.com> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Tamar Christina <tamar.christ...@arm.com> writes:
>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>>>>>>> gather_load_extend_X.c
>>>>>>> where X is 1 to 4, strided_load_2.c, strided_store_2.c) fail because of 
>>>>>>> small
>>>>>>> differences
>>>>>>> in codegen that make some of the scan-assembler-times tests fail.
>>>>>>> 
>>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these 
>>>>>>> tests and
>>>>> add
>>>>>>> some
>>>>>> 
>>>>>> I don't personally like unlimited here as unlimited means just vectorize 
>>>>>> at any
>>>>>> cost.  This means that costing between modes are also disabled. A lot of 
>>>>>> these
>>>>>> testcases are intended to test not just that we vectorize but that we 
>>>>>> vectorize
>>>>>> with efficient code.
>>>>>> 
>>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>>>>> 
>>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the
>>>>> gather_load_extend_X.c tests, since we require the cost model to decide
>>>>> when to use extending loads vs loading a full vector and unpacking.
>>> 
>>> I had suggested using -fvect-cost-model=unlimited here because I thought 
>>> these
>>> tests wanted to test the capability of GCC to detect and generate the 
>>> particular SVE
>>> feature (gather load extends) for all supported data types regardless of
>>> profitability.
>> 
>> I think the problem only specifically for the load_extend tests, because 
>> unlimited also
>> disables SVE vs SVE comparisons wrt to changes to VF, so not just Adv. SIMD 
>> vs SVE.
>> That is while it would generate a gather, it may choose to instead of gather 
>> from
>> B -> D, do B -> S and then extend the results.  Because this has a higher VF.
>> 
>> Without the cross SVE mode comparisons it wouldn't know that the extensions 
>> would
>> actually slow it down.
> 
> Right.
> 
>>> If the tests are intended to also make the right profitability decisions 
>>> for the generic
>>> tuning model then I agree using -fvect-cost-model=unlimited is not 
>>> appropriate
>>> here, though I do think that it’d be useful to fix the backend vector cost 
>>> model
>>> hooks to honor -fvect-cost-model=unlimited and not limit generation of
>>> gather/scatter in that case. What it should do for the SVE vs Neon 
>>> decisions is an
>>> open question.
> 
> FTR, most of the tests are instead testing specific code generation
> strategies, regardless of whether those strategies are good for
> particular real cores.  So forcing the choice is ok in principle.
> It's just that -fvect-cost-model=unlimited wouldn't do that in this case.
> 
>>>>> [...tries patch...]
>>>>> 
>>>>> It seems like three main things are contributing to the difference:
>>>>> 
>>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value
>>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
>>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>>>>> 
>>>>> (1) is expected.
>>>>> 
>>>>> (2) partly looks like a latent bug.  We're using the x32 cost for
>>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>>>>> 
>>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info
>>>>> *vinfo, vect_cost_for_stmt kind,
>>>>>      && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) ==
>>>>> VMAT_GATHER_SCATTER)
>>>>>    {
>>>>>      unsigned int nunits = vect_nunits_for_cost (vectype);
>>>>> -      if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>>> +      if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)),
>>>>> aarch64_sve_vg))
>>>>>     return { sve_costs->gather_load_x64_cost, nunits };
>>>>>      return { sve_costs->gather_load_x32_cost, nunits };
>>>>>    }
>>>>> 
>>>>> fixes that.
>>> 
>>> Would you be willing to test and push that to trunk to get it out of the 
>>> way?
> 
> Sure.
> 
>>>>> (3) is interesting.  generic_vector_cost isn't used by default for any
>>>>> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
>>>>> as "architectural intent"/somewhat idealised?  Or should we try to make
>>>>> it produce good code for existing SVE cores, in which case it would
>>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>>> 
>>>>> 
>>>>> If the former, we could set gather_load_x32_init_cost and
>>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
>>>>> (and nothing else, so that generic_armv*_a are unaffected).
>>> 
>>> I don’t have strong opinions on this point but if Tamar is concerned about
>>> deviating too much from the known-good Advanced SIMD generic tuning we have
>>> now then we should aim for minimal codegen changes in that?
>> 
>> No I'm fine with Richard's suggestion.  The only way you'd be able to get 
>> this model
>> to fire is with -march=xxx=sve -mtune=generic, at which point I'm fine with 
>> assuming
>> gathers have no initial overhead.
> 
> OK, great.  Should I do this as well?  I suppose it's a separate patch
> from removing the "new cost model" flag.
> 

That would be great.
Thanks!
Kyrill

> Richard

Reply via email to