> On 28 Aug 2024, at 10:27, Tamar Christina <tamar.christ...@arm.com> wrote:
>
> External email: Use caution opening links or attachments
>
>
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Wednesday, August 28, 2024 8:55 AM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Jennifer Schmitz
>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
>> <ktkac...@exchange.nvidia.com>
>> Subject: Re: [RFC][PATCH] AArch64: Remove
>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>
>> Hi all,
>>
>> Thanks to Jennifer for proposing a patch and Tamar and Richard for digging
>> into it.
>>
>>> On 27 Aug 2024, at 13:16, Tamar Christina <tamar.christ...@arm.com> wrote:
>>>
>>> External email: Use caution opening links or attachments
>>>
>>>
>>>> -----Original Message-----
>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>> Sent: Tuesday, August 27, 2024 11:46 AM
>>>> To: Tamar Christina <tamar.christ...@arm.com>
>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo
>>>> Tkachov <ktkac...@exchange.nvidia.com>
>>>> Subject: Re: [RFC][PATCH] AArch64: Remove
>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>
>>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>>> Hi Jennifer,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>> Sent: Friday, August 23, 2024 1:07 PM
>>>>>> To: gcc-patches@gcc.gnu.org
>>>>>> Cc: Richard Sandiford <richard.sandif...@arm.com>; Kyrylo Tkachov
>>>>>> <ktkac...@exchange.nvidia.com>
>>>>>> Subject: [RFC][PATCH] AArch64: Remove
>>>>>> AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS
>>>>>>
>>>>>> This patch removes the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS tunable
>>>>>> and the use_new_vector_costs entry in aarch64-tuning-flags.def and makes
>>>>>> the AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS paths in the backend the
>>>>>> default.
>>>>
>>>> Thanks for doing this.  This has been on my TODO list ever since the
>>>> tunable was added.
>>>>
>>>> The history is that these "new" costs were originally added in stage 4
>>>> of GCC 11 for Neoverse V1.  Since the costs were added so late, it wasn't
>>>> appropriate to change the behaviour for any other core.  All the new code
>>>> was therefore gated on this option.
>>>>
>>>> The new costs do two main things:
>>>>
>>>> (1) use throughput-based calculations where available, including to choose
>>>>     between Advanced SIMD and SVE
>>>>
>>>> (2) try to make the latency-based costs more precise, by looking more
>>>>     closely at the provided stmt_info
>>>>
>>>> Old cost models won't be affected by (1) either way, since they don't
>>>> provide any throughput information.  But they should in principle benefit
>>>> from (2).  So...
>>>>
>>>>>> To that end, the function aarch64_use_new_vector_costs_p and its uses
>>>>>> were removed.  Additionally, guards were added to prevent null-pointer
>>>>>> dereferences of fields in cpu_vector_cost.
>>>>>>
>>>>>
>>>>> I'm not against this change, but it does mean that we now switch old
>>>>> Adv. SIMD cost models as well to the new throughput-based cost models.
>>>>> That means that -mcpu=generic now behaves differently, and so does
>>>>> -mcpu=neoverse-n1, and I think some distros explicitly use this (I
>>>>> believe yocto for instance does).
>>>>
>>>> ...it shouldn't mean that we start using throughput-based models for
>>>> cortexa53 etc., since there's no associated issue info.
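(For anyone reading along without the sources handy: the gating being removed
is essentially the predicate below, together with its uses.  This is
paraphrased rather than copied verbatim from aarch64.cc, so treat it as a
sketch of the pre-patch logic.)

  /* Pre-patch predicate: the "new" vector cost paths were only taken when
     the active tuning opted in via this extra tuning flag; the patch makes
     those paths unconditional.  */
  static bool
  aarch64_use_new_vector_costs_p ()
  {
    return (aarch64_tune_params.extra_tuning_flags
            & AARCH64_EXTRA_TUNE_USE_NEW_VECTOR_COSTS);
  }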
>>>
>>> Yes, I was using throughput-based model as a name.  But as you indicated
>>> in (2), it does change the latency calculation.
>>>
>>> My question was because of things in e.g. aarch64_adjust_stmt_cost and
>>> friends, e.g. aarch64_multiply_add_p changes the cost between FMA SIMD vs
>>> scalar.
>>>
>>> So my question..
>>>
>>>>> Have we validated that the old generic cost model still behaves
>>>>> sensibly with this change?
>>>
>>> is still valid I think; we *are* changing the cost for all models, and
>>> while they should indeed be more accurate, there could be knock-on
>>> effects.
>>>
>>
>> We can run SPEC on a Grace system with -mcpu=generic to see what the
>> effect is, but wider benchmarking would be more appropriate.  Can you help
>> with that, Tamar, once we agree on the other implementation details in
>> this patch?
>>
>
> Sure, that's not a problem.  Just ping me when you have a patch you want me
> to test :)
>
>>
>>> Thanks,
>>> Tamar
>>>
>>>>>
>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu:
>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp:
>>>>>> gather_load_extend_X.c where X is 1 to 4, strided_load_2.c,
>>>>>> strided_store_2.c) fail because of small differences in codegen that
>>>>>> make some of the scan-assembler-times tests fail.
>>>>>>
>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these
>>>>>> tests and add some
>>>>>
>>>>> I don't personally like unlimited here, as unlimited means just
>>>>> vectorize at any cost.  This means that costing between modes is also
>>>>> disabled.  A lot of these testcases are intended to test not just that
>>>>> we vectorize but that we vectorize with efficient code.
>>>>>
>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases.
>>>>
>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the
>>>> gather_load_extend_X.c tests, since we require the cost model to decide
>>>> when to use extending loads vs loading a full vector and unpacking.
>>
>> I had suggested using -fvect-cost-model=unlimited here because I thought
>> these tests wanted to test the capability of GCC to detect and generate
>> the particular SVE feature (gather load extends) for all supported data
>> types regardless of profitability.
>
> I think the problem is only specifically for the load_extend tests, because
> unlimited also disables SVE vs SVE comparisons wrt changes to VF, so not
> just Adv. SIMD vs SVE.  That is, while it would generate a gather, it may
> choose, instead of gathering from B -> D, to do B -> S and then extend the
> results, because this has a higher VF.
>
> Without the cross SVE mode comparisons it wouldn't know that the extensions
> would actually slow it down.
>
>> If the tests are intended to also make the right profitability decisions
>> for the generic tuning model then I agree using -fvect-cost-model=unlimited
>> is not appropriate here, though I do think that it'd be useful to fix the
>> backend vector cost model hooks to honor -fvect-cost-model=unlimited and
>> not limit generation of gather/scatter in that case.  What it should do
>> for the SVE vs Neon decisions is an open question.
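To make that concrete, the kind of per-test change Tamar is suggesting would
look something like the sketch below (the options line and loop are purely
illustrative; they are not the actual contents of gather_load_extend_1.c):

  /* { dg-do compile } */
  /* { dg-options "-O2 -ftree-vectorize -fvect-cost-model=dynamic" } */

  void
  f (long *restrict dst, signed char *restrict src, long *restrict indices,
     int n)
  {
    for (int i = 0; i < n; ++i)
      dst[i] = src[indices[i]];   /* byte gather, extended to 64 bits.  */
  }

Unlike -fvect-cost-model=unlimited, this keeps the cost model (and hence the
SVE-vs-SVE mode comparisons the scans depend on) active.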
>>
>>
>>>>
>>>> [...tries patch...]
>>>>
>>>> It seems like three main things are contributing to the difference:
>>>>
>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value
>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost
>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change)
>>>>
>>>> (1) is expected.
>>>>
>>>> (2) partly looks like a latent bug.  We're using the x32 cost for
>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads.
>>>>
>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info
>>>> *vinfo, vect_cost_for_stmt kind,
>>>>       && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>     {
>>>>       unsigned int nunits = vect_nunits_for_cost (vectype);
>>>> -     if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>> +     if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), aarch64_sve_vg))
>>>>        return { sve_costs->gather_load_x64_cost, nunits };
>>>>       return { sve_costs->gather_load_x32_cost, nunits };
>>>>     }
>>>>
>>>> fixes that.
>>
>> Would you be willing to test and push that to trunk to get it out of the
>> way?
>>
>>
>>>>
>>>> (3) is interesting.  generic_vector_cost isn't used by default for any
>>>> SVE CPU, or any -march=...+sve.  So the question is: should we treat it
>>>> as "architectural intent"/somewhat idealised?  Or should we try to make
>>>> it produce good code for existing SVE cores, in which case it would
>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a.
>>>>
>>>> If the former, we could set gather_load_x32_init_cost and
>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost
>>>> (and nothing else, so that generic_armv*_a are unaffected).
>>
>> I don't have strong opinions on this point, but if Tamar is concerned
>> about deviating too much from the known-good Advanced SIMD generic tuning
>> we have now, then we should aim for minimal codegen changes in that?
>
> No, I'm fine with Richard's suggestion.  The only way you'd be able to get
> this model to fire is with -march=xxx+sve -mtune=generic, at which point
> I'm fine with assuming gathers have no initial overhead.
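In code terms, Richard's suggestion amounts to something like the cut-down
sketch below.  The real generic_sve_vector_cost initialiser in the backend
has many more fields than this; only the two zeroed init costs reflect the
actual proposal, and every other name and value here is a placeholder.

  struct sve_gather_costs_sketch
  {
    unsigned int gather_load_x32_cost;      /* per-element, 32-bit elements */
    unsigned int gather_load_x64_cost;      /* per-element, 64-bit elements */
    unsigned int gather_load_x32_init_cost; /* one-off gather set-up cost */
    unsigned int gather_load_x64_init_cost; /* one-off gather set-up cost */
  };

  /* "Architectural intent" reading of the generic SVE tuning: no one-off
     gather set-up penalty.  Only generic_sve_vector_cost would change;
     generic_armv8_a and generic_armv9_a keep their current values.  */
  static const struct sve_gather_costs_sketch generic_sve_gather_sketch = {
    4,  /* gather_load_x32_cost (placeholder) */
    2,  /* gather_load_x64_cost (placeholder) */
    0,  /* gather_load_x32_init_cost */
    0,  /* gather_load_x64_init_cost */
  };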

As an aside, Jennifer pointed out to me that running the aarch64-sve.exp
tests ends up using “-march=armv8.2-a+sve -mtune=generic -moverride=tune=none”
for the tests, but aarch64-sve.exp doesn't add that -mtune and -moverride.
It seems that aarch64-sve-acle-asm.exp does add these flags (to avoid the
CSE_SVE_VL_CONSTANTS logic IIRC), but it seems odd to use them in the
top-level aarch64-sve.exp and I'm not sure by what DejaGNU mechanism they
end up propagating there.

Thanks,
Kyrill

>
> Cheers,
> Tamar
>
>> Thanks again,
>> Kyrill
>>
>>
>>>>
>>>> On the patch:
>>>>
>>>>> @@ -16733,7 +16723,8 @@ aarch64_in_loop_reduction_latency (vec_info
>>>>> *vinfo, stmt_vec_info stmt_info,
>>>>> {
>>>>>   const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>   const sve_vec_cost *sve_costs = nullptr;
>>>>> - if (vec_flags & VEC_ANY_SVE)
>>>>> + if (vec_costs->sve
>>>>> +     && vec_flags & VEC_ANY_SVE)
>>>>>     sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>
>>>> This doesn't seem necessary.  I agree we might as well reuse the
>>>> vec_costs local variable in the "if" body though.
>>>>
>>>>> [...]
>>>>> @@ -16794,9 +16785,10 @@ aarch64_detect_vector_stmt_subtype (vec_info
>>>>> *vinfo, vect_cost_for_stmt kind,
>>>>>                                     fractional_cost stmt_cost)
>>>>> {
>>>>>   const simd_vec_cost *simd_costs = aarch64_simd_vec_costs (vectype);
>>>>> + const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>   const sve_vec_cost *sve_costs = nullptr;
>>>>> - if (aarch64_sve_mode_p (TYPE_MODE (vectype)))
>>>>> -   sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>> + if (aarch64_sve_mode_p (TYPE_MODE (vectype)) && vec_costs->sve)
>>>>> +   sve_costs = vec_costs->sve;
>>>>
>>>> Similarly here, this doesn't seem necessary.
>>>>
>>>>> [...]
>>>>> @@ -17301,11 +17293,13 @@ aarch64_vector_costs::add_stmt_cost (int
>>>>> count, vect_cost_for_stmt kind,
>>>>>                           where, stmt_cost);
>>>>>
>>>>>   /* Check if we've seen an SVE gather/scatter operation and which
>>>>>      size.  */
>>>>> + const cpu_vector_cost *vec_costs = aarch64_tune_params.vec_costs;
>>>>>   if (kind == scalar_load
>>>>>       && aarch64_sve_mode_p (TYPE_MODE (vectype))
>>>>> -     && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER)
>>>>> +     && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == VMAT_GATHER_SCATTER
>>>>> +     && vec_costs->sve)
>>>>>     {
>>>>> -     const sve_vec_cost *sve_costs = aarch64_tune_params.vec_costs->sve;
>>>>> +     const sve_vec_cost *sve_costs = vec_costs->sve;
>>>>>       if (sve_costs)
>>>>>   {
>>>>>     if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64)
>>>>
>>>> Here too.
>>>>
>>>> Thanks,
>>>> Richard
>