> On 28 Aug 2024, at 11:04, Richard Sandiford <richard.sandif...@arm.com> wrote: > > External email: Use caution opening links or attachments > > > Tamar Christina <tamar.christ...@arm.com> writes: >>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu: >>>>>>> No problems bootstrapping, but several test files (in aarch64-sve.exp: >>>>>>> gather_load_extend_X.c >>>>>>> where X is 1 to 4, strided_load_2.c, strided_store_2.c) fail because of >>>>>>> small >>>>>>> differences >>>>>>> in codegen that make some of the scan-assembler-times tests fail. >>>>>>> >>>>>>> Kyrill suggested to add a -fvect-cost-model=unlimited flag to these >>>>>>> tests and >>>>> add >>>>>>> some >>>>>> >>>>>> I don't personally like unlimited here as unlimited means just vectorize >>>>>> at any >>>>>> cost. This means that costing between modes are also disabled. A lot of >>>>>> these >>>>>> testcases are intended to test not just that we vectorize but that we >>>>>> vectorize >>>>>> with efficient code. >>>>>> >>>>>> I'd prefer to use -fvect-cost-model=dynamic if that fixes the testcases. >>>>> >>>>> Yeah, I don't think -fvect-cost-model=unlimited would work for the >>>>> gather_load_extend_X.c tests, since we require the cost model to decide >>>>> when to use extending loads vs loading a full vector and unpacking. >>> >>> I had suggested using -fvect-cost-model=unlimited here because I thought >>> these >>> tests wanted to test the capability of GCC to detect and generate the >>> particular SVE >>> feature (gather load extends) for all supported data types regardless of >>> profitability. >> >> I think the problem only specifically for the load_extend tests, because >> unlimited also >> disables SVE vs SVE comparisons wrt to changes to VF, so not just Adv. SIMD >> vs SVE. >> That is while it would generate a gather, it may choose to instead of gather >> from >> B -> D, do B -> S and then extend the results. Because this has a higher VF. >> >> Without the cross SVE mode comparisons it wouldn't know that the extensions >> would >> actually slow it down. > > Right. > >>> If the tests are intended to also make the right profitability decisions >>> for the generic >>> tuning model then I agree using -fvect-cost-model=unlimited is not >>> appropriate >>> here, though I do think that it’d be useful to fix the backend vector cost >>> model >>> hooks to honor -fvect-cost-model=unlimited and not limit generation of >>> gather/scatter in that case. What it should do for the SVE vs Neon >>> decisions is an >>> open question. > > FTR, most of the tests are instead testing specific code generation > strategies, regardless of whether those strategies are good for > particular real cores. So forcing the choice is ok in principle. > It's just that -fvect-cost-model=unlimited wouldn't do that in this case. > >>>>> [...tries patch...] >>>>> >>>>> It seems like three main things are contributing to the difference: >>>>> >>>>> 1. we now cost 0 for a scalar prologue extension from a loaded value >>>>> 2. we now cost gathers & scatters based on gather_load_x32/x64_cost >>>>> 3. we now apply a large one-off cost for gathers (Tamar's recent change) >>>>> >>>>> (1) is expected. >>>>> >>>>> (2) partly looks like a latent bug. We're using the x32 cost for >>>>> VNx2QI->VNx2SI, even though those are really .B->.D loads. >>>>> >>>>> @@ -16819,7 +16811,7 @@ aarch64_detect_vector_stmt_subtype (vec_info >>>>> *vinfo, vect_cost_for_stmt kind, >>>>> && STMT_VINFO_MEMORY_ACCESS_TYPE (stmt_info) == >>>>> VMAT_GATHER_SCATTER) >>>>> { >>>>> unsigned int nunits = vect_nunits_for_cost (vectype); >>>>> - if (GET_MODE_UNIT_BITSIZE (TYPE_MODE (vectype)) == 64) >>>>> + if (known_eq (GET_MODE_NUNITS (TYPE_MODE (vectype)), >>>>> aarch64_sve_vg)) >>>>> return { sve_costs->gather_load_x64_cost, nunits }; >>>>> return { sve_costs->gather_load_x32_cost, nunits }; >>>>> } >>>>> >>>>> fixes that. >>> >>> Would you be willing to test and push that to trunk to get it out of the >>> way? > > Sure. > >>>>> (3) is interesting. generic_vector_cost isn't used by default for any >>>>> SVE CPU, or any -march=...+sve. So the question is: should we treat it >>>>> as "architectural intent"/somewhat idealised? Or should we try to make >>>>> it produce good code for existing SVE cores, in which case it would >>>>> overlap quite a lot with generic_armv8_a and generic_armv9_a. >>> >>>>> >>>>> If the former, we could set gather_load_x32_init_cost and >>>>> gather_load_x64_init_cost to 0 for generic_sve_vector_cost >>>>> (and nothing else, so that generic_armv*_a are unaffected). >>> >>> I don’t have strong opinions on this point but if Tamar is concerned about >>> deviating too much from the known-good Advanced SIMD generic tuning we have >>> now then we should aim for minimal codegen changes in that? >> >> No I'm fine with Richard's suggestion. The only way you'd be able to get >> this model >> to fire is with -march=xxx=sve -mtune=generic, at which point I'm fine with >> assuming >> gathers have no initial overhead. > > OK, great. Should I do this as well? I suppose it's a separate patch > from removing the "new cost model" flag. >
That would be great. Thanks! Kyrill > Richard