Soumya AR <soum...@nvidia.com> writes: >> On 1 Jul 2025, at 9:22 PM, Kyrylo Tkachov <ktkac...@nvidia.com> wrote: >> >> >> >>> On 1 Jul 2025, at 17:36, Richard Sandiford <richard.sandif...@arm.com> >>> wrote: >>> >>> Soumya AR <soum...@nvidia.com> writes: >>>> From 2a2c3e3683aaf3041524df166fc6f8cf20895a0b Mon Sep 17 00:00:00 2001 >>>> From: Soumya AR <soum...@nvidia.com> >>>> Date: Mon, 30 Jun 2025 12:17:30 -0700 >>>> Subject: [PATCH] aarch64: Enable selective LDAPUR generation for cores with >>>> RCPC2 >>>> >>>> This patch adds the ability to fold the address computation into the >>>> addressing >>>> mode for LDAPR instructions using LDAPUR when RCPC2 is available. >>>> >>>> LDAPUR emission is controlled by the tune flag enable_ldapur, to enable it >>>> on a >>>> per-core basis. Earlier, the following code: >>>> >>>> uint64_t >>>> foo (std::atomic<uint64_t> *x) >>>> { >>>> return x[1].load(std::memory_order_acquire); >>>> } >>>> >>>> would generate: >>>> >>>> foo(std::atomic<unsigned long>*): >>>> add x0, x0, 8 >>>> ldapr x0, [x0] >>>> ret >>>> >>>> but now generates: >>>> >>>> foo(std::atomic<unsigned long>*): >>>> ldapur x0, [x0, 8] >>>> ret >>>> >>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>> regression. >>>> OK for mainline? >>>> >>>> Signed-off-by: Soumya AR <soum...@nvidia.com> >>>> >>>> gcc/ChangeLog: >>>> >>>> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION): >>>> Add the enable_ldapur flag to conwtrol LDAPUR emission. >>>> * config/aarch64/aarch64.h (TARGET_ENABLE_LDAPUR): Use new flag. >>>> * config/aarch64/aarch64.md (any): Add ldapur_enable attribute. >>>> * config/aarch64/atomics.md: (aarch64_atomic_load<mode>_rcpc): Modify >>>> to emit LDAPUR for cores with RCPC2 when enable_ldapur is set. >>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_zext): Likewise. >>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_sext): Modified to emit LDAPURS >>>> for addressing with offsets. >>>> >>>> gcc/testsuite/ChangeLog: >>>> >>>> * gcc.target/aarch64/ldapur.c: New test. >>> >>> Thanks for doing this. It generally looks good, but a couple of comments >>> below: >>> >>>> --- >>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 + >>>> gcc/config/aarch64/aarch64.h | 5 ++ >>>> gcc/config/aarch64/aarch64.md | 11 +++- >>>> gcc/config/aarch64/atomics.md | 22 +++++--- >>>> gcc/testsuite/gcc.target/aarch64/ldapur.c | 61 +++++++++++++++++++++ >>>> 5 files changed, 92 insertions(+), 9 deletions(-) >>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldapur.c >>>> >>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> index f2c916e9d77..5bf54165306 100644 >>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>> @@ -44,6 +44,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", >>>> AVOID_CROSS_LOOP_FMA) >>>> >>>> AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA) >>>> >>>> +AARCH64_EXTRA_TUNING_OPTION ("enable_ldapur", ENABLE_LDAPUR) >>>> + >>> >>> Let's see what others say, but personally, I think this would be better >>> as an opt-out, such as avoid_ldapur. The natural default seems to be to use >>> the extra addressing capacity when it's available and have CPUs explicitly >>> flag when they don't want that. >>> >>> A good, conservatively correct, default would probably be to add >>> avoid_ldapur >>> to every *current* CPU that includes rcpc2 and then separately remove it >>> from those that are known not to need it. In that sense, it's more work >>> for current CPUs than the current patch, but it should ease the impact >>> on future CPUs. >> >> LLVM used to do this folding by default everywhere until it was discovered >> that it hurts various CPUs. >> So they’ve taken the approach you describe, and disable the folding >> explicitly for: >> neoverse-v2 neoverse-v3 cortex-x3 cortex-x4 cortex-x925 >> I don’t know for sure if those are the only CPUs where this applies. >> They also disable the folding for generic tuning when -march is between >> armv8.4 - armv8.7/armv9.2. >> I guess we can do the same in GCC. > > Thanks for your suggestions, Richard and Kyrill. > > I've updated the patch to use avoid_ldapur. > > There's now an explicit override in aarch64_override_options_internal to use > avoid_ldapur for armv8.4 through armv8.7. > > I added it here because aarch64_adjust_generic_arch_tuning is only called for > generic_tunings and not generic_armv{8,9}_a_tunings. > > Let me know what you think.
Sounds good to me. I can see that we wouldn't want armv9.3-a+ generic tuning to be hampered by pre-armv9.3 cores. But I'm not sure that we're deliberately avoiding calling aarch64_adjust_generic_arch_tuning for generic_armv{8,9}_a_tunings. The current TARGET_SVE2 behaviour makes conceptual sense for generic_armv8_a_tunings too, if someone (unsually) used -march=armv8-a+sve2. The current TARGET_SVE2 behaviour would also be a nop for generic_armv9_a_tunings. It's probably more that the current SVE2 behaviour isn't particularly important for generic_armv{8,9}_a_tunings since the defaults are already correct for the common cases. So I think we should instead change it to: if (tune->tune == &generic_tunings || tune->tune == &generic_armv8_a_tunings || tune->tune == &generic_armv9_a_tunings) aarch64_adjust_generic_arch_tuning (aarch64_tune_params); and then add the LDAPUR override there, as you mentioned. Also: > + /* Avoid LDAPUR for generic tuning for armv8.4 through armv8.7/armv9.2. > */ > + if (AARCH64_HAVE_ISA(V8_4A) && !AARCH64_HAVE_ISA(V8_8A)) > + aarch64_tune_params.extra_tuning_flags |= > AARCH64_EXTRA_TUNE_AVOID_LDAPUR; it might be better to drop the Armv8.4-A minimum. If someone uses -march=armv8-a+rcpc2 then IMO we ought to avoid LDAPUR. > diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h > index e8bd8c73c12..c526594bd1b 100644 > --- a/gcc/config/aarch64/aarch64.h > +++ b/gcc/config/aarch64/aarch64.h > @@ -490,6 +490,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE > ATTRIBUTE_UNUSED > (bool (aarch64_tune_params.extra_tuning_flags \ > & AARCH64_EXTRA_TUNE_CHEAP_FPMR_WRITE)) > +/* Avoid folding address computation into LDAPUR for specifc cores. */ > +#define TARGET_AVOID_LDAPUR (TARGET_RCPC2 \ > + && (aarch64_tune_params.extra_tuning_flags \ > + & AARCH64_EXTRA_TUNE_AVOID_LDAPUR)) > + > /* Combinatorial tests. */ > #define TARGET_SVE2_OR_SME2 \ > diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md > index e11e13033d2..a9ea3cb0096 100644 > --- a/gcc/config/aarch64/aarch64.md > +++ b/gcc/config/aarch64/aarch64.md > @@ -482,6 +482,8 @@ > ;; clobber for SVE predicates. > (define_attr "pred_clobber" "any,no,yes" (const_string "any")) > > +(define_attr "avoid_ldapur" "any,no,yes" (const_string "any")) > + > ;; [For compatibility with Arm in pipeline models] > ;; Attribute that specifies whether or not the instruction touches fp > ;; registers. > @@ -506,7 +508,14 @@ > (eq_attr "pred_clobber" "yes") > (match_test "TARGET_SVE_PRED_CLOBBER")) > (eq_attr "pred_clobber" "any")) > - > + (ior > + (and > + (eq_attr "avoid_ldapur" "yes") > + (match_test "TARGET_AVOID_LDAPUR")) > + (and > + (eq_attr "avoid_ldapur" "no") > + (match_test "!TARGET_AVOID_LDAPUR")) > + (eq_attr "avoid_ldapur" "any")) > (ior > (eq_attr "arch" "any") Sorry for the runaround, but: when I mentioned avoid_ldapur last time, I only meant that for the tuning structure. The idea was to make it an opt-out rather than opt-in for core-specific tunings, so that we don't require every future core to explicitly say that it wants LDAPUR. I think we should stick to TARGET_ENABLE_LDAPUR for the macro and "enable_ldapur" for the md attribute. Thanks, Richard