> On 10 Jul 2025, at 3:15 PM, Richard Sandiford <richard.sandif...@arm.com> > wrote: > > External email: Use caution opening links or attachments > > > Soumya AR <soum...@nvidia.com> writes: >>> On 1 Jul 2025, at 9:22 PM, Kyrylo Tkachov <ktkac...@nvidia.com> wrote: >>> >>> >>> >>>> On 1 Jul 2025, at 17:36, Richard Sandiford <richard.sandif...@arm.com> >>>> wrote: >>>> >>>> Soumya AR <soum...@nvidia.com> writes: >>>>> From 2a2c3e3683aaf3041524df166fc6f8cf20895a0b Mon Sep 17 00:00:00 2001 >>>>> From: Soumya AR <soum...@nvidia.com> >>>>> Date: Mon, 30 Jun 2025 12:17:30 -0700 >>>>> Subject: [PATCH] aarch64: Enable selective LDAPUR generation for cores >>>>> with >>>>> RCPC2 >>>>> >>>>> This patch adds the ability to fold the address computation into the >>>>> addressing >>>>> mode for LDAPR instructions using LDAPUR when RCPC2 is available. >>>>> >>>>> LDAPUR emission is controlled by the tune flag enable_ldapur, to enable >>>>> it on a >>>>> per-core basis. Earlier, the following code: >>>>> >>>>> uint64_t >>>>> foo (std::atomic<uint64_t> *x) >>>>> { >>>>> return x[1].load(std::memory_order_acquire); >>>>> } >>>>> >>>>> would generate: >>>>> >>>>> foo(std::atomic<unsigned long>*): >>>>> add x0, x0, 8 >>>>> ldapr x0, [x0] >>>>> ret >>>>> >>>>> but now generates: >>>>> >>>>> foo(std::atomic<unsigned long>*): >>>>> ldapur x0, [x0, 8] >>>>> ret >>>>> >>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no >>>>> regression. >>>>> OK for mainline? >>>>> >>>>> Signed-off-by: Soumya AR <soum...@nvidia.com> >>>>> >>>>> gcc/ChangeLog: >>>>> >>>>> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION): >>>>> Add the enable_ldapur flag to conwtrol LDAPUR emission. >>>>> * config/aarch64/aarch64.h (TARGET_ENABLE_LDAPUR): Use new flag. >>>>> * config/aarch64/aarch64.md (any): Add ldapur_enable attribute. >>>>> * config/aarch64/atomics.md: (aarch64_atomic_load<mode>_rcpc): Modify >>>>> to emit LDAPUR for cores with RCPC2 when enable_ldapur is set. >>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_zext): Likewise. >>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_sext): Modified to emit LDAPURS >>>>> for addressing with offsets. >>>>> >>>>> gcc/testsuite/ChangeLog: >>>>> >>>>> * gcc.target/aarch64/ldapur.c: New test. >>>> >>>> Thanks for doing this. It generally looks good, but a couple of comments >>>> below: >>>> >>>>> --- >>>>> gcc/config/aarch64/aarch64-tuning-flags.def | 2 + >>>>> gcc/config/aarch64/aarch64.h | 5 ++ >>>>> gcc/config/aarch64/aarch64.md | 11 +++- >>>>> gcc/config/aarch64/atomics.md | 22 +++++--- >>>>> gcc/testsuite/gcc.target/aarch64/ldapur.c | 61 +++++++++++++++++++++ >>>>> 5 files changed, 92 insertions(+), 9 deletions(-) >>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldapur.c >>>>> >>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>> index f2c916e9d77..5bf54165306 100644 >>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def >>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def >>>>> @@ -44,6 +44,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", >>>>> AVOID_CROSS_LOOP_FMA) >>>>> >>>>> AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA) >>>>> >>>>> +AARCH64_EXTRA_TUNING_OPTION ("enable_ldapur", ENABLE_LDAPUR) >>>>> + >>>> >>>> Let's see what others say, but personally, I think this would be better >>>> as an opt-out, such as avoid_ldapur. The natural default seems to be to >>>> use >>>> the extra addressing capacity when it's available and have CPUs explicitly >>>> flag when they don't want that. >>>> >>>> A good, conservatively correct, default would probably be to add >>>> avoid_ldapur >>>> to every *current* CPU that includes rcpc2 and then separately remove it >>>> from those that are known not to need it. In that sense, it's more work >>>> for current CPUs than the current patch, but it should ease the impact >>>> on future CPUs. >>> >>> LLVM used to do this folding by default everywhere until it was discovered >>> that it hurts various CPUs. >>> So they’ve taken the approach you describe, and disable the folding >>> explicitly for: >>> neoverse-v2 neoverse-v3 cortex-x3 cortex-x4 cortex-x925 >>> I don’t know for sure if those are the only CPUs where this applies. >>> They also disable the folding for generic tuning when -march is between >>> armv8.4 - armv8.7/armv9.2. >>> I guess we can do the same in GCC. >> >> Thanks for your suggestions, Richard and Kyrill. >> >> I've updated the patch to use avoid_ldapur. >> >> There's now an explicit override in aarch64_override_options_internal to use >> avoid_ldapur for armv8.4 through armv8.7. >> >> I added it here because aarch64_adjust_generic_arch_tuning is only called >> for generic_tunings and not generic_armv{8,9}_a_tunings. >> >> Let me know what you think. > > Sounds good to me. I can see that we wouldn't want armv9.3-a+ generic > tuning to be hampered by pre-armv9.3 cores. > > But I'm not sure that we're deliberately avoiding calling > aarch64_adjust_generic_arch_tuning for generic_armv{8,9}_a_tunings. > The current TARGET_SVE2 behaviour makes conceptual sense for > generic_armv8_a_tunings too, if someone (unsually) used > -march=armv8-a+sve2. The current TARGET_SVE2 behaviour would > also be a nop for generic_armv9_a_tunings. > > It's probably more that the current SVE2 behaviour isn't particularly > important for generic_armv{8,9}_a_tunings since the defaults are already > correct for the common cases. > > So I think we should instead change it to: > > if (tune->tune == &generic_tunings > || tune->tune == &generic_armv8_a_tunings > || tune->tune == &generic_armv9_a_tunings) > aarch64_adjust_generic_arch_tuning (aarch64_tune_params); > > and then add the LDAPUR override there, as you mentioned. Also: > >> + /* Avoid LDAPUR for generic tuning for armv8.4 through armv8.7/armv9.2. >> */ >> + if (AARCH64_HAVE_ISA(V8_4A) && !AARCH64_HAVE_ISA(V8_8A)) >> + aarch64_tune_params.extra_tuning_flags |= >> AARCH64_EXTRA_TUNE_AVOID_LDAPUR; > > it might be better to drop the Armv8.4-A minimum. If someone uses > -march=armv8-a+rcpc2 then IMO we ought to avoid LDAPUR. > >> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h >> index e8bd8c73c12..c526594bd1b 100644 >> --- a/gcc/config/aarch64/aarch64.h >> +++ b/gcc/config/aarch64/aarch64.h >> @@ -490,6 +490,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE >> ATTRIBUTE_UNUSED >> (bool (aarch64_tune_params.extra_tuning_flags \ >> & AARCH64_EXTRA_TUNE_CHEAP_FPMR_WRITE)) > >> +/* Avoid folding address computation into LDAPUR for specifc cores. */ >> +#define TARGET_AVOID_LDAPUR (TARGET_RCPC2 \ >> + && (aarch64_tune_params.extra_tuning_flags \ >> + & AARCH64_EXTRA_TUNE_AVOID_LDAPUR)) >> + >> /* Combinatorial tests. */ > >> #define TARGET_SVE2_OR_SME2 \ >> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md >> index e11e13033d2..a9ea3cb0096 100644 >> --- a/gcc/config/aarch64/aarch64.md >> +++ b/gcc/config/aarch64/aarch64.md >> @@ -482,6 +482,8 @@ >> ;; clobber for SVE predicates. >> (define_attr "pred_clobber" "any,no,yes" (const_string "any")) >> >> +(define_attr "avoid_ldapur" "any,no,yes" (const_string "any")) >> + >> ;; [For compatibility with Arm in pipeline models] >> ;; Attribute that specifies whether or not the instruction touches fp >> ;; registers. >> @@ -506,7 +508,14 @@ >> (eq_attr "pred_clobber" "yes") >> (match_test "TARGET_SVE_PRED_CLOBBER")) >> (eq_attr "pred_clobber" "any")) >> - >> + (ior >> + (and >> + (eq_attr "avoid_ldapur" "yes") >> + (match_test "TARGET_AVOID_LDAPUR")) >> + (and >> + (eq_attr "avoid_ldapur" "no") >> + (match_test "!TARGET_AVOID_LDAPUR")) >> + (eq_attr "avoid_ldapur" "any")) >> (ior >> (eq_attr "arch" "any") > > Sorry for the runaround, but: when I mentioned avoid_ldapur last time, > I only meant that for the tuning structure. The idea was to make it > an opt-out rather than opt-in for core-specific tunings, so that we > don't require every future core to explicitly say that it wants LDAPUR. > > I think we should stick to TARGET_ENABLE_LDAPUR for the macro and > "enable_ldapur" for the md attribute. >
Hi Richard, Thanks for the suggestions. Before I respin the patch, could you please clarify if you mean: a) have the macro as TARGET_ENABLE_LDAPUR and the md attribute as enable_ldapur and disable selective cores/arches using !TARGET_ENABLE_LDAPUR or b) have enable_ldapur for the md attribute and TARGET_AVOID_LDAPUR as the macro and handle the condition accordingly in the md files? Sorry, just wanted to confirm because the wording from your earlier review is confusing me: >>>> Let's see what others say, but personally, I think this would be better >>>> as an opt-out, such as avoid_ldapur. The natural default seems to be to >>>> use >>>> the extra addressing capacity when it's available and have CPUs explicitly >>>> flag when they don't want that. Thanks, Soumya > Thanks, > Richard