Re: [PATCH] aarch64: Enable selective LDAPUR generation for cores with RCPC2

Soumya AR Thu, 10 Jul 2025 04:24:42 -0700


> On 10 Jul 2025, at 3:15 PM, Richard Sandiford <richard.sandif...@arm.com> 
> wrote:
> 
> External email: Use caution opening links or attachments
> 
> 
> Soumya AR <soum...@nvidia.com> writes:
>>> On 1 Jul 2025, at 9:22 PM, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>> 
>>> 
>>> 
>>>> On 1 Jul 2025, at 17:36, Richard Sandiford <richard.sandif...@arm.com> 
>>>> wrote:
>>>> 
>>>> Soumya AR <soum...@nvidia.com> writes:
>>>>> From 2a2c3e3683aaf3041524df166fc6f8cf20895a0b Mon Sep 17 00:00:00 2001
>>>>> From: Soumya AR <soum...@nvidia.com>
>>>>> Date: Mon, 30 Jun 2025 12:17:30 -0700
>>>>> Subject: [PATCH] aarch64: Enable selective LDAPUR generation for cores 
>>>>> with
>>>>> RCPC2
>>>>> 
>>>>> This patch adds the ability to fold the address computation into the 
>>>>> addressing
>>>>> mode for LDAPR instructions using LDAPUR when RCPC2 is available.
>>>>> 
>>>>> LDAPUR emission is controlled by the tune flag enable_ldapur, to enable 
>>>>> it on a
>>>>> per-core basis. Earlier, the following code:
>>>>> 
>>>>> uint64_t
>>>>> foo (std::atomic<uint64_t> *x)
>>>>> {
>>>>> return x[1].load(std::memory_order_acquire);
>>>>> }
>>>>> 
>>>>> would generate:
>>>>> 
>>>>> foo(std::atomic<unsigned long>*):
>>>>> add     x0, x0, 8
>>>>> ldapr   x0, [x0]
>>>>> ret
>>>>> 
>>>>> but now generates:
>>>>> 
>>>>> foo(std::atomic<unsigned long>*):
>>>>> ldapur  x0, [x0, 8]
>>>>> ret
>>>>> 
>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>>> regression.
>>>>> OK for mainline?
>>>>> 
>>>>> Signed-off-by: Soumya AR <soum...@nvidia.com>
>>>>> 
>>>>> gcc/ChangeLog:
>>>>> 
>>>>> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
>>>>> Add the enable_ldapur flag to conwtrol LDAPUR emission.
>>>>> * config/aarch64/aarch64.h (TARGET_ENABLE_LDAPUR): Use new flag.
>>>>> * config/aarch64/aarch64.md (any): Add ldapur_enable attribute.
>>>>> * config/aarch64/atomics.md: (aarch64_atomic_load<mode>_rcpc): Modify
>>>>> to emit LDAPUR for cores with RCPC2 when enable_ldapur is set.
>>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_zext): Likewise.
>>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_sext): Modified to emit LDAPURS
>>>>> for addressing with offsets.
>>>>> 
>>>>> gcc/testsuite/ChangeLog:
>>>>> 
>>>>> * gcc.target/aarch64/ldapur.c: New test.
>>>> 
>>>> Thanks for doing this.  It generally looks good, but a couple of comments
>>>> below:
>>>> 
>>>>> ---
>>>>> gcc/config/aarch64/aarch64-tuning-flags.def |  2 +
>>>>> gcc/config/aarch64/aarch64.h                |  5 ++
>>>>> gcc/config/aarch64/aarch64.md               | 11 +++-
>>>>> gcc/config/aarch64/atomics.md               | 22 +++++---
>>>>> gcc/testsuite/gcc.target/aarch64/ldapur.c   | 61 +++++++++++++++++++++
>>>>> 5 files changed, 92 insertions(+), 9 deletions(-)
>>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldapur.c
>>>>> 
>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> index f2c916e9d77..5bf54165306 100644
>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>> @@ -44,6 +44,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
>>>>> AVOID_CROSS_LOOP_FMA)
>>>>> 
>>>>> AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA)
>>>>> 
>>>>> +AARCH64_EXTRA_TUNING_OPTION ("enable_ldapur", ENABLE_LDAPUR)
>>>>> +
>>>> 
>>>> Let's see what others say, but personally, I think this would be better
>>>> as an opt-out, such as avoid_ldapur.  The natural default seems to be to 
>>>> use
>>>> the extra addressing capacity when it's available and have CPUs explicitly
>>>> flag when they don't want that.
>>>> 
>>>> A good, conservatively correct, default would probably be to add 
>>>> avoid_ldapur
>>>> to every *current* CPU that includes rcpc2 and then separately remove it
>>>> from those that are known not to need it.  In that sense, it's more work
>>>> for current CPUs than the current patch, but it should ease the impact
>>>> on future CPUs.
>>> 
>>> LLVM used to do this folding by default everywhere until it was discovered 
>>> that it hurts various CPUs.
>>> So they’ve taken the approach you describe, and disable the folding 
>>> explicitly for:
>>> neoverse-v2 neoverse-v3 cortex-x3 cortex-x4 cortex-x925
>>> I don’t know for sure if those are the only CPUs where this applies.
>>> They also disable the folding for generic tuning when -march is between 
>>> armv8.4 - armv8.7/armv9.2.
>>> I guess we can do the same in GCC.
>> 
>> Thanks for your suggestions, Richard and Kyrill.
>> 
>> I've updated the patch to use avoid_ldapur.
>> 
>> There's now an explicit override in aarch64_override_options_internal to use 
>> avoid_ldapur for armv8.4 through armv8.7.
>> 
>> I added it here because aarch64_adjust_generic_arch_tuning is only called 
>> for generic_tunings and not generic_armv{8,9}_a_tunings.
>> 
>> Let me know what you think.
> 
> Sounds good to me.  I can see that we wouldn't want armv9.3-a+ generic
> tuning to be hampered by pre-armv9.3 cores.
> 
> But I'm not sure that we're deliberately avoiding calling
> aarch64_adjust_generic_arch_tuning for generic_armv{8,9}_a_tunings.
> The current TARGET_SVE2 behaviour makes conceptual sense for
> generic_armv8_a_tunings too, if someone (unsually) used
> -march=armv8-a+sve2.  The current TARGET_SVE2 behaviour would
> also be a nop for generic_armv9_a_tunings.
> 
> It's probably more that the current SVE2 behaviour isn't particularly
> important for generic_armv{8,9}_a_tunings since the defaults are already
> correct for the common cases.
> 
> So I think we should instead change it to:
> 
>  if (tune->tune == &generic_tunings
>      || tune->tune == &generic_armv8_a_tunings
>      || tune->tune == &generic_armv9_a_tunings)
>    aarch64_adjust_generic_arch_tuning (aarch64_tune_params);
> 
> and then add the LDAPUR override there, as you mentioned.  Also:
> 
>> +    /* Avoid LDAPUR for generic tuning for armv8.4 through armv8.7/armv9.2. 
>>  */
>> +    if (AARCH64_HAVE_ISA(V8_4A) && !AARCH64_HAVE_ISA(V8_8A))
>> +      aarch64_tune_params.extra_tuning_flags |= 
>> AARCH64_EXTRA_TUNE_AVOID_LDAPUR;
> 
> it might be better to drop the Armv8.4-A minimum.  If someone uses
> -march=armv8-a+rcpc2 then IMO we ought to avoid LDAPUR.
> 
>> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>> index e8bd8c73c12..c526594bd1b 100644
>> --- a/gcc/config/aarch64/aarch64.h
>> +++ b/gcc/config/aarch64/aarch64.h
>> @@ -490,6 +490,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE 
>> ATTRIBUTE_UNUSED
>>   (bool (aarch64_tune_params.extra_tuning_flags \
>>       & AARCH64_EXTRA_TUNE_CHEAP_FPMR_WRITE))
> 
>> +/* Avoid folding address computation into LDAPUR for specifc cores.  */
>> +#define TARGET_AVOID_LDAPUR (TARGET_RCPC2 \
>> +                          && (aarch64_tune_params.extra_tuning_flags \
>> +                              & AARCH64_EXTRA_TUNE_AVOID_LDAPUR))
>> +
>> /* Combinatorial tests.  */
> 
>> #define TARGET_SVE2_OR_SME2 \
>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>> index e11e13033d2..a9ea3cb0096 100644
>> --- a/gcc/config/aarch64/aarch64.md
>> +++ b/gcc/config/aarch64/aarch64.md
>> @@ -482,6 +482,8 @@
>> ;; clobber for SVE predicates.
>> (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
>> 
>> +(define_attr "avoid_ldapur" "any,no,yes" (const_string "any"))
>> +
>> ;; [For compatibility with Arm in pipeline models]
>> ;; Attribute that specifies whether or not the instruction touches fp
>> ;; registers.
>> @@ -506,7 +508,14 @@
>>        (eq_attr "pred_clobber" "yes")
>>        (match_test "TARGET_SVE_PRED_CLOBBER"))
>>      (eq_attr "pred_clobber" "any"))
>> -
>> +      (ior
>> +     (and
>> +       (eq_attr "avoid_ldapur" "yes")
>> +       (match_test "TARGET_AVOID_LDAPUR"))
>> +     (and
>> +       (eq_attr "avoid_ldapur" "no")
>> +       (match_test "!TARGET_AVOID_LDAPUR"))
>> +     (eq_attr "avoid_ldapur" "any"))
>>       (ior
>>      (eq_attr "arch" "any")
> 
> Sorry for the runaround, but: when I mentioned avoid_ldapur last time,
> I only meant that for the tuning structure.  The idea was to make it
> an opt-out rather than opt-in for core-specific tunings, so that we
> don't require every future core to explicitly say that it wants LDAPUR.
> 
> I think we should stick to TARGET_ENABLE_LDAPUR for the macro and
> "enable_ldapur" for the md attribute.
>


Hi Richard,

Thanks for the suggestions.

Before I respin the patch, could you please clarify if you mean:

a) have the macro as TARGET_ENABLE_LDAPUR and the md attribute as
enable_ldapur and disable selective cores/arches using
!TARGET_ENABLE_LDAPUR

or 

b) have enable_ldapur for the md attribute and TARGET_AVOID_LDAPUR
as the macro and handle the condition accordingly in the md files?

Sorry, just wanted to confirm because the wording from your earlier
review is confusing me:

>>>> Let's see what others say, but personally, I think this would be better
>>>> as an opt-out, such as avoid_ldapur.  The natural default seems to be to 
>>>> use
>>>> the extra addressing capacity when it's available and have CPUs explicitly
>>>> flag when they don't want that.


Thanks,
Soumya

> Thanks,
> Richard

Re: [PATCH] aarch64: Enable selective LDAPUR generation for cores with RCPC2

Reply via email to