Re: [PATCH] aarch64: Enable selective LDAPUR generation for cores with RCPC2

Richard Sandiford Thu, 10 Jul 2025 04:51:14 -0700

Soumya AR <soum...@nvidia.com> writes:
>> On 10 Jul 2025, at 3:15 PM, Richard Sandiford <richard.sandif...@arm.com> 
>> wrote:
>> 
>> External email: Use caution opening links or attachments
>> 
>> 
>> Soumya AR <soum...@nvidia.com> writes:
>>>> On 1 Jul 2025, at 9:22 PM, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On 1 Jul 2025, at 17:36, Richard Sandiford <richard.sandif...@arm.com> 
>>>>> wrote:
>>>>> 
>>>>> Soumya AR <soum...@nvidia.com> writes:
>>>>>> From 2a2c3e3683aaf3041524df166fc6f8cf20895a0b Mon Sep 17 00:00:00 2001
>>>>>> From: Soumya AR <soum...@nvidia.com>
>>>>>> Date: Mon, 30 Jun 2025 12:17:30 -0700
>>>>>> Subject: [PATCH] aarch64: Enable selective LDAPUR generation for cores 
>>>>>> with
>>>>>> RCPC2
>>>>>> 
>>>>>> This patch adds the ability to fold the address computation into the 
>>>>>> addressing
>>>>>> mode for LDAPR instructions using LDAPUR when RCPC2 is available.
>>>>>> 
>>>>>> LDAPUR emission is controlled by the tune flag enable_ldapur, to enable 
>>>>>> it on a
>>>>>> per-core basis. Earlier, the following code:
>>>>>> 
>>>>>> uint64_t
>>>>>> foo (std::atomic<uint64_t> *x)
>>>>>> {
>>>>>> return x[1].load(std::memory_order_acquire);
>>>>>> }
>>>>>> 
>>>>>> would generate:
>>>>>> 
>>>>>> foo(std::atomic<unsigned long>*):
>>>>>> add     x0, x0, 8
>>>>>> ldapr   x0, [x0]
>>>>>> ret
>>>>>> 
>>>>>> but now generates:
>>>>>> 
>>>>>> foo(std::atomic<unsigned long>*):
>>>>>> ldapur  x0, [x0, 8]
>>>>>> ret
>>>>>> 
>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>>>> regression.
>>>>>> OK for mainline?
>>>>>> 
>>>>>> Signed-off-by: Soumya AR <soum...@nvidia.com>
>>>>>> 
>>>>>> gcc/ChangeLog:
>>>>>> 
>>>>>> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
>>>>>> Add the enable_ldapur flag to conwtrol LDAPUR emission.
>>>>>> * config/aarch64/aarch64.h (TARGET_ENABLE_LDAPUR): Use new flag.
>>>>>> * config/aarch64/aarch64.md (any): Add ldapur_enable attribute.
>>>>>> * config/aarch64/atomics.md: (aarch64_atomic_load<mode>_rcpc): Modify
>>>>>> to emit LDAPUR for cores with RCPC2 when enable_ldapur is set.
>>>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_zext): Likewise.
>>>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_sext): Modified to emit LDAPURS
>>>>>> for addressing with offsets.
>>>>>> 
>>>>>> gcc/testsuite/ChangeLog:
>>>>>> 
>>>>>> * gcc.target/aarch64/ldapur.c: New test.
>>>>> 
>>>>> Thanks for doing this.  It generally looks good, but a couple of comments
>>>>> below:
>>>>> 
>>>>>> ---
>>>>>> gcc/config/aarch64/aarch64-tuning-flags.def |  2 +
>>>>>> gcc/config/aarch64/aarch64.h                |  5 ++
>>>>>> gcc/config/aarch64/aarch64.md               | 11 +++-
>>>>>> gcc/config/aarch64/atomics.md               | 22 +++++---
>>>>>> gcc/testsuite/gcc.target/aarch64/ldapur.c   | 61 +++++++++++++++++++++
>>>>>> 5 files changed, 92 insertions(+), 9 deletions(-)
>>>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldapur.c
>>>>>> 
>>>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> index f2c916e9d77..5bf54165306 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>>>> @@ -44,6 +44,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
>>>>>> AVOID_CROSS_LOOP_FMA)
>>>>>> 
>>>>>> AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA)
>>>>>> 
>>>>>> +AARCH64_EXTRA_TUNING_OPTION ("enable_ldapur", ENABLE_LDAPUR)
>>>>>> +
>>>>> 
>>>>> Let's see what others say, but personally, I think this would be better
>>>>> as an opt-out, such as avoid_ldapur.  The natural default seems to be to 
>>>>> use
>>>>> the extra addressing capacity when it's available and have CPUs explicitly
>>>>> flag when they don't want that.
>>>>> 
>>>>> A good, conservatively correct, default would probably be to add 
>>>>> avoid_ldapur
>>>>> to every *current* CPU that includes rcpc2 and then separately remove it
>>>>> from those that are known not to need it.  In that sense, it's more work
>>>>> for current CPUs than the current patch, but it should ease the impact
>>>>> on future CPUs.
>>>> 
>>>> LLVM used to do this folding by default everywhere until it was discovered 
>>>> that it hurts various CPUs.
>>>> So they’ve taken the approach you describe, and disable the folding 
>>>> explicitly for:
>>>> neoverse-v2 neoverse-v3 cortex-x3 cortex-x4 cortex-x925
>>>> I don’t know for sure if those are the only CPUs where this applies.
>>>> They also disable the folding for generic tuning when -march is between 
>>>> armv8.4 - armv8.7/armv9.2.
>>>> I guess we can do the same in GCC.
>>> 
>>> Thanks for your suggestions, Richard and Kyrill.
>>> 
>>> I've updated the patch to use avoid_ldapur.
>>> 
>>> There's now an explicit override in aarch64_override_options_internal to 
>>> use avoid_ldapur for armv8.4 through armv8.7.
>>> 
>>> I added it here because aarch64_adjust_generic_arch_tuning is only called 
>>> for generic_tunings and not generic_armv{8,9}_a_tunings.
>>> 
>>> Let me know what you think.
>> 
>> Sounds good to me.  I can see that we wouldn't want armv9.3-a+ generic
>> tuning to be hampered by pre-armv9.3 cores.
>> 
>> But I'm not sure that we're deliberately avoiding calling
>> aarch64_adjust_generic_arch_tuning for generic_armv{8,9}_a_tunings.
>> The current TARGET_SVE2 behaviour makes conceptual sense for
>> generic_armv8_a_tunings too, if someone (unsually) used
>> -march=armv8-a+sve2.  The current TARGET_SVE2 behaviour would
>> also be a nop for generic_armv9_a_tunings.
>> 
>> It's probably more that the current SVE2 behaviour isn't particularly
>> important for generic_armv{8,9}_a_tunings since the defaults are already
>> correct for the common cases.
>> 
>> So I think we should instead change it to:
>> 
>>  if (tune->tune == &generic_tunings
>>      || tune->tune == &generic_armv8_a_tunings
>>      || tune->tune == &generic_armv9_a_tunings)
>>    aarch64_adjust_generic_arch_tuning (aarch64_tune_params);
>> 
>> and then add the LDAPUR override there, as you mentioned.  Also:
>> 
>>> +    /* Avoid LDAPUR for generic tuning for armv8.4 through 
>>> armv8.7/armv9.2.  */
>>> +    if (AARCH64_HAVE_ISA(V8_4A) && !AARCH64_HAVE_ISA(V8_8A))
>>> +      aarch64_tune_params.extra_tuning_flags |= 
>>> AARCH64_EXTRA_TUNE_AVOID_LDAPUR;
>> 
>> it might be better to drop the Armv8.4-A minimum.  If someone uses
>> -march=armv8-a+rcpc2 then IMO we ought to avoid LDAPUR.
>> 
>>> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
>>> index e8bd8c73c12..c526594bd1b 100644
>>> --- a/gcc/config/aarch64/aarch64.h
>>> +++ b/gcc/config/aarch64/aarch64.h
>>> @@ -490,6 +490,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE 
>>> ATTRIBUTE_UNUSED
>>>   (bool (aarch64_tune_params.extra_tuning_flags \
>>>       & AARCH64_EXTRA_TUNE_CHEAP_FPMR_WRITE))
>> 
>>> +/* Avoid folding address computation into LDAPUR for specifc cores.  */
>>> +#define TARGET_AVOID_LDAPUR (TARGET_RCPC2 \
>>> +                          && (aarch64_tune_params.extra_tuning_flags \
>>> +                              & AARCH64_EXTRA_TUNE_AVOID_LDAPUR))
>>> +
>>> /* Combinatorial tests.  */
>> 
>>> #define TARGET_SVE2_OR_SME2 \
>>> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
>>> index e11e13033d2..a9ea3cb0096 100644
>>> --- a/gcc/config/aarch64/aarch64.md
>>> +++ b/gcc/config/aarch64/aarch64.md
>>> @@ -482,6 +482,8 @@
>>> ;; clobber for SVE predicates.
>>> (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
>>> 
>>> +(define_attr "avoid_ldapur" "any,no,yes" (const_string "any"))
>>> +
>>> ;; [For compatibility with Arm in pipeline models]
>>> ;; Attribute that specifies whether or not the instruction touches fp
>>> ;; registers.
>>> @@ -506,7 +508,14 @@
>>>        (eq_attr "pred_clobber" "yes")
>>>        (match_test "TARGET_SVE_PRED_CLOBBER"))
>>>      (eq_attr "pred_clobber" "any"))
>>> -
>>> +      (ior
>>> +     (and
>>> +       (eq_attr "avoid_ldapur" "yes")
>>> +       (match_test "TARGET_AVOID_LDAPUR"))
>>> +     (and
>>> +       (eq_attr "avoid_ldapur" "no")
>>> +       (match_test "!TARGET_AVOID_LDAPUR"))
>>> +     (eq_attr "avoid_ldapur" "any"))
>>>       (ior
>>>      (eq_attr "arch" "any")
>> 
>> Sorry for the runaround, but: when I mentioned avoid_ldapur last time,
>> I only meant that for the tuning structure.  The idea was to make it
>> an opt-out rather than opt-in for core-specific tunings, so that we
>> don't require every future core to explicitly say that it wants LDAPUR.
>> 
>> I think we should stick to TARGET_ENABLE_LDAPUR for the macro and
>> "enable_ldapur" for the md attribute.
>> 
>
> Hi Richard,
>
> Thanks for the suggestions.
>
> Before I respin the patch, could you please clarify if you mean:
>
> a) have the macro as TARGET_ENABLE_LDAPUR and the md attribute as
> enable_ldapur and disable selective cores/arches using
> !TARGET_ENABLE_LDAPUR
>
> or 
>
> b) have enable_ldapur for the md attribute and TARGET_AVOID_LDAPUR
> as the macro and handle the condition accordingly in the md files?


I meant (a).  That is:

- AARCH64_EXTRA_TUNING_OPTION: "avoid_ldapur" (as in your new patch)
- tuning structures use AARCH64_EXTRA_TUNE_AVOID_LDAPUR (as in your new patch)
- #define: TARGET_ENABLE_LDAPUR (as in your original patch)
- define_attr: "enable_ldapur" (a variant of your original patch)

The macro would be defined as:

/* Enable folding address computation into LDAPUR when RCPC2 is available.  */
#define TARGET_ENABLE_LDAPUR (TARGET_RCPC2 \
                              && !(aarch64_tune_params.extra_tuning_flags \
                                   & AARCH64_EXTRA_TUNE_AVOID_LDAPUR))

> Sorry, just wanted to confirm because the wording from your earlier
> review is confusing me:
>
>>>>> Let's see what others say, but personally, I think this would be better
>>>>> as an opt-out, such as avoid_ldapur.  The natural default seems to be to 
>>>>> use
>>>>> the extra addressing capacity when it's available and have CPUs explicitly
>>>>> flag when they don't want that.

Yeah, this was all talking specifically about the tuning option.
What I didn't want was for some CPU added 10 years from now to have
to explicitly add AARCH64_EXTRA_TUNE_ENABLE_LDAPUR to its tuning
structure in order to use LDAPUR.  The set of CPUs that struggle
with LDAPUR is hopefully frozen and it isn't something that future
CPUs should have to worry about.

So the principle was: tuning options should be structured so that CPUs
explicitly opt out of standard architecture features, rather than opt in.
But code generation should continue to be gated on target macros that
say "use these instructions" rather than "don't use these instructions",
like it is for other things.  The fact that the tuning structure
provides an opt-out should be localised to the target macro.

Sorry for the confusion!

Richard

Re: [PATCH] aarch64: Enable selective LDAPUR generation for cores with RCPC2

Reply via email to