Soumya AR <soum...@nvidia.com> writes:
>> On 1 Jul 2025, at 9:22 PM, Kyrylo Tkachov <ktkac...@nvidia.com> wrote:
>> 
>> 
>> 
>>> On 1 Jul 2025, at 17:36, Richard Sandiford <richard.sandif...@arm.com> 
>>> wrote:
>>> 
>>> Soumya AR <soum...@nvidia.com> writes:
>>>> From 2a2c3e3683aaf3041524df166fc6f8cf20895a0b Mon Sep 17 00:00:00 2001
>>>> From: Soumya AR <soum...@nvidia.com>
>>>> Date: Mon, 30 Jun 2025 12:17:30 -0700
>>>> Subject: [PATCH] aarch64: Enable selective LDAPUR generation for cores with
>>>> RCPC2
>>>> 
>>>> This patch adds the ability to fold the address computation into the
>>>> addressing mode for LDAPR instructions, using LDAPUR when RCPC2 is
>>>> available.
>>>> 
>>>> LDAPUR emission is controlled by the tune flag enable_ldapur, to enable
>>>> it on a per-core basis. Earlier, the following code:
>>>> 
>>>> uint64_t
>>>> foo (std::atomic<uint64_t> *x)
>>>> {
>>>> return x[1].load(std::memory_order_acquire);
>>>> }
>>>> 
>>>> would generate:
>>>> 
>>>> foo(std::atomic<unsigned long>*):
>>>> add     x0, x0, 8
>>>> ldapr   x0, [x0]
>>>> ret
>>>> 
>>>> but now generates:
>>>> 
>>>> foo(std::atomic<unsigned long>*):
>>>> ldapur  x0, [x0, 8]
>>>> ret
>>>> 
>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu with no
>>>> regressions.
>>>> OK for mainline?
>>>> 
>>>> Signed-off-by: Soumya AR <soum...@nvidia.com>
>>>> 
>>>> gcc/ChangeLog:
>>>> 
>>>> * config/aarch64/aarch64-tuning-flags.def (AARCH64_EXTRA_TUNING_OPTION):
>>>> Add the enable_ldapur flag to control LDAPUR emission.
>>>> * config/aarch64/aarch64.h (TARGET_ENABLE_LDAPUR): Use new flag.
>>>> * config/aarch64/aarch64.md (any): Add ldapur_enable attribute.
>>>> * config/aarch64/atomics.md: (aarch64_atomic_load<mode>_rcpc): Modify
>>>> to emit LDAPUR for cores with RCPC2 when enable_ldapur is set.
>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_zext): Likewise.
>>>> (*aarch64_atomic_load<ALLX:mode>_rcpc_sext): Modified to emit LDAPURS
>>>> for addressing with offsets.
>>>> 
>>>> gcc/testsuite/ChangeLog:
>>>> 
>>>> * gcc.target/aarch64/ldapur.c: New test.
>>> 
>>> Thanks for doing this.  It generally looks good, but a couple of comments
>>> below:
>>> 
>>>> ---
>>>> gcc/config/aarch64/aarch64-tuning-flags.def |  2 +
>>>> gcc/config/aarch64/aarch64.h                |  5 ++
>>>> gcc/config/aarch64/aarch64.md               | 11 +++-
>>>> gcc/config/aarch64/atomics.md               | 22 +++++---
>>>> gcc/testsuite/gcc.target/aarch64/ldapur.c   | 61 +++++++++++++++++++++
>>>> 5 files changed, 92 insertions(+), 9 deletions(-)
>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/ldapur.c
>>>> 
>>>> diff --git a/gcc/config/aarch64/aarch64-tuning-flags.def 
>>>> b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> index f2c916e9d77..5bf54165306 100644
>>>> --- a/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> +++ b/gcc/config/aarch64/aarch64-tuning-flags.def
>>>> @@ -44,6 +44,8 @@ AARCH64_EXTRA_TUNING_OPTION ("avoid_cross_loop_fma", 
>>>> AVOID_CROSS_LOOP_FMA)
>>>> 
>>>> AARCH64_EXTRA_TUNING_OPTION ("fully_pipelined_fma", FULLY_PIPELINED_FMA)
>>>> 
>>>> +AARCH64_EXTRA_TUNING_OPTION ("enable_ldapur", ENABLE_LDAPUR)
>>>> +
>>> 
>>> Let's see what others say, but personally, I think this would be better
>>> as an opt-out, such as avoid_ldapur.  The natural default seems to be to use
>>> the extra addressing capacity when it's available and have CPUs explicitly
>>> flag when they don't want that.
>>> 
>>> A good, conservatively correct, default would probably be to add
>>> avoid_ldapur to every *current* CPU that includes rcpc2 and then
>>> separately remove it from those that are known not to need it.  In that
>>> sense, it's more work for current CPUs than the current patch, but it
>>> should ease the impact on future CPUs.
>> 
>> LLVM used to do this folding by default everywhere until it was
>> discovered that it hurts various CPUs.
>> So they’ve taken the approach you describe, and disable the folding
>> explicitly for: neoverse-v2, neoverse-v3, cortex-x3, cortex-x4 and
>> cortex-x925.
>> I don’t know for sure if those are the only CPUs where this applies.
>> They also disable the folding for generic tuning when -march is between
>> armv8.4 and armv8.7/armv9.2.
>> I guess we can do the same in GCC.
>
> Thanks for your suggestions, Richard and Kyrill.
>
> I've updated the patch to use avoid_ldapur.
>
> There's now an explicit override in aarch64_override_options_internal to use 
> avoid_ldapur for armv8.4 through armv8.7. 
>
> I added it here because aarch64_adjust_generic_arch_tuning is only called for 
> generic_tunings and not generic_armv{8,9}_a_tunings.
>
> Let me know what you think.

Sounds good to me.  I can see that we wouldn't want armv9.3-a+ generic
tuning to be hampered by pre-armv9.3 cores.

But I'm not sure that we're deliberately avoiding calling
aarch64_adjust_generic_arch_tuning for generic_armv{8,9}_a_tunings.
The current TARGET_SVE2 behaviour makes conceptual sense for
generic_armv8_a_tunings too, if someone (unusually) used
-march=armv8-a+sve2.  The current TARGET_SVE2 behaviour would
also be a nop for generic_armv9_a_tunings.

It's probably more that the current SVE2 behaviour isn't particularly
important for generic_armv{8,9}_a_tunings since the defaults are already
correct for the common cases.

So I think we should instead change it to:

  if (tune->tune == &generic_tunings
      || tune->tune == &generic_armv8_a_tunings
      || tune->tune == &generic_armv9_a_tunings)
    aarch64_adjust_generic_arch_tuning (aarch64_tune_params);

and then add the LDAPUR override there, as you mentioned.  Also:

> +    /* Avoid LDAPUR for generic tuning for armv8.4 through armv8.7/armv9.2.  */
> +    if (AARCH64_HAVE_ISA(V8_4A) && !AARCH64_HAVE_ISA(V8_8A))
> +      aarch64_tune_params.extra_tuning_flags |= AARCH64_EXTRA_TUNE_AVOID_LDAPUR;

it might be better to drop the Armv8.4-A minimum.  If someone uses
-march=armv8-a+rcpc2 then IMO we ought to avoid LDAPUR.
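
Something along these lines is what I had in mind (an untested sketch,
and it assumes TARGET_RCPC2 is already usable at the point where the
override runs):

    /* Avoid LDAPUR for generic tuning whenever the unscaled-offset form
       is available but the baseline is below Armv8.8-A.  */
    if (TARGET_RCPC2 && !AARCH64_HAVE_ISA (V8_8A))
      aarch64_tune_params.extra_tuning_flags
	|= AARCH64_EXTRA_TUNE_AVOID_LDAPUR;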

> diff --git a/gcc/config/aarch64/aarch64.h b/gcc/config/aarch64/aarch64.h
> index e8bd8c73c12..c526594bd1b 100644
> --- a/gcc/config/aarch64/aarch64.h
> +++ b/gcc/config/aarch64/aarch64.h
> @@ -490,6 +490,11 @@ constexpr auto AARCH64_FL_DEFAULT_ISA_MODE ATTRIBUTE_UNUSED
>    (bool (aarch64_tune_params.extra_tuning_flags \
>        & AARCH64_EXTRA_TUNE_CHEAP_FPMR_WRITE))
 
> +/* Avoid folding address computation into LDAPUR for specific cores.  */
> +#define TARGET_AVOID_LDAPUR (TARGET_RCPC2 \
> +                          && (aarch64_tune_params.extra_tuning_flags \
> +                              & AARCH64_EXTRA_TUNE_AVOID_LDAPUR))
> +
>  /* Combinatorial tests.  */
 
>  #define TARGET_SVE2_OR_SME2 \
> diff --git a/gcc/config/aarch64/aarch64.md b/gcc/config/aarch64/aarch64.md
> index e11e13033d2..a9ea3cb0096 100644
> --- a/gcc/config/aarch64/aarch64.md
> +++ b/gcc/config/aarch64/aarch64.md
> @@ -482,6 +482,8 @@
>  ;; clobber for SVE predicates.
>  (define_attr "pred_clobber" "any,no,yes" (const_string "any"))
> 
> +(define_attr "avoid_ldapur" "any,no,yes" (const_string "any"))
> +
>  ;; [For compatibility with Arm in pipeline models]
>  ;; Attribute that specifies whether or not the instruction touches fp
>  ;; registers.
> @@ -506,7 +508,14 @@
>         (eq_attr "pred_clobber" "yes")
>         (match_test "TARGET_SVE_PRED_CLOBBER"))
>       (eq_attr "pred_clobber" "any"))
> -
> +      (ior
> +     (and
> +       (eq_attr "avoid_ldapur" "yes")
> +       (match_test "TARGET_AVOID_LDAPUR"))
> +     (and
> +       (eq_attr "avoid_ldapur" "no")
> +       (match_test "!TARGET_AVOID_LDAPUR"))
> +     (eq_attr "avoid_ldapur" "any"))
>        (ior
>       (eq_attr "arch" "any")
 
Sorry for the runaround, but: when I mentioned avoid_ldapur last time,
I only meant that for the tuning structure.  The idea was to make it
an opt-out rather than opt-in for core-specific tunings, so that we
don't require every future core to explicitly say that it wants LDAPUR.

I think we should stick to TARGET_ENABLE_LDAPUR for the macro and
"enable_ldapur" for the md attribute.

Thanks,
Richard
