This is a summary of discussions relative to the merge request created by Karl
Meakin (karmea01) <[email protected]> titled
aarch64: port NEON intrinsics to pragma-based framework
since its creation.
Description: This patch is a proof of concept patch which ports a few NEON
intrinsics (intrinsics defined in `arm_neon.h`) to the "pragma-based" framework
used by SVE/SME intrinsics.
If successful, I will follow up with further patches porting the rest.
tested with `make check`
changelog:
* v1: Initial revision
* v2: Appease `check_GNU_style.py`
* v3: Drop unrelated `.editorconfig` changes which were included by mistake
* v4:
* Address review comments
* Move reformatting of `config.gcc` into its own commit.
* Merge `aarch64-neon-builtins.cc` into `aarch64-sve-builtins.cc` and
rename it to `aarch64-acle-builtins.cc`
* v5: Fix codegen for big-endian targets
* v6 Improve codegen for `FEAT_SHA3` intrinsics (`veor3`, `vbcax`, `vrax1` and
`vxar`) at `-O0`.
* v7 Delete `aarch64-neon-builtins.cc` again after it somehow got reintroduced
in v6
* v8 Remove RFC tag, rebase against master
* v9 Use the new `IFN_BITREVERSE` when lowering `rbit`
* v10:
* Address review comments
* Split the commit porting vector manipulation intrinsics into two commits:
one for vector creation, and one for lane getters and setters
CC: [email protected]
CC: [email protected]
CC: [email protected]
CC: [email protected]
The full and up to date discussion can be found at
https://forge.sourceware.org/gcc/gcc-TEST/pulls/158
The merge request has been closed without being merged directly on the forge
repository.
On 2026-05-13 16:28:51+00:00, Claudio Bantaloukas (rdfm) <[email protected]>
requested changes to the code:
This is a great start and I've very excited to see the patch series land. I
have some comments and hopefully more people will chime in.
> +++ .editorconfig
Seems unrelated :)
> +++ gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3333,3 +2529,1 @@
> - /* The type and range are unsigned, so read the argument as an
> - unsigned rather than signed HWI. */
> - if (!tree_fits_uhwi_p (arg))
> + if (tree_fits_shwi_p (arg))
should this function be moved in a more generic file than sve?
> +++ gcc/config/aarch64/aarch64-sve-builtins.cc
> @@ -3333,3 +2529,1 @@
> - /* The type and range are unsigned, so read the argument as an
> - unsigned rather than signed HWI. */
> - if (!tree_fits_uhwi_p (arg))
> + if (tree_fits_shwi_p (arg))
What about renaming the whole file to `aarch64-acle-builtins.cc`?
> +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_dup.c
> @@ -42,3 +42,3 @@
> return vdupq_lane_bf16 (a, 1);
> }
> -/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h,
> v\[0-9\]+.h\\\[0\\\]" 2 } } */
> +/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h,
> v\[0-9\]+.h\\\[0\\\]" 1 } } */
This is surprising, if this was a bug in the previous implementation or a
wanted change, please document it in the patch description.
> +++ gcc/testsuite/gcc.target/aarch64/advsimd-intrinsics/bf16_dup.c
> @@ -42,3 +42,3 @@
> return vdupq_lane_bf16 (a, 1);
> }
> -/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h,
> v\[0-9\]+.h\\\[0\\\]" 2 } } */
> +/* { dg-final { scan-assembler-times "dup\\tv\[0-9\]+\.8h,
> v\[0-9\]+.h\\\[0\\\]" 1 } } */
It's an improvement in the generated code. [Previously these two functions
generated](https://godbolt.org/z/jnEbKjqYf)
```asm
vdupq_test:
dup v0.8h, v0.h[0]
ret
test_vdupq_lane_bf16:
dup h0, v0.h[1]
dup v0.8h, v0.h[0]
ret
```
Now they generate
```asm
vdupq_test:
dup v0.8h, v0.h[0]
ret
test_vdupq_lane_bf16:
dup v0.8h, v0.h[1]
ret
```
> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> - /* Use vgetq_lane_u64 to get a
> - __builtin_aarch64_im_lane_boundsi */
> - vgetq_lane_u64(c, __b);
> + __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
@pinskia will this still trigger the case you fixed for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117665
> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> - /* Use vgetq_lane_u64 to get a
> - __builtin_aarch64_im_lane_boundsi */
> - vgetq_lane_u64(c, __b);
> + __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
Yes replacing it with __builtin_aarch64_im_lane_boundsi will invoke the ICE
that was previously there. But I suspect we should have 2 versions of the
testcase, one that still uses the intrinsics and one that uses the
`__builtin_aarch64_im_lane_boundsi` builtin. Since the original code used the
intrinsics and not the builtin directly. We want to make sure the original
testcase does not regress either.
> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> - /* Use vgetq_lane_u64 to get a
> - __builtin_aarch64_im_lane_boundsi */
> - vgetq_lane_u64(c, __b);
> + __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
The `vgetq_lane_u64` intrinsic does not call
`__builtin_aarch64_im_lane_boundsi` anymore, since the checking of the
arguments is now done directly. I could replace it with another intrinsic that
hasn't been ported yet, but then I would have to update the test again once
that intrinsic is ported to the new framework. Ultimately, once all the
intrinsics have been ported, `__builtin_aarch64_im_lane_boundsi` and this test
can be deleted.
> +++ gcc/testsuite/gcc.target/aarch64/lane-bound-3.c
> @@ -15,3 +15,1 @@
> - /* Use vgetq_lane_u64 to get a
> - __builtin_aarch64_im_lane_boundsi */
> - vgetq_lane_u64(c, __b);
> + __builtin_aarch64_im_lane_boundsi (sizeof (c), sizeof (c[0]), __b);
Actually thinking about this slightly more. We should do both testcases. One
with the original `vgetq_lane_u64` instrinsics and one with
`__builtin_aarch64_im_lane_boundsi`. To make sure the original testcase that
was provided in the bug report does not regress and one for the
`__builtin_aarch64_im_lane_boundsi` which we caused the issue.
> +++ gcc/testsuite/gcc.target/aarch64/sha3_1.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-march=armv8.2-a+sha3" } */
> +/* { dg-options "-O1 -march=armv8.2-a+sha3" } */
What kind of issues does the absence of -O1 cause?
> +++ gcc/testsuite/gcc.target/aarch64/sha3_1.c
> @@ -1,5 +1,5 @@
> /* { dg-do compile } */
> -/* { dg-options "-march=armv8.2-a+sha3" } */
> +/* { dg-options "-O1 -march=armv8.2-a+sha3" } */
There is a regression in code quality at `-O0`. Since the intrinsics now expand
to several GIMPLE statements, they produce assembly like
https://godbolt.org/z/jsjzbz19P
```asm
xor3(__Uint8x16_t, __Uint8x16_t, __Uint8x16_t):
sub sp, sp, #48
str q0, [sp, 32]
str q1, [sp, 16]
str q2, [sp]
ldr q30, [sp, 32]
ldr q31, [sp, 16]
eor v30.16b, v30.16b, v31.16b
ldr q31, [sp]
eor v31.16b, v30.16b, v31.16b
mov v0.16b, v31.16b
add sp, sp, 48
ret
```
instead of
```asm
xor3(__Uint8x16_t, __Uint8x16_t, __Uint8x16_t):
sub sp, sp, #96
str q0, [sp, 32]
str q1, [sp, 16]
str q2, [sp]
ldr q31, [sp, 32]
str q31, [sp, 80]
ldr q31, [sp, 16]
str q31, [sp, 64]
ldr q31, [sp]
str q31, [sp, 48]
ldr q31, [sp, 80]
ldr q29, [sp, 64]
ldr q30, [sp, 48]
eor3 v31.16b, v31.16b, v29.16b, v30.16b
nop
mov v0.16b, v31.16b
add sp, sp, 96
ret
```
The two EORs get combined into EOR3 in the RTL combine pass, which doesn't run
at `-O0`
> +++ gcc/testsuite/gcc.target/aarch64/sme/inlining_10.c
> @@ -51,7 +45,6 @@ void
> sc_caller () [[arm::inout("za"), arm::streaming_compatible]]
> {
> call_vadd ();
> - call_vbsl ();
bsl and all the other AdvSIMD functions are not compatible with straming mode.
See CheckFPAdvSIMDEnabled in
https://developer.arm.com/documentation/ddi0602/2026-03/Shared-Pseudocode/aarch64-exceptions-traps?lang=en#func_AArch64_CheckFPAdvSIMDEnabled_0
I think the patch should be amended to maintain the existing behaviour.
> +++ gcc/config.gcc
> @@ -365,0 +365,4 @@
> + 'arm_fp16.h'
> + 'arm_neon.h'
> + 'arm_bf16.h'
> + 'arm_acle.h'
This is not portable to /bin/sh (sorry!)
you could do something like `extra_objs="${extra_objs} aarch64-sve-builtins.o"`
instead
> +++ gcc/config.gcc
> @@ -362,3 +362,3 @@
> aarch64*-*-*)
> cpu_type=aarch64
> - extra_headers="arm_fp16.h arm_neon.h arm_bf16.h arm_acle.h arm_sve.h
> arm_sme.h arm_neon_sve_bridge.h arm_private_fp8.h arm_private_neon_types.h"
> + extra_headers=(
See comment on previous patch. These should be /bin/sh compliant.
On 2026-05-14 18:23:18+00:00, Drea Pinski (pinskia) <[email protected]>
requested changes to the code:
Move the config.gcc reformating to a seperate patch. I will let others decide
if the reformating is ok though.
I am ok with it but others might not be.
> +++ gcc/config.gcc
> @@ -365,0 +370,4 @@
> + extra_headers="${extra_headers} arm_sme.h"
> + extra_headers="${extra_headers} arm_neon_sve_bridge.h"
> + extra_headers="${extra_headers} arm_private_fp8.h"
> + extra_headers="${extra_headers} arm_private_neon_types.h"
Can you do reformating this as the first patch?
> +++ gcc/config.gcc
> @@ -370,0 +395,4 @@
> + extra_objs="${extra_objs} aarch64-narrow-gp-writes.o"
> + extra_objs="${extra_objs} aarch64-neon-builtins.o"
> + extra_objs="${extra_objs} aarch64-neon-builtins-base.o"
> + extra_objs="${extra_objs} aarch64-neon-builtins-shapes.o"
Likewise.
> +++ gcc/config.gcc
> @@ -370,0 +404,4 @@
> + target_gtfiles="${target_gtfiles}
> \$(srcdir)/config/aarch64/aarch64-acle-builtins.h"
> + target_gtfiles="${target_gtfiles}
> \$(srcdir)/config/aarch64/aarch64-sve-builtins.cc"
> + target_gtfiles="${target_gtfiles}
> \$(srcdir)/config/aarch64/aarch64-neon-builtins.cc"
> + target_gtfiles="${target_gtfiles}
> \$(srcdir)/config/aarch64/aarch64-neon-builtins.h"
Likewise.
On 2026-05-27 15:00:15+00:00, Karl Meakin (karmea01) <[email protected]>
commented on the code:
> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +328,4 @@
> +public:
> + constexpr gimple_not_rhs (tree_code code)
> + : m_code (code)
> + {}
An IFN for bitreverse over vectors may be coming soon. Keep an eye on
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=50481
On 2026-06-01 18:05:35+00:00, Karl Meakin (karmea01) wrote:
I've removed the RFC tag; I think its good enough to merge now. Ok to merge?
On 2026-06-02 08:35:16+00:00, Claudio Bantaloukas (rdfm) <[email protected]>
approved the changes:
LGTM apart from a small formatting nit. But I'm not a maintainer.
Thank you for addressing concerns.
> +++ gcc/config/aarch64/aarch64-builtins.cc
Spurious spacing?
On 2026-06-08 14:19:35+00:00, Karl Meakin (karmea01) wrote:
@pinskia ping?
On 2026-06-17 13:05:57+00:00, Kyrill Tkachov (ktkachov) <[email protected]>
commented on the code:
Thanks, this is a long-awaited transition. Some comments I've found
> +++ gcc/config/aarch64/aarch64-neon-builtins-base.def
> @@ -0,0 +71,4 @@
> +#undef REQUIRED_EXTENSIONS
> +
> +// Lanewise arithmetic (FP16)
> +#define REQUIRED_EXTENSIONS nonstreaming_only (AARCH64_FL_F16)
I think this needs to be AARCH64_FL_SIMD | AARCH64_FL_F16
> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +49,4 @@
> +/* Build a cast expression, `(TYPE)EXPR`, if necessary to make an expression
> + with type TYPE. */
> +tree
> +build_cast (tree type, tree expr)
This and other functions here have quite generic names and sit at global scope
with external linkage. I think these should be wrapped in the aarch64_acle
namespace or something
> +++ gcc/config/aarch64/t-aarch64
> @@ -71,0 +94,4 @@
> + $(COMPILER) -c $(ALL_COMPILERFLAGS) $(ALL_CPPFLAGS) $(INCLUDES) \
> + $(srcdir)/config/aarch64/aarch64-neon-builtins-base.cc
> +
> +aarch64-acle-builtins.o: $(srcdir)/config/aarch64/aarch64-acle-builtins.cc \
I think this rule should also include the new *neon*.def/.h files, as well as
gimple-fold.h
> +++ gcc/config/aarch64/aarch64-builtins.cc
> @@ -1974,3 +1731,3 @@
> aarch64_init_simd_builtin_functions (false);
> if (in_lto_p)
> - handle_arm_neon_h ();
> + init_arm_neon_builtins ();
Do we get in a situation where init_arm_neon_builtins is called twice? Does
this need to take to take arm_neon_h_handled into account?
> +++ gcc/config/aarch64/aarch64-builtins.cc
> @@ -1974,3 +1731,3 @@
> aarch64_init_simd_builtin_functions (false);
> if (in_lto_p)
> - handle_arm_neon_h ();
> + init_arm_neon_builtins ();
No, it is called either from `handle_arm_neon_h` (when encountering `#pragma
GCC aarch64 "arm_neon.h"`) or from `aarch64_init_simd_builtins` (when
initialising LTO). So it is called in two separate places, but will not be
called in both places in the same compilation.
`handle_arm_neon_h` checks for `arm_neon_h_handled`
On 2026-06-23 07:10:40+00:00, Tamar Christina (tnfchris) <[email protected]>
requested changes to the code:
Looks pretty good, just some minor changes.
> +++ gcc/testsuite/gcc.target/aarch64/sme/inlining_10.c
> @@ -21,1 +20,3 @@
> -call_vbsl () // { dg-error "inlining failed" }
> +// Gets expanded to bitwise select early, so no error. An error would be
> +// more correct though.
> +inline void __attribute__ ((always_inline))
I think BSL was chosen here not because we wanted to test `bsl` itself but
because we wanted to test the inlining behavior of a non-lowered intrinsics.
Now that you lower `bsl` instead of removing the error you should pick another
intrinsics that isn't lowered. Otherwise we both checks here doesn't check
inlining errors.
Same with the below.
> +++ gcc/testsuite/gcc.target/aarch64/neon/aarch64-neon.exp
> @@ -0,0 +33,4 @@
> +
> +# Main loop.
> +dg-runtest [lsort [glob -nocomplain $srcdir/$subdir/*\[cCs\]]] \
> + " -ansi -pedantic-errors -std=c23 -O3 -march=armv8-a+simd" ""
I think we just want -O2 here, since that's the standard compile flag for most.
Or even -O1 since intrinsics shouldn't rely on optimizations to produce the
result stated by ACLE.
> +++ gcc/config/aarch64/aarch64-neon-builtins-shapes.cc
> @@ -0,0 +108,4 @@
> + check_fn_t m_check_fn1;
> + check_fn_t m_check_fn2;
> +
> + void build (function_builder &b,
In the removed code above the comment said
```
case UNSPEC_VEC_COPY:
/* & rather than && so that we report errors against both indices. */
return (require_immediate_lane_index (1, 0)
& require_immediate_lane_index (3, 2));
```
and now we only report the first error. I think we should restore that
behavior, so just use `&` instead of `&&` here to report all errors at the same
time.
> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +302,4 @@
> +};
> +
> +struct gimple_dup_lane : public gimple_function_base
> +{
Doesn't this also need
```
if (optimize == 0)
return nullptr;
```
> +++ gcc/config/aarch64/aarch64-neon-builtins-base.cc
> @@ -0,0 +302,4 @@
> +};
> +
> +struct gimple_dup_lane : public gimple_function_base
> +{
No, we only emit RTL for SHA3 intrinsics because we want to ensure they emit a
single instruction even at `-O0`
> +++ gcc/config/aarch64/aarch64-simd.md
> @@ -9854,3 +9855,4 @@
> [(set_attr "type" "crypto_sha3")]
> )
>
> +(define_insn "aarch64_rax1qv2di"
this one isn't needed, just make the `*aarch64_rax1qv2di` above not anonymous.
On 2026-06-29 19:27:31+00:00, Tamar Christina (tnfchris) <[email protected]>
approved the changes:
Thanks! Lets get this in.
On 2026-06-30 14:48:21+00:00, Karl Meakin (karmea01) wrote:
Merged