https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962
Bug ID: 95962
Summary: Inefficient code for simple arm_neon.h iota operation
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
Target Milestone: ---
Target: aarch64*-*-*

For:

#include <arm_neon.h>
int32x4_t foo (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  return vld1q_s32 (array);
}

we produce:

foo:
.LFB4217:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        mov     x0, 2
        mov     x1, 4294967296
        movk    x0, 0x3, lsl 32
        stp     x1, x0, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

In contrast, clang produces essentially perfect code:

        adrp    x8, .LCPI0_0
        ldr     q0, [x8, :lo12:.LCPI0_0]
        ret

I think the problem is a combination of two things:

- __builtin_aarch64_ld1v4si & co. are treated as general functions rather
  than pure functions, so in principle they could write to the given
  address.  This stops us from promoting the array to a constant.

- The loads could be reduced to native gimple-level operations, at least
  on little-endian targets.

IMO this is a bug rather than an enhancement.  Intrinsics only exist to
optimise code, and what GCC is doing falls short of what users should
reasonably expect.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64