https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95962

            Bug ID: 95962
           Summary: Inefficient code for simple arm_neon.h iota operation
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
            Blocks: 95958
  Target Milestone: ---
            Target: aarch64*-*-*

For:

#include <arm_neon.h>

int32x4_t
foo (void)
{
  int32_t array[] = { 0, 1, 2, 3 };
  return vld1q_s32 (array);
}

we produce:

foo:
.LFB4217:
        .cfi_startproc
        sub     sp, sp, #16
        .cfi_def_cfa_offset 16
        mov     x0, 2
        mov     x1, 4294967296
        movk    x0, 0x3, lsl 32
        stp     x1, x0, [sp]
        ldr     q0, [sp]
        add     sp, sp, 16
        .cfi_def_cfa_offset 0
        ret

In contrast, clang produces essentially perfect code:

        adrp    x8, .LCPI0_0
        ldr     q0, [x8, :lo12:.LCPI0_0]
        ret

I think the problem is a combination of two things:

- __builtin_aarch64_ld1v4si & co. are treated as general
  functions rather than pure functions, so in principle
  they could write to the given address.  This stops us
  promoting the array to a constant.

- The loads could be reduced to native gimple-level
  operations, at least on little-endian targets.

IMO this is a bug rather than an enhancement.  Intrinsics only
exist to optimise code, and what GCC is doing falls short
of what users should reasonably expect.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
