https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95964

            Bug ID: 95964
           Summary: AArch64 arm_neon.h arithmetic functions lack
                    appropriate attributes
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
            Blocks: 95958
  Target Milestone: ---
            Target: aarch64*-*-*

For:

---------------------------------------
#include <arm_neon.h>
#include <cstddef>
#include <vector>

std::vector<float32x4_t> a, b, c;

void
foo (size_t n)
{
  for (size_t i = 0; i < n; ++i)
    a[i] = vfmaq_f32(a[i], b[i], c[i]);
}
---------------------------------------

we generate code that reloads the start addresses of a, b and c
in every iteration of the loop:

---------------------------------------
        .cfi_startproc
        cbz     x0, .L4
        adrp    x3, .LANCHOR0
        add     x3, x3, :lo12:.LANCHOR0
        mov     x2, 0
        .p2align 3,,7
.L6:
        ldr     x4, [x3]
        lsl     x1, x2, 4
        ldr     x6, [x3, 24]
        add     x2, x2, 1
        ldr     x5, [x3, 48]
        ldr     q0, [x4, x1]
        ldr     q2, [x6, x1]
        ldr     q1, [x5, x1]
        fmla    v0.4s, v2.4s, v1.4s
        str     q0, [x4, x1]
        cmp     x0, x2
        bne     .L6
.L4:
        ret
        .cfi_endproc
---------------------------------------

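The problem is that __builtin_aarch64_fmav4sf and similar
built-ins are treated as general functions that can read
memory, write memory, and call other functions.  The compiler
must therefore assume that each call could modify the data
pointers of a, b and c, so it reloads them in every iteration.
If the intrinsic is replaced by arithmetic then the start
addresses are hoisted, as expected.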
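
For comparison, a minimal sketch of the "replaced by arithmetic"
variant.  It is not taken from the report: the v4sf typedef is a
generic GNU vector-extension stand-in for float32x4_t so that the
example builds on targets other than AArch64, and foo_arith is a
hypothetical name.  Whether the multiply and add are contracted
into a single fmla depends on the -ffp-contract setting, so results
may differ from vfmaq_f32 in the last bit.

---------------------------------------
```cpp
#include <cstddef>
#include <vector>

// Assumption: generic 4 x float vector type standing in for float32x4_t.
typedef float v4sf __attribute__ ((vector_size (16)));

std::vector<v4sf> a, b, c;

// Same loop as foo, but written with plain vector arithmetic instead
// of the vfmaq_f32 intrinsic.  No opaque call remains in the loop body.
void
foo_arith (std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    a[i] = a[i] + b[i] * c[i];
}
```
---------------------------------------

Since the loop body then contains only loads, arithmetic and a store
of vector data, nothing in it can be assumed to clobber the vectors'
internal pointers, which is what should allow the three start
addresses to be loaded once before the loop.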


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
