https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95964
Bug ID: 95964
Summary: AArch64 arm_neon.h arithmetic functions lack appropriate attributes
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Blocks: 95958
Target Milestone: ---
Target: aarch64*-*-*
For:
---------------------------------------
#include <arm_neon.h>
#include <vector>

std::vector<float32x4_t> a, b, c;

void
foo (size_t n)
{
  for (size_t i = 0; i < n; ++i)
    a[i] = vfmaq_f32(a[i], b[i], c[i]);
}
---------------------------------------
we generate code that reloads the start addresses of a, b and c
in every iteration of the loop:
---------------------------------------
        .cfi_startproc
        cbz     x0, .L4
        adrp    x3, .LANCHOR0
        add     x3, x3, :lo12:.LANCHOR0
        mov     x2, 0
        .p2align 3,,7
.L6:
        ldr     x4, [x3]
        lsl     x1, x2, 4
        ldr     x6, [x3, 24]
        add     x2, x2, 1
        ldr     x5, [x3, 48]
        ldr     q0, [x4, x1]
        ldr     q2, [x6, x1]
        ldr     q1, [x5, x1]
        fmla    v0.4s, v2.4s, v1.4s
        str     q0, [x4, x1]
        cmp     x0, x2
        bne     .L6
.L4:
        ret
        .cfi_endproc
---------------------------------------
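The problem is that __builtin_aarch64_fmav4sf and similar
operations are treated as general functions that can read
memory, write memory, and call other functions. The compiler
therefore has to assume that each call might modify a, b or c,
and so cannot hoist the loads of their start addresses out of
the loop. If the intrinsic is replaced by ordinary arithmetic
then the start addresses are hoisted, as expected.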
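
For comparison, here is a rough sketch of the "replaced by arithmetic"
variant. So that it builds on any target, it uses a GCC/Clang generic
vector type (v4sf, a name made up for this example) as a stand-in for
float32x4_t, and the function name foo_arith is likewise hypothetical.
On AArch64 the same expression should be writable directly on
float32x4_t operands. Whether the multiply and add are fused into a
single fmla depends on the floating-point contraction settings.

```cpp
#include <cstddef>
#include <vector>

// Assumption: stand-in for float32x4_t, a generic vector of four floats
// (GCC/Clang vector extension), so the sketch is not AArch64-specific.
typedef float v4sf __attribute__ ((vector_size (16)));

std::vector<v4sf> a, b, c;

// Same loop as foo above, but with the intrinsic call replaced by plain
// element-wise arithmetic.  There is no opaque call in the loop body, so
// the compiler can see that the vectors' start pointers are not modified.
void
foo_arith (std::size_t n)
{
  for (std::size_t i = 0; i < n; ++i)
    a[i] = a[i] + b[i] * c[i];
}
```

Comparing the assembly for this loop at -O2 against the output above is
one way to check that the three start-address loads move out of the loop.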
Referenced Bugs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64