https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95974
            Bug ID: 95974
           Summary: AArch64 arm_neon.h stores interfere with gimple
                    optimisations
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
            Blocks: 95958
  Target Milestone: ---
            Target: aarch64*-*-*

For:

---------------------------------------
#include <arm_neon.h>
#include <vector>

std::vector<float> a;

void f (size_t n, float32x4_t v)
{
  for (size_t i = 0; i < n; i += 4)
    vst1q_f32 (&a[i], v);
}
---------------------------------------

we generate code that loads the start address of "a" in every iteration
of the loop:

---------------------------------------
        cbz     x0, .L4
        adrp    x4, .LANCHOR0
        add     x4, x4, :lo12:.LANCHOR0
        mov     x1, 0
        .p2align 3,,7
.L6:
        ldr     x3, [x4]
        lsl     x2, x1, 2
        add     x1, x1, 4
        str     q0, [x3, x2]
        cmp     x0, x1
        bhi     .L6
.L4:
        ret
---------------------------------------

This is really the store equivalent of PR95962.  The problem is that
__builtin_aarch64_st1v4sf is modelled as a general function that could
read from and write to arbitrary memory.  As with PR95962, one option
would be to lower the builtin to gimple memory accesses where possible,
at least for little-endian targets; a sketch of the idea follows the
references below.

Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
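
As a rough source-level illustration of the lowering idea (not a patch,
and not how arm_neon.h actually defines the intrinsic): on a
little-endian target, vst1q_f32 is just a 16-byte store that only
requires element alignment, so it could be expressed as an ordinary
vector store that the alias oracle can reason about.  The typedef and
function names below are hypothetical, chosen purely for illustration:

---------------------------------------
#include <arm_neon.h>

/* A float32x4_t store type with only element (4-byte) alignment,
   matching the alignment guarantee of vst1q_f32.  On a typedef,
   GCC's "aligned" attribute may decrease as well as increase the
   alignment.  */
typedef float32x4_t unaligned_f32x4 __attribute__ ((__aligned__ (4)));

/* Hypothetical lowered form of vst1q_f32: a plain vector store,
   visible to gimple, instead of an opaque builtin call that might
   touch any memory.  */
static inline void
vst1q_f32_lowered (float32_t *p, float32x4_t v)
{
  *(unaligned_f32x4 *) p = v;
}
---------------------------------------

With a definition along these lines, the store is visible to gimple as
writing only 16 bytes of float data through its pointer argument, so
passes such as loop-invariant motion should be free to hoist the load
of "a"'s data pointer out of the loop in the testcase above.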