https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95974

            Bug ID: 95974
           Summary: AArch64 arm_neon.h stores interfere with gimple
                    optimisations
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
            Blocks: 95958
  Target Milestone: ---
            Target: aarch64*-*-*

For:

---------------------------------------
#include <arm_neon.h>
#include <stddef.h>
#include <vector>

std::vector<float> a;

void
f (size_t n, float32x4_t v)
{
  for (size_t i = 0; i < n; i += 4)
    vst1q_f32 (&a[i], v);
}
---------------------------------------

we generate code that reloads the start address of
"a" on every iteration of the loop:

---------------------------------------
        cbz     x0, .L4
        adrp    x4, .LANCHOR0
        add     x4, x4, :lo12:.LANCHOR0
        mov     x1, 0
        .p2align 3,,7
.L6:
        ldr     x3, [x4]
        lsl     x2, x1, 2
        add     x1, x1, 4
        str     q0, [x3, x2]
        cmp     x0, x1
        bhi     .L6
.L4:
        ret
---------------------------------------
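For comparison, simply hoisting the load of "a"'s data pointer out
of the loop would give something like the following (hand-edited
from the output above, not actual compiler output):

---------------------------------------
        cbz     x0, .L4
        adrp    x4, .LANCHOR0
        add     x4, x4, :lo12:.LANCHOR0
        ldr     x3, [x4]        // load "a"'s data pointer once
        mov     x1, 0
        .p2align 3,,7
.L6:
        lsl     x2, x1, 2
        add     x1, x1, 4
        str     q0, [x3, x2]
        cmp     x0, x1
        bhi     .L6
.L4:
        ret
---------------------------------------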

This is really the store equivalent of PR95962.  The problem is
that __builtin_aarch64_st1v4sf is modelled as a general function
that could read from and write to arbitrary memory, so the gimple
optimisers must assume that each store might clobber "a"'s data
pointer and cannot hoist the reload out of the loop.  As with
PR95962, one option would be to lower the builtin to a gimple
access where possible, at least for little-endian targets.
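
As a rough source-level illustration of what such a lowering could
mean (a sketch only: the typedef name is made up, and a plain vector
store like this only matches st1 lane ordering on little-endian):

---------------------------------------
#include <arm_neon.h>

/* Hypothetical element-aligned variant of float32x4_t, so that the
   plain store below does not claim the 16-byte alignment that
   vst1q_f32 does not require.  GCC allows a typedef to reduce
   alignment this way.  */
typedef float32x4_t f32x4_packed __attribute__ ((aligned (4)));

void
store_lowered (float *p, float32x4_t v)
{
  /* Roughly what lowering vst1q_f32 (p, v) to a gimple access would
     amount to: an ordinary vector store that alias analysis and
     loop-invariant motion can see through, rather than an opaque
     call to __builtin_aarch64_st1v4sf.  Little-endian only.  */
  *(f32x4_packed *) p = v;
}
---------------------------------------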


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95958
[Bug 95958] [meta-bug] Inefficient arm_neon.h code for AArch64
