https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607

            Bug ID: 89607
           Summary: Missing optimization for store of multiple registers
                    on arm and aarch64
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Test code, compiled for arm/aarch64 with -O1/-O2/-O3/-Os/-Ofast:

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

arm:

```
f4:
        vld2.32 {d16-d19}, [r1]
        vst1.64 {d16-d19}, [r0:64]
        bx      lr
f5:
        vst1.64 {d0-d1}, [r0:64]
        vstr    d2, [r0, #16]
        vstr    d3, [r0, #24]
        bx      lr
```

aarch64:

```
f4:
        ld2     {v0.4s - v1.4s}, [x1]
        str     q0, [x0]
        str     q1, [x0, 16]
        ret
f5:
        str     q0, [x0]
        str     q1, [x0, 16]
        ret
```

For arm, it seems that `f5` could follow `f4` and use a single
`vst1.64 {d0-d3}, [r0:64]` instead. For aarch64, both functions should have
used a `stp q0, q1, [x0]`.
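
For reference, a sketch of the output I would expect, derived by hand from the
listings above by substituting the suggested stores. It is not actual compiler
output, and register allocation and alignment hints are assumed unchanged.

```
# arm
f5:
        vst1.64 {d0-d3}, [r0:64]
        bx      lr

# aarch64
f4:
        ld2     {v0.4s - v1.4s}, [x1]
        stp     q0, q1, [x0]
        ret
f5:
        stp     q0, q1, [x0]
        ret
```

That would save two instructions in `f5` on arm and one instruction in each
function on aarch64.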

Clang produces what I expected on aarch64, but on arm it only uses pair store
instructions, which takes one more instruction for `f4` and one fewer for `f5`
than GCC. (I'm not sure why GCC decided to use a pair store followed by two
single stores.)

As with PR89606, this optimization should happen at least with `-Os`, if not
at all other optimization levels.

Tested with 8.2.1 on arm and 8.3.0 on aarch64.
