https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607
Bug ID: 89607
Summary: Missing optimization for store of multiple registers on arm and aarch64
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---
Test code, compiled for arm/aarch64 with -O1/-O2/-O3/-Os/-Ofast:
```
#include <arm_neon.h>
void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}
void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```
arm:
```
f4:
vld2.32 {d16-d19}, [r1]
vst1.64 {d16-d19}, [r0:64]
bx lr
f5:
vst1.64 {d0-d1}, [r0:64]
vstr d2, [r0, #16]
vstr d3, [r0, #24]
bx lr
```
aarch64:
```
f4:
ld2 {v0.4s - v1.4s}, [x1]
str q0, [x0]
str q1, [x0, 16]
ret
f5:
str q0, [x0]
str q1, [x0, 16]
ret
```
For arm, it seems that `f5` could follow `f4` and use a single
`vst1.64 {d0-d3}, [r0:64]` instead. For aarch64, both functions should have
used a `stp q0, q1, [x0]`.
Clang produces what I expected on aarch64, but on arm it only uses the pair
store instruction, which costs one more instruction for `f4` and one fewer for
`f5`. (I'm not sure why GCC decided to use a pair store followed by two single
stores.)
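To illustrate why one combined store should be enough, here is a minimal portable C sketch. It is not part of the test case above. It assumes `float32x4x2_t` is laid out as two 16-byte vectors back to back, which matches the offsets 0 and 16 in the generated code. `f32x4x2_model` and `f5_model` are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for float32x4x2_t: two 4-float vectors, back to back.
   Assumes a 4-byte float, as on arm and aarch64. */
typedef struct {
    float val[2][4];
} f32x4x2_model;

/* The second half starts exactly where the first one ends, with no padding,
   so the two stores in f5 cover one contiguous 32-byte region. */
static_assert(offsetof(f32x4x2_model, val[1]) == 16, "second half at offset 16");
static_assert(sizeof(f32x4x2_model) == 32, "no padding between or after halves");

/* Portable model of f5: two 16-byte copies to offsets 0 and 16 of *p. */
void f5_model(f32x4x2_model *p, const float v1[4], const float v2[4])
{
    memcpy(p->val[0], v1, sizeof p->val[0]);
    memcpy(p->val[1], v2, sizeof p->val[1]);
}
```

Because the two destinations are adjacent and the sources are in consecutive registers, a single pair or multi-register store can write the same 32 bytes.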
As with PR89606, this optimization should happen at least at `-Os`, if not
at all other optimization levels.
Tested with 8.2.1 on arm and 8.3.0 on aarch64.