https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89606
Bug ID: 89606
Summary: Extra mov after structure load instructions on aarch64
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---
Code to reproduce,
```
#include <arm_neon.h>
#ifdef __aarch64__
float64x2x2_t f(const double *p1, const double *p2)
{
float64x2x2_t v = vld2q_f64(p1);
return vld2q_lane_f64(p2, v, 1);
}
float32x2x2_t f2(const float *p1, const float *p2)
{
float32x2x2_t v = vld2_f32(p1);
return vld2_lane_f32(p2, v, 1);
}
#endif
void f3(float32x2x2_t *p, const float *p1, const float *p2)
{
float32x2x2_t v = vld2_f32(p1);
*p = vld2_lane_f32(p2, v, 1);
}
```
GCC produces (aarch64, -O1/-O2/-O3/-Ofast/-Os),
```
f:
ld2 {v4.2d - v5.2d}, [x0]
mov v0.16b, v4.16b
mov v1.16b, v5.16b
ld2 {v0.d - v1.d}[1], [x1]
ret
f2:
ld2 {v0.2s - v1.2s}, [x0]
mov v2.8b, v0.8b
mov v3.8b, v1.8b
ld2 {v2.s - v3.s}[1], [x1]
mov v1.8b, v3.8b
mov v0.8b, v2.8b
ret
f3:
ld2 {v2.2s - v3.2s}, [x1]
mov v0.8b, v2.8b
mov v1.8b, v3.8b
ld2 {v0.s - v1.s}[1], [x2]
stp d0, d1, [x0]
ret
```
For all three functions, none of the mov's seems necessary. Even if there's
some performance issue when reusing the registers (I highly doubt it...) at
least the `-Os` version should not have those mov's.
Clang produces what I expect in this case,
```
f:
ld2 { v0.2d, v1.2d }, [x0]
ld2 { v0.d, v1.d }[1], [x1]
ret
f2:
ld2 { v0.2s, v1.2s }, [x0]
ld2 { v0.s, v1.s }[1], [x1]
ret
f3:
ld2 { v0.2s, v1.2s }, [x1]
ld2 { v0.s, v1.s }[1], [x2]
stp d0, d1, [x0]
ret
```
Aarch32 doesn't have this issue either with GCC,
```
f3:
vld2.32 {d16-d17}, [r1]
vld2.32 {d16[1], d17[1]}, [r2]
vst1.64 {d16-d17}, [r0:64]
bx lr
```
so this seems to be aarch64 specific.