https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89607
Bug ID: 89607
Summary: Missing optimization for store of multiple registers on arm and aarch64
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Test code, compiled for arm/aarch64 with -O1/-O2/-O3/-Os/-Ofast:

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

arm:

```
f4:
        vld2.32 {d16-d19}, [r1]
        vst1.64 {d16-d19}, [r0:64]
        bx      lr
f5:
        vst1.64 {d0-d1}, [r0:64]
        vstr    d2, [r0, #16]
        vstr    d3, [r0, #24]
        bx      lr
```

aarch64:

```
f4:
        ld2     {v0.4s - v1.4s}, [x1]
        str     q0, [x0]
        str     q1, [x0, 16]
        ret
f5:
        str     q0, [x0]
        str     q1, [x0, 16]
        ret
```

For arm, it seems that f5 could follow f4 and use a single `vst1.64 {d0-d3}, [r0:64]` instead of one `vst1.64` followed by two `vstr`.

For aarch64, both functions should have used `stp q0, q1, [x0]` in place of the two `str` instructions.

Clang produces what I expected on aarch64, but on arm it only uses pair store instructions, which costs one more instruction for `f4` and one fewer for `f5`. (I'm not sure why GCC decided to use a pair store and then two single stores.)

Similar to pr89606, this optimization should at least happen with `-Os`, if not at all other optimization levels.

Tested with 8.2.1 on arm and 8.3.0 on aarch64.
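The merge requested above is valid because the two halves of `float32x4x2_t` sit back to back in memory. The following is a minimal host-portable sketch of that argument, not the NEON code itself: `vec4_t`, `vec4x2_t`, `store_two`, `store_pair` and `halves_are_contiguous` are hypothetical stand-ins introduced here for illustration so the example builds without `arm_neon.h`. It says nothing about which instructions a given compiler emits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for float32x4_t and float32x4x2_t (assumption:
 * same 16-byte payload per half; the real types are NEON vector types). */
typedef struct { float lane[4]; } vec4_t;
typedef struct { vec4_t val[2]; } vec4x2_t;

/* Mirrors f5 from the report: two adjacent 16-byte stores. */
void store_two(vec4x2_t *p, vec4_t v1, vec4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}

/* Same effect as store_two, written as one 32-byte store. This is the
 * portable analogue of a single stp / vst1.64 {d0-d3}. */
void store_pair(vec4x2_t *p, vec4_t v1, vec4_t v2)
{
    vec4x2_t both = { { v1, v2 } };
    memcpy(p, &both, sizeof both);
}

/* Returns 1 when the second half starts exactly where the first ends,
 * which is the precondition for merging the two stores. */
int halves_are_contiguous(void)
{
    return offsetof(vec4x2_t, val) == 0
        && sizeof(vec4x2_t) == 2 * sizeof(vec4_t);
}

/* Returns 1 when both store variants leave identical bytes in memory. */
int stores_agree(vec4_t v1, vec4_t v2)
{
    vec4x2_t a, b;
    assert(halves_are_contiguous());
    store_two(&a, v1, v2);
    store_pair(&b, v1, v2);
    return memcmp(&a, &b, sizeof a) == 0;
}
```

Calling `stores_agree` with any two vectors checks that the split and merged forms write the same 32 bytes, which is the equivalence the report relies on when it asks for `stp q0, q1, [x0]`.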