https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89614
Bug ID: 89614
Summary: Missing optimization for store of multiple registers on arm
Product: gcc
Version: 8.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: yyc1992 at gmail dot com
Target Milestone: ---

Separated from pr89607 as requested.

Test code and result, compiled with any non-zero optimization level:

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

```
f4:
        vld2.32 {d16-d19}, [r1]
        vst1.64 {d16-d19}, [r0:64]
        bx      lr
f5:
        vst1.64 {d0-d1}, [r0:64]
        vstr    d2, [r0, #16]
        vstr    d3, [r0, #24]
        bx      lr
```

I believe `f5` should use a single `vst1.64 {d0-d3}, [r0:64]`, just like `f4`. If for some reason doing that is bad for performance (doubt it...), it should at least be used for -Os.
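The request rests on the two member stores in `f5` covering one contiguous 32-byte region, which is what would let a single four-register store replace the three stores shown. A minimal portable sketch of that layout argument follows. It does not use `arm_neon.h`; instead it assumes that `float32x4x2_t` is laid out like a struct of two 16-byte vectors, and models it with GCC/Clang vector extensions. The names `vec4f`, `vec4f_x2`, `store_pair`, `store_pair_single`, and `stores_are_contiguous` are hypothetical and only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for float32x4_t: four floats in 16 bytes
   (GCC/Clang vector extension, not the real NEON type). */
typedef float vec4f __attribute__((vector_size(16)));

/* Hypothetical stand-in for float32x4x2_t: assumed to be two
   back-to-back 16-byte vectors. */
typedef struct {
    vec4f val[2];
} vec4f_x2;

/* Same shape as f5 in the report: two separate member stores. */
void store_pair(vec4f_x2 *p, vec4f v1, vec4f v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}

/* The same effect expressed as one 32-byte copy, analogous to a
   single multi-register store. */
void store_pair_single(vec4f_x2 *p, vec4f v1, vec4f v2)
{
    vec4f both[2] = { v1, v2 };
    memcpy(p, both, sizeof both);
}

/* Returns 1 when val[1] starts exactly where val[0] ends and the
   struct has no trailing padding, i.e. the stores are contiguous. */
int stores_are_contiguous(void)
{
    return offsetof(vec4f_x2, val[1])
               == offsetof(vec4f_x2, val[0]) + sizeof(vec4f)
        && sizeof(vec4f_x2) == 2 * sizeof(vec4f);
}
```

Under these assumptions, `store_pair` and `store_pair_single` leave identical bytes in `*p`. This only illustrates why merging the stores is valid at the memory level; whether the merged form is faster on a given core is a separate question that the sketch does not address.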