https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89614

            Bug ID: 89614
           Summary: Missing optimization for store of multiple registers
                    on arm
           Product: gcc
           Version: 8.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: yyc1992 at gmail dot com
  Target Milestone: ---

Separated from pr89607 as requested. The test code below produces the same
result at any non-zero optimization level:

```
#include <arm_neon.h>

void f4(float32x4x2_t *p, const float *p1)
{
    *p = vld2q_f32(p1);
}

void f5(float32x4x2_t *p, float32x4_t v1, float32x4_t v2)
{
    p->val[0] = v1;
    p->val[1] = v2;
}
```

```
f4:
        vld2.32 {d16-d19}, [r1]
        vst1.64 {d16-d19}, [r0:64]
        bx      lr
f5:
        vst1.64 {d0-d1}, [r0:64]
        vstr    d2, [r0, #16]
        vstr    d3, [r0, #24]
        bx      lr
```

I believe `f5` should use a single `vst1.64 {d0-d3}, [r0:64]` just like `f4`.
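
For illustration, the expected output for `f5` would look roughly like the
listing below. This is a hand-written sketch, not actual compiler output. It
assumes the same hard-float argument registers (`v1` in `d0-d1`, `v2` in
`d2-d3`) and the same `:64` alignment hint that appear in the listing above.

```
f5:
        @ hypothetical: one store covering both vectors
        vst1.64 {d0-d3}, [r0:64]
        bx      lr
```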

If doing that is for some reason bad for performance (which I doubt), it
should at least be used for -Os, where it saves two instructions.
