https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.
On x86_64 that is true, but on aarch64 it is definitely not:
```
fmov s31, w1
add x1, x0, 2
dup v30.4h, v31.h[0]
ld1 {v31.h}[1], [x1]
add x1, x0, 4
ld1 {v31.h}[2], [x1]
add x1, x0, 6
ld1 {v30.h}[0], [x0]
ld1 {v31.h}[3], [x1]
mul v31.4h, v31.4h, v30.4h
str d31, [x0]
```
That is reasonable for `{t[0], tt, tt, tt}` (one dup, followed by one ld1) and
`{tt, t[1], t[2], t[3]}` (one fmov followed by 3 ld1s) vector construction. But
the whole thing could have been and should have been just:
```
ldr d31, [x0]
fmov s7, w1
mul v31.4h, v31.4h, v7.h[0]
str d31, [x0]
```
Note the dup is part of the mul instruction here: the by-element form of mul
(`v7.h[0]`) broadcasts the scalar lane itself.
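To make the equivalence concrete, here is a small scalar C model. It is only a sketch and not the testcase from this bug: the function names (`mul_two_constructed`, `mul_by_scalar`, `same_result`, `self_check`) are hypothetical, and `t` and `tt` are assumed to be a 4-element `int16_t` array and an `int16_t` scalar, matching the V4HImode vectors discussed above. It shows that the lane-wise product of `{t[0], tt, tt, tt}` and `{tt, t[1], t[2], t[3]}` is just `t[i] * tt` in every lane, which is why a single load plus a multiply by scalar is enough.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Models the current code: build the two vectors separately,
   then multiply them lane by lane. */
static void mul_two_constructed(int16_t *t, int16_t tt)
{
    int16_t a[4] = { t[0], tt, tt, tt };
    int16_t b[4] = { tt, t[1], t[2], t[3] };
    for (int i = 0; i < 4; i++)
        t[i] = (int16_t)(a[i] * b[i]);
}

/* Models the suggested code: load all four lanes and
   multiply each one by the same scalar. */
static void mul_by_scalar(int16_t *t, int16_t tt)
{
    for (int i = 0; i < 4; i++)
        t[i] = (int16_t)(t[i] * tt);
}

/* Returns 1 if both forms give identical results for this input. */
static int same_result(const int16_t *in, int16_t tt)
{
    int16_t x[4], y[4];
    memcpy(x, in, sizeof x);
    memcpy(y, in, sizeof y);
    mul_two_constructed(x, tt);
    mul_by_scalar(y, tt);
    return memcmp(x, y, sizeof x) == 0;
}

/* Quick sanity check on one arbitrary input. */
static void self_check(void)
{
    const int16_t in[4] = { 3, -5, 7, 11 };
    assert(same_result(in, 2));
}
```

Lane 0 is `t[0] * tt` and lanes 1 to 3 are `tt * t[i]`; since the multiply is commutative, only the operand order differs between lanes, so nothing forces the two separate vector constructions.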