https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833
--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
>It seems the very bad code generation is mostly from constructing the V4HImode vectors going via GPRs with shifts and ORs.

On x86_64 that is true, but on aarch64 that is definitely not true:

        fmov    s31, w1
        add     x1, x0, 2
        dup     v30.4h, v31.h[0]
        ld1     {v31.h}[1], [x1]
        add     x1, x0, 4
        ld1     {v31.h}[2], [x1]
        add     x1, x0, 6
        ld1     {v30.h}[0], [x0]
        ld1     {v31.h}[3], [x1]
        mul     v31.4h, v31.4h, v30.4h
        str     d31, [x0]

That is reasonable for the `{t[0], tt, tt, tt}` (one dup, followed by one ld1) and `{tt, t[1], t[2], t[3]}` (one fmov followed by 3 ld1s) vector constructions.

But the whole thing could have been and should have been just:
```
        ldr     d31, [x0]
        fmov    s7, w1
        mul     v31.4h, v31.4h, v7.h[0]
        str     d31, [x0]
```

Note the dup is part of the mul instruction here.
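
For readers without the testcase at hand, here is a rough sketch of the scalar computation that the two constructors above appear to implement, assuming the multiply is lane-wise: every product pairs one `t[i]` with `tt`, so the net effect is each of the four elements of `t` scaled by `tt`. The function name `scale4` is hypothetical, and this is an illustration only, not the testcase attached to the bug.

```c
/* Hypothetical illustration, not the testcase from the bug report.
 * Lane-wise, {t[0], tt, tt, tt} * {tt, t[1], t[2], t[3]} yields
 * {t[0]*tt, t[1]*tt, t[2]*tt, t[3]*tt}, i.e. four 16-bit elements
 * each multiplied by the same scalar and stored back in place. */
void scale4(short *t, short tt)
{
    for (int i = 0; i < 4; i++) {
        /* The multiply is done in int after promotion and the result
         * is narrowed back to short, matching a V4HImode multiply
         * for values that fit in 16 bits. */
        t[i] = (short)(t[i] * tt);
    }
}
```

Seen this way, the four-instruction sequence is the natural form: one 64-bit load of `t[0..3]`, one move of `tt` into a vector register, a multiply by element, and one 64-bit store.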