https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833

--- Comment #5 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
> It seems the very bad code generation is mostly from constructing the
> V4HImode vectors going via GPRs with shifts and ORs.

On x86_64 that is true, but on aarch64 that is definitely not true:

        fmov    s31, w1
        add     x1, x0, 2
        dup     v30.4h, v31.h[0]
        ld1     {v31.h}[1], [x1]
        add     x1, x0, 4
        ld1     {v31.h}[2], [x1]
        add     x1, x0, 6
        ld1     {v30.h}[0], [x0]
        ld1     {v31.h}[3], [x1]
        mul     v31.4h, v31.4h, v30.4h
        str     d31, [x0]

That is reasonable for the `{t[0], tt, tt, tt}` (one dup followed by one ld1) and
`{tt, t[1], t[2], t[3]}` (one fmov followed by 3 ld1s) vector constructions. But
the whole thing could and should have been just:
```
        ldr     d31, [x0]
        fmov    s7, w1
        mul     v31.4h, v31.4h, v7.h[0]
        str     d31, [x0]
```

Note the dup is part of the mul instruction here.
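
To illustrate why the short sequence is enough: multiplying the two constructed vectors lane by lane gives `t[i] * tt` in every lane, so the operation is just a vector-by-scalar multiply. A minimal C sketch of that equivalence follows. The function names and the use of `uint16_t` are assumptions made for illustration; this is not the test case from the bug report.

```c
#include <stdint.h>

/* Hypothetical helper: builds the two vectors as described above,
   {t[0], tt, tt, tt} and {tt, t[1], t[2], t[3]}, and multiplies
   them lane by lane, storing the result back into t. */
void mul_constructed(uint16_t t[4], uint16_t tt)
{
    uint16_t a[4] = { t[0], tt, tt, tt };
    uint16_t b[4] = { tt, t[1], t[2], t[3] };
    for (int i = 0; i < 4; i++)
        /* Widen to 32 bits so the product is computed without
           signed overflow, then truncate to 16 bits. */
        t[i] = (uint16_t)((uint32_t)a[i] * b[i]);
}

/* Hypothetical helper: the same result written as one load,
   one multiply by a broadcast scalar, and one store. */
void mul_by_scalar(uint16_t t[4], uint16_t tt)
{
    for (int i = 0; i < 4; i++)
        t[i] = (uint16_t)((uint32_t)t[i] * tt);
}
```

Both functions leave the same values in `t`, which is what makes the single `ldr`/`mul`/`str` form a valid replacement.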
