https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96463
            Bug ID: 96463
           Summary: [SVE] Optimise svld1rq from vectors
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*-*-*

The code:

  #include <arm_sve.h>
  #include <arm_neon.h>

  svint32_t
  foo (int32x4_t x)
  {
    return svld1rq (svptrue_b8 (), &x[0]);
  }

currently generates:

        sub     sp, sp, #16
        ptrue   p0.b, all
        str     q0, [sp]
        ld1rqw  z0.s, p0/z, [sp]
        add     sp, sp, 16
        ret

but we should instead be able to generate:

        dup     z0.q, z0.q[0]

(at least on little-endian targets). Perhaps svld1rq_impl should
lower the call to a VEC_PERM_EXPR if the argument is based on a
vector. (And perhaps more generally, although that would need
testing.)
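
To illustrate the suggested lowering, here is a minimal portable C model
of the selector such a VEC_PERM_EXPR would presumably carry: result lane i
takes input lane i % 4, which is what replicating a 128-bit quadword of
32-bit elements amounts to. This is not GCC code and does not use
arm_sve.h. The function names and the runtime lane count n are
hypothetical; n stands in for the number of 32-bit lanes in the SVE
vector, and the sketch ignores big-endian lane numbering.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of 32-bit lanes in one 128-bit quadword. */
enum { QUAD_LANES = 4 };

/* Hypothetical helper: fill sel[0..n-1] with the permute selector
   { 0, 1, 2, 3, 0, 1, 2, 3, ... } that repeats the first quadword.  */
void
ld1rq_selector (size_t *sel, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    sel[i] = i % QUAD_LANES;
}

/* Hypothetical helper: apply a selector to a source vector, as a
   single-input permute would: dst[i] = src[sel[i]].  */
void
apply_selector (int32_t *dst, const int32_t *src,
                const size_t *sel, size_t n)
{
  for (size_t i = 0; i < n; ++i)
    dst[i] = src[sel[i]];
}
```

With n = 8 (a 256-bit vector) and an input quadword { 10, 20, 30, 40 },
the selector is { 0, 1, 2, 3, 0, 1, 2, 3 } and the result is the quadword
repeated twice, matching what the load-and-replicate sequence produces
without going through the stack.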