https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96463
Bug ID: 96463
Summary: [SVE] Optimise svld1rq from vectors
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Target Milestone: ---
Target: aarch64*-*-*
The code:
#include <arm_sve.h>
#include <arm_neon.h>
svint32_t
foo (int32x4_t x)
{
  return svld1rq (svptrue_b8 (), &x[0]);
}
currently generates:
        sub     sp, sp, #16
        ptrue   p0.b, all
        str     q0, [sp]
        ld1rqw  z0.s, p0/z, [sp]
        add     sp, sp, 16
        ret
but we should instead be able to generate:
        dup     z0.q, z0.q[0]
(at least on little-endian targets). Perhaps svld1rq_impl should
lower the call to a VEC_PERM_EXPR if the pointer argument is based
on a vector. (Perhaps it should do so more generally too, although
that would need testing.)
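
To illustrate the suggested lowering, here is a minimal sketch in plain C (not GCC internals). It models the VEC_PERM_EXPR as a single-input permute whose selector for lane i is i % 4, and compares it with a model of the current store-and-reload sequence. The function names and the fixed 8-lane length (VL_LANES) are assumptions made for illustration only: real SVE vectors are length-agnostic. The sketch also assumes little-endian lane order, where lane i of the Advanced SIMD vector is element i in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: a fixed 256-bit vector (8 x int32) stands in for the
   runtime SVE vector length.  QUAD_LANES is one 128-bit quadword.  */
#define QUAD_LANES 4
#define VL_LANES 8

/* Model of the current code: spill the 128-bit vector to a stack slot,
   then replicate that quadword across the result, as LD1RQW does with
   an all-true predicate.  */
void
model_ld1rq_via_stack (const int32_t x[QUAD_LANES], int32_t out[VL_LANES])
{
  int32_t stack_slot[QUAD_LANES];
  memcpy (stack_slot, x, sizeof stack_slot);          /* str q0, [sp] */
  for (int q = 0; q < VL_LANES / QUAD_LANES; q++)     /* ld1rqw ...   */
    memcpy (&out[q * QUAD_LANES], stack_slot, sizeof stack_slot);
}

/* Model of the proposed lowering: a single-input permute whose
   selector for lane i is i % 4, i.e. { 0, 1, 2, 3, 0, 1, 2, 3, ... }.  */
void
model_perm (const int32_t x[QUAD_LANES], int32_t out[VL_LANES])
{
  for (int i = 0; i < VL_LANES; i++)
    out[i] = x[i % QUAD_LANES];
}

/* Return 1 if both models produce the same lanes for X.  */
int
models_agree (const int32_t x[QUAD_LANES])
{
  int32_t a[VL_LANES], b[VL_LANES];
  model_ld1rq_via_stack (x, a);
  model_perm (x, b);
  return memcmp (a, b, sizeof a) == 0;
}

/* Small self-check over one sample input.  */
void
self_check (void)
{
  const int32_t x[QUAD_LANES] = { 10, 20, 30, 40 };
  assert (models_agree (x));
}
```

In this model both routines fill the result with the four input lanes repeated, which is the equivalence the lowering would rely on. Whether the same selector is correct on big-endian targets is not covered by this sketch.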