https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96463

            Bug ID: 96463
           Summary: [SVE] Optimise svld1rq from vectors
           Product: gcc
           Version: 11.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*-*-*

The code:

#include <arm_sve.h>
#include <arm_neon.h>

svint32_t
foo (int32x4_t x)
{
  return svld1rq (svptrue_b8 (), &x[0]);
}

currently generates:

        sub     sp, sp, #16
        ptrue   p0.b, all
        str     q0, [sp]
        ld1rqw  z0.s, p0/z, [sp]
        add     sp, sp, 16
        ret

but we should instead be able to generate:

        dup     z0.q, z0.q[0]

(at least on little-endian targets).  Perhaps svld1rq_impl should
lower the call to a VEC_PERM_EXPR if the pointer argument is based
on a vector.  (And perhaps it should do so more generally, although
that would need testing.)
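
As a rough illustration of the permute such a lowering would need,
the effect of the call on a 128-bit input can be modelled in portable
C.  This is only a sketch under stated assumptions: an all-true
predicate, little-endian lane order, and a result of n_lanes 32-bit
elements.  The names model_ld1rq_s32 and n_lanes are hypothetical and
are not part of GCC or the ACLE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model, not GCC code: fill an N_LANES-element
   result by repeating the four 32-bit lanes of X.  Result lane I
   reads X[I % 4], so the corresponding permute selector would be
   { 0, 1, 2, 3, 0, 1, 2, 3, ... }.  */
static void
model_ld1rq_s32 (int32_t *result, size_t n_lanes, const int32_t x[4])
{
  for (size_t i = 0; i < n_lanes; ++i)
    result[i] = x[i % 4];
}
```

With n_lanes equal to 8 (a 256-bit vector), an input of { 10, 20, 30,
40 } gives { 10, 20, 30, 40, 10, 20, 30, 40 }.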
