https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116825

            Bug ID: 116825
           Summary: aarch64: unnecessary vector perm combination
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

For the following test case:

#include <arm_neon.h>

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

void foo(v16qi v0, v16qi v1, v16qi *result)
{
     v16qi t0 = vuzp1q_u8(v0, v1);
     v16qi t1 = vuzp1q_u8(t0, t0);

     *result = t1;
}

The two simple "uzp1" permutations are combined into one, but the resulting
permutation is irregular with respect to the AArch64 ISA, so it has to be
mapped to an inefficient "tbl" instruction, which requires an extra load to
fetch the vector shuffle indices.

        adrp    x1, .LC0
        ldr     q31, [x1, #:lo12:.LC0]              # vector shuffle indices
        tbl     v0.16b, {v0.16b - v1.16b}, v31.16b  
        str     q0, [x0]
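
For reference, the fused permutation is equivalent to the single shuffle
below, written with GCC's clang-compatible __builtin_shufflevector (a sketch
for illustration only; foo_combined is a made-up name, not in the report):

#include <arm_neon.h>

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

/* What GCC forms by fusing the two perms:
   uzp1(uzp1(v0, v1), uzp1(v0, v1)) selects every fourth byte of the
   v0:v1 concatenation and repeats that pattern in both halves.  No
   single AArch64 permute instruction matches this index vector,
   hence the "tbl" fallback with a constant-pool load.  */
void foo_combined (v16qi v0, v16qi v1, v16qi *result)
{
  *result = __builtin_shufflevector (v0, v1,
      0, 4, 8, 12, 16, 20, 24, 28,
      0, 4, 8, 12, 16, 20, 24, 28);
}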

Actually, the codegen could be as simple as:

        uzp1    v0.16b, v0.16b, v1.16b
        uzp1    v0.16b, v0.16b, v0.16b
        str     q0, [x0]
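
For completeness, a small self-checking harness (hypothetical, not part of
the original report) confirms that the two-uzp1 sequence computes the same
bytes as the fused permutation shown above:

#include <arm_neon.h>
#include <stdio.h>
#include <string.h>

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

static void foo (v16qi v0, v16qi v1, v16qi *result)
{
  v16qi t0 = vuzp1q_u8 (v0, v1);
  v16qi t1 = vuzp1q_u8 (t0, t0);
  *result = t1;
}

int main (void)
{
  v16qi v0, v1, got, want;

  /* Byte k of the v0:v1 concatenation holds the value k.  */
  for (int i = 0; i < 16; i++)
    {
      v0[i] = i;
      v1[i] = 16 + i;
    }

  foo (v0, v1, &got);

  /* Fused permutation: every fourth byte of v0:v1, repeated twice.  */
  for (int i = 0; i < 8; i++)
    want[i] = want[i + 8] = 4 * i;

  puts (memcmp (&got, &want, 16) == 0 ? "OK" : "MISMATCH");
  return 0;
}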
