https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116825
            Bug ID: 116825
           Summary: aarch64: unnecessary vector perm combination
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: fxue at os dot amperecomputing.com
  Target Milestone: ---

For a case like:

#include <arm_neon.h>

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

void foo(v16qi v0, v16qi v1, v16qi *result)
{
    v16qi t0 = vuzp1q_u8(v0, v1);
    v16qi t1 = vuzp1q_u8(t0, t0);
    *result = t1;
}

the two simple "uzp1" permutations are combined into one, but the resulting
permutation is irregular with respect to the aarch64 ISA, so it has to be
mapped to an inefficient "tbl" instruction that needs an extra load to fetch
the vector shuffle indices:

        adrp    x1, .LC0
        ldr     q31, [x1, #:lo12:.LC0]    // vector shuffle indices
        tbl     v0.16b, {v0.16b - v1.16b}, v31.16b
        str     q0, [x0]

Actually, the codegen could be as simple as:

        uzp1    v0.16b, v0.16b, v1.16b
        uzp1    v0.16b, v0.16b, v0.16b
        str     q0, [x0]
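
For reference, here is a minimal sketch (not part of the original report) of
what the combined permutation looks like. Folding the two uzp1 operations
yields a single shuffle whose index pattern, written out with GCC's
__builtin_shufflevector, is the doubled {0, 4, 8, 12, 16, 20, 24, 28}
sequence below; this matches none of the single AArch64 permute instructions
(uzp1/uzp2, zip1/zip2, trn1/trn2, ext, rev), which is why the backend falls
back to tbl. The function name foo_fused is hypothetical.

/* Hypothetical illustration: foo_fused should be behaviorally
   equivalent to foo above, assuming the two uzp1 calls are folded
   into one shuffle.  Indices 0-15 select bytes of v0, 16-31 bytes
   of v1.  */
typedef unsigned char v16qi __attribute__ ((vector_size (16)));

void foo_fused(v16qi v0, v16qi v1, v16qi *result)
{
    /* uzp1(v0, v1) keeps the even bytes: v0[0,2,...,14], v1[0,2,...,14].
       Applying uzp1(t0, t0) then keeps every fourth byte of the original
       inputs, repeated twice -- the irregular pattern below.  */
    *result = __builtin_shufflevector (v0, v1,
                                       0, 4, 8, 12, 16, 20, 24, 28,
                                       0, 4, 8, 12, 16, 20, 24, 28);
}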