https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116825
Bug ID: 116825
Summary: aarch64: unnecessary vector perm combination
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: fxue at os dot amperecomputing.com
Target Milestone: ---
For the following test case:
#include <arm_neon.h>

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

void foo(v16qi v0, v16qi v1, v16qi *result)
{
  v16qi t0 = vuzp1q_u8(v0, v1);
  v16qi t1 = vuzp1q_u8(t0, t0);
  *result = t1;
}
Two simple "uzp1" perms are combined together, but the resulted perm is
irregular regarding to aarch64 ISA, so it has to be mapped to an inefficient
"tbl" instruction that needs an extra load to fetch "vector shuffle indices".
        adrp    x1, .LC0
        ldr     q31, [x1, #:lo12:.LC0]          # vector shuffle indices
        tbl     v0.16b, {v0.16b - v1.16b}, v31.16b
        str     q0, [x0]
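For reference, the combined shuffle mask (the contents of .LC0) follows from
composing the two uzp1 selections. The sketch below just recomputes the
indices in plain C (variable names are illustrative, not taken from GCC):

#include <stdio.h>

int main(void)
{
  /* uzp1(v0, v1) keeps the even-indexed bytes of the 32-byte
     concatenation v0:v1, i.e. indices 0, 2, 4, ..., 30.  */
  unsigned char t0_idx[16];
  for (int i = 0; i < 16; i++)
    t0_idx[i] = 2 * i;

  /* uzp1(t0, t0) keeps the even-indexed bytes of t0:t0, which is
     t0[0], t0[2], ..., t0[14] repeated twice, so each output byte j
     reads concat(v0,v1)[(4 * j) % 32].  */
  unsigned char merged[16];
  for (int j = 0; j < 16; j++)
    merged[j] = t0_idx[(2 * j) % 16];

  /* Prints: 0 4 8 12 16 20 24 28 0 4 8 12 16 20 24 28.  A stride-4
     gather across two source registers, repeated twice, matches no
     single uzp/zip/trn/ext pattern, hence the fallback to tbl.  */
  for (int j = 0; j < 16; j++)
    printf("%u%c", merged[j], j == 15 ? '\n' : ' ');
  return 0;
}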
Actually, the codegen could be as simple as:
        uzp1    v0.16b, v0.16b, v1.16b
        uzp1    v0.16b, v0.16b, v0.16b
        str     q0, [x0]
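As a side note, until the permute combination (or the target's handling of
the combined mask) is improved, one source-level workaround is to hide the
intermediate value behind an empty inline asm. This is only a sketch, not
part of the reported test case:

#include <arm_neon.h>

typedef unsigned char v16qi __attribute__ ((vector_size (16)));

void foo_workaround(v16qi v0, v16qi v1, v16qi *result)
{
  v16qi t0 = vuzp1q_u8(v0, v1);
  /* Empty asm with a "+w" (FP/SIMD register) constraint makes t0
     opaque to the GIMPLE-level permute combiner, so the two uzp1s
     are emitted separately instead of being merged into a tbl.  */
  __asm__ ("" : "+w" (t0));
  v16qi t1 = vuzp1q_u8(t0, t0);
  *result = t1;
}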