https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98792
Bug ID: 98792
Summary: Fail to use SHRN instructions for narrowing shift on
aarch64
Product: gcc
Version: unknown
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64
#define N 1024
unsigned short res[N];
unsigned int in[N];

void
foo (void)
{
  for (int i = 0; i < N; i++)
    res[i] = in[i] >> 3;
}
With -O3 -mcpu=neoverse-n1 on aarch64, GCC generates the loop:
.L2:
        ldp     q1, q0, [x0]
        add     x0, x0, 32
        ushr    v1.4s, v1.4s, 3
        ushr    v0.4s, v0.4s, 3
        xtn     v2.4h, v1.4s
        xtn2    v2.8h, v0.4s
        str     q2, [x1], 16
        cmp     x0, x2
        bne     .L2
It could be using the SHRN narrowing shift instruction instead, which performs
the shift and the narrowing in one instruction. LLVM can do it (some other
inefficiencies aside):
.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add     x11, x10, x8
        ldp     q0, q1, [x11]
        add     x8, x8, #32             // =32
        cmp     x8, #1, lsl #12         // =4096
        shrn    v0.4h, v0.4s, #3
        shrn    v1.4h, v1.4s, #3
        stp     d0, d1, [x9, #-8]
        add     x9, x9, #16             // =16
        b.ne    .LBB0_1
Some backend patterns can probably handle it, but maybe the vectoriser can do
something useful earlier as well?
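
For illustration only, here is a rough sketch of the code shape being asked
for, written by hand with ACLE intrinsics. It is not a proposed patch. The
function names (narrow_shift_ref, narrow_shift_shrn, narrow_shift_check) and
the SHIFT macro are made up for this example. It assumes vshrn_n_u32 and
vshrn_high_n_u32 map to SHRN and SHRN2, which is what one would expect on
AArch64. On other targets the intrinsics path is compiled out and only the
scalar loop runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

#define N 1024
#define SHIFT 3  /* hypothetical name; matches the ">> 3" in the testcase */

/* Scalar reference: shift right, then truncate to 16 bits.
   This is the semantics of the loop in foo ().  */
static void
narrow_shift_ref (uint16_t *dst, const uint32_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] = (uint16_t) (src[i] >> SHIFT);
}

/* Hand-written version: one narrowing shift per input vector instead of
   a USHR followed by XTN/XTN2.  */
static void
narrow_shift_shrn (uint16_t *dst, const uint32_t *src, size_t n)
{
  size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
  for (; i + 8 <= n; i += 8)
    {
      uint32x4_t lo = vld1q_u32 (src + i);
      uint32x4_t hi = vld1q_u32 (src + i + 4);
      /* Expected to become SHRN: shift and narrow into the low half.  */
      uint16x4_t nlo = vshrn_n_u32 (lo, SHIFT);
      /* Expected to become SHRN2: shift and narrow into the high half.  */
      uint16x8_t out = vshrn_high_n_u32 (nlo, hi, SHIFT);
      vst1q_u16 (dst + i, out);
    }
#endif
  /* Scalar tail; also the whole loop when the intrinsics are unavailable.  */
  for (; i < n; i++)
    dst[i] = (uint16_t) (src[i] >> SHIFT);
}

/* Returns 1 if both versions agree on N pseudo-random inputs.  */
static int
narrow_shift_check (void)
{
  static uint32_t in[N];
  static uint16_t ref[N], out[N];

  for (size_t i = 0; i < N; i++)
    in[i] = (uint32_t) i * 2654435761u;  /* arbitrary bit pattern */

  narrow_shift_ref (ref, in, N);
  narrow_shift_shrn (out, in, N);

  for (size_t i = 0; i < N; i++)
    if (ref[i] != out[i])
      return 0;
  return 1;
}
```

The SHRN/SHRN2 pairing fills a full 128-bit result register, so the store
stays a single str q as in the current GCC output, whereas the LLVM loop above
uses two SHRN instructions and an stp of two d registers.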