https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98792
Bug ID: 98792
Summary: Fail to use SHRN instructions for narrowing shift on aarch64
Product: gcc
Version: unknown
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

#define N 1024
unsigned short res[N];
unsigned int in[N];

void
foo (void)
{
  for (int i = 0; i < N; i++)
    res[i] = in[i] >> 3;
}

with -O3 -mcpu=neoverse-n1 on aarch64 generates the loop:
.L2:
        ldp     q1, q0, [x0]
        add     x0, x0, 32
        ushr    v1.4s, v1.4s, 3
        ushr    v0.4s, v0.4s, 3
        xtn     v2.4h, v1.4s
        xtn2    v2.8h, v0.4s
        str     q2, [x1], 16
        cmp     x0, x2
        bne     .L2

It could be using the SHRN narrowing shift instruction instead, which does the shift and the narrowing in one instruction rather than a USHR followed by XTN/XTN2.
LLVM can do it (some other inefficiencies aside):
.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add     x11, x10, x8
        ldp     q0, q1, [x11]
        add     x8, x8, #32             // =32
        cmp     x8, #1, lsl #12         // =4096
        shrn    v0.4h, v0.4s, #3
        shrn    v1.4h, v1.4s, #3
        stp     d0, d1, [x9, #-8]
        add     x9, x9, #16             // =16
        b.ne    .LBB0_1

Some backend patterns can probably handle it, but maybe the vectoriser can do something useful earlier as well?
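For illustration only, here is a rough sketch of what the desired code generation corresponds to at the source level, written with the ACLE intrinsics vshrn_n_u32 and vshrn_high_n_u32 (documented as mapping to SHRN and SHRN2). The function names foo_scalar, foo_shrn and check_shrn_matches_scalar are made up for this example and are not part of the report. The sketch assumes the element count is a multiple of 8, and on non-AArch64 targets it simply falls back to the scalar loop. Unlike the LLVM output quoted above, it pairs SHRN with SHRN2 so that one 128-bit store can be used.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

#define N 1024

/* Scalar reference: the same computation as the loop in the report.  */
void
foo_scalar (uint16_t *dst, const uint32_t *src, int n)
{
  for (int i = 0; i < n; i++)
    dst[i] = (uint16_t) (src[i] >> 3);
}

/* Hand-written narrowing-shift version (hypothetical helper).
   Assumes n is a multiple of 8.  */
void
foo_shrn (uint16_t *dst, const uint32_t *src, int n)
{
#if defined(__aarch64__)
  for (int i = 0; i < n; i += 8)
    {
      uint32x4_t lo = vld1q_u32 (src + i);
      uint32x4_t hi = vld1q_u32 (src + i + 4);
      /* Shift right by 3 and keep the low 16 bits of each lane.  */
      uint16x4_t nlo = vshrn_n_u32 (lo, 3);           /* SHRN   */
      /* Same for the second vector, written to the upper half.  */
      uint16x8_t out = vshrn_high_n_u32 (nlo, hi, 3); /* SHRN2  */
      vst1q_u16 (dst + i, out);
    }
#else
  /* No Neon available: fall back to the scalar loop.  */
  foo_scalar (dst, src, n);
#endif
}

/* Check that both versions agree on a simple multiplicative pattern.  */
void
check_shrn_matches_scalar (void)
{
  static uint32_t in[N];
  static uint16_t a[N], b[N];

  for (int i = 0; i < N; i++)
    in[i] = (uint32_t) i * 2654435761u;

  foo_scalar (a, in, N);
  foo_shrn (b, in, N);

  for (int i = 0; i < N; i++)
    assert (a[i] == b[i]);
}
```

Calling check_shrn_matches_scalar () from a test driver confirms that the narrowing shift gives the same results as the shift-then-truncate loop, which is the equivalence a backend pattern or vectoriser transformation would rely on.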