https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98792

            Bug ID: 98792
           Summary: Fail to use SHRN instructions for narrowing shift on
                    aarch64
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

#define N 1024
unsigned short res[N];
unsigned int in[N];

void
foo (void)
{
  for (int i = 0; i < N; i++)
    res[i] = in[i] >> 3;
}

With -O3 -mcpu=neoverse-n1 on aarch64, GCC generates the following loop:
.L2:
        ldp     q1, q0, [x0]
        add     x0, x0, 32
        ushr    v1.4s, v1.4s, 3
        ushr    v0.4s, v0.4s, 3
        xtn     v2.4h, v1.4s
        xtn2    v2.8h, v0.4s
        str     q2, [x1], 16
        cmp     x0, x2
        bne     .L2

It could use the SHRN narrowing shift instruction instead. LLVM can do it
(some other inefficiencies aside):
.LBB0_1:                                // %vector.body
                                        // =>This Inner Loop Header: Depth=1
        add     x11, x10, x8
        ldp     q0, q1, [x11]
        add     x8, x8, #32                     // =32
        cmp     x8, #1, lsl #12                 // =4096
        shrn    v0.4h, v0.4s, #3
        shrn    v1.4h, v1.4s, #3
        stp     d0, d1, [x9, #-8]
        add     x9, x9, #16                     // =16
        b.ne    .LBB0_1

Some backend patterns can probably handle it, but maybe the vectoriser can do
something useful earlier as well?
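
For illustration, here is a rough sketch of the code generation being asked
for, written by hand with ACLE intrinsics. The function name foo_shrn is made
up for this example, and pairing SHRN with SHRN2 so that one q-register store
can be used is my assumption about a reasonable target sequence, not something
either compiler emits above. The scalar fallback is only there so the snippet
builds on non-aarch64 hosts.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

#define N 1024
uint16_t res[N];
uint32_t in[N];

/* Hypothetical hand-vectorised variant of foo from the report.  */
void
foo_shrn (void)
{
#if defined(__aarch64__)
  for (int i = 0; i < N; i += 8)
    {
      uint32x4_t lo = vld1q_u32 (&in[i]);
      uint32x4_t hi = vld1q_u32 (&in[i + 4]);
      /* SHRN: shift each lane right by 3 and narrow to 16 bits,
         producing the low half of the result.  */
      uint16x4_t nlo = vshrn_n_u32 (lo, 3);
      /* SHRN2: same operation, written to the high half.  */
      uint16x8_t out = vshrn_high_n_u32 (nlo, hi, 3);
      vst1q_u16 (&res[i], out);
    }
#else
  /* Scalar fallback (assumption: only needed off aarch64).  */
  for (int i = 0; i < N; i++)
    res[i] = (uint16_t) (in[i] >> 3);
#endif
}
```

Each SHRN/SHRN2 here replaces one USHR plus one XTN/XTN2 from the GCC output
above, so the loop body would need two vector ALU instructions instead of four.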
