https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838

            Bug ID: 88838
           Summary: [SVE] Use 32-bit WHILELO in LP64 mode
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

Compiling this test with -O3 -march=armv8-a+sve:

void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 1)
    x[i] = y[i] + z[i];
}

produces:

f:
.LFB0:
        .cfi_startproc
        cmp     w3, 0
        ble     .L1
        mov     x4, 0
        sxtw    x3, w3
        whilelo p0.s, xzr, x3
        .p2align 3,,7
.L3:
        ld1w    z1.s, p0/z, [x1, x4, lsl 2]
        ld1w    z0.s, p0/z, [x2, x4, lsl 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x0, x4, lsl 2]
        incw    x4
        whilelo p0.s, x4, x3
        bne     .L3
.L1:
        ret

We could (and should) avoid the SXTW by using WHILELO on W registers instead of
X registers.
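
For illustration, the desired codegen would be along these lines (a hand-written sketch, not compiler output; the 64-bit IV in x4 is kept for the addressing modes and only truncated for the WHILELO):

        cmp     w3, 0
        ble     .L1
        mov     x4, 0
        whilelo p0.s, wzr, w3
.L3:
        ld1w    z1.s, p0/z, [x1, x4, lsl 2]
        ld1w    z0.s, p0/z, [x2, x4, lsl 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x0, x4, lsl 2]
        incw    x4
        whilelo p0.s, w4, w3
        bne     .L3
.L1:
        ret

The SXTW disappears because the 32-bit WHILELO forms read only the low 32 bits of their source registers, so w3 can be used directly as the limit.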

vect_verify_full_masking checks which IV widths are supported for WHILELO but
prefers to go to Pmode width.  This is because using Pmode allows ivopts to
reuse the IV for indices (as in the loads and store above).  However, it would
be better to use a 32-bit WHILELO with a truncated 64-bit IV if:

(a) the limit is extended from 32 bits.

(b) the detection loop in vect_verify_full_masking detects that using a 32-bit
IV would be correct.

The thing to avoid is when using a 32-bit IV might wrap (see
vect_set_loop_masks_directly).  In that case it would be better to stick with
64-bit WHILELOs.
