https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88838
Bug ID: 88838
Summary: [SVE] Use 32-bit WHILELO in LP64 mode
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: enhancement
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Target Milestone: ---
Compiling this test with -O3 -march=armv8-a+sve:
void
f (int *restrict x, int *restrict y, int *restrict z, int n)
{
  for (int i = 0; i < n; i += 1)
    x[i] = y[i] + z[i];
}
produces:
f:
.LFB0:
        .cfi_startproc
        cmp     w3, 0
        ble     .L1
        mov     x4, 0
        sxtw    x3, w3
        whilelo p0.s, xzr, x3
        .p2align 3,,7
.L3:
        ld1w    z1.s, p0/z, [x1, x4, lsl 2]
        ld1w    z0.s, p0/z, [x2, x4, lsl 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x0, x4, lsl 2]
        incw    x4
        whilelo p0.s, x4, x3
        bne     .L3
.L1:
        ret
We could (and should) avoid the SXTW by using WHILELO on W registers instead of
X registers.
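One possible shape for the improved code (a hand-written sketch of the desired output, not actual compiler output) keeps the 64-bit IV in x4 for addressing but feeds its low 32 bits straight to WHILELO, so the SXTW disappears:

```
f:
        cmp     w3, 0
        ble     .L1
        mov     x4, 0
        whilelo p0.s, wzr, w3        // 32-bit limit used directly; no SXTW
.L3:
        ld1w    z1.s, p0/z, [x1, x4, lsl 2]
        ld1w    z0.s, p0/z, [x2, x4, lsl 2]
        add     z0.s, z0.s, z1.s
        st1w    z0.s, p0, [x0, x4, lsl 2]
        incw    x4
        whilelo p0.s, w4, w3         // truncated 64-bit IV vs. W-register limit
        bne     .L3
.L1:
        ret
```

This relies on n being a nonnegative signed int at this point (the BLE has already rejected n <= 0), so the sign-extended limit equals its low 32 bits and the truncated IV never wraps before reaching it.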
vect_verify_full_masking checks which IV widths are supported for WHILELO but
prefers to go to Pmode width. This is because using Pmode allows ivopts to
reuse the IV for indices (as in the loads and store above). However, it would
be better to use a 32-bit WHILELO with a truncated 64-bit IV if:
(a) the limit is extended from 32 bits, and
(b) the detection loop in vect_verify_full_masking confirms that using a 32-bit
IV would be correct.
The case to avoid is one in which a 32-bit IV might wrap (see
vect_set_loop_masks_directly); there it is better to stick with 64-bit
WHILELOs.