https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
Bug ID: 115531
Summary: vectorizer generates inefficient code for masked
conditional update loops
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---
The following code:
void __attribute__((noipa))
foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
{
  if (stride <= 1)
    return;
  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i+stride];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}
generates at -O3 -g0 -mcpu=generic+sve:
.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x2, x5, lsl 2]
        ld1w    z30.s, p7/z, [x1, x5, lsl 2]
        cmpne   p15.b, p6/z, z29.b, #0
        sel     z30.s, p15, z30.s, z31.s
        st1w    z30.s, p7, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:
which makes vectorization unprofitable until n is very large.
This is because the vector code has more instructions than needed.
Since the final store is a masked store, a value that is conditionally set does
not need the intermediate VEC_COND_EXPR: the condition can be used as the store
mask directly. This loop can be vectorized as:
.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x1, x5, lsl 2]
        cmpne   p4.b, p6/z, z29.b, #0
        st1w    z31.s, p4, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:
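For illustration only, the transformation can be stated at the source level. The sketch below is not taken from the prototype; the function names update_select and update_masked are made up for this example. The first function is the loop as written above, the second drops the load of c[i] and the select and stores only where a[i] is non-zero, which is what the shorter sequence expresses with a predicated st1w.

```c
/* Loop as written in the report: load c[i], select, store every lane.  */
void
update_select (const char *restrict a, const int *restrict b,
               int *restrict c, int n, int stride)
{
  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i + stride];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

/* Hypothetical equivalent: no load of c[i], no select.  The store is
   only performed where the condition holds (a masked store once
   vectorized).  b[i + stride] is still loaded unconditionally.  */
void
update_masked (const char *restrict a, const int *restrict b,
               int *restrict c, int n, int stride)
{
  for (int i = 0; i < n; i++)
    {
      int t = b[i + stride];
      if (a[i] != 0)
        c[i] = t;
    }
}
```

Both functions leave c in the same state for the same inputs; the second simply never touches elements of c whose condition is false.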
I have prototyped a load-to-store forwarding optimization in forwprop, but I am
looking to move it into the vectorizer so that it can be costed properly.
However, I'm not entirely sure what the best way to do so is.
I can certainly fix it up during codegen, but to cost it I need to do so during
analysis. I could detect it in vectorizable_condition, but then the dead
load is still costed. Or I could maybe use a pattern, but I am unsure how to
represent the mask for the load.
Is it valid to produce a pattern with IFN_MASK_STORE?
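As a rough model of the semantics such a pattern would need to preserve, here is one vector iteration written lane by lane in plain C. Everything in it is an assumption made for illustration: VL, mask_store, step_select and step_masked are hypothetical names and do not correspond to vectorizer internals. The point is only that a store under the mask (loop_mask & cond) of the b value gives the same memory state as a select followed by a store under loop_mask.

```c
enum { VL = 4 };  /* hypothetical number of lanes per vector */

/* Model of a masked store: lanes with a zero mask leave memory untouched.  */
static void
mask_store (int *dst, const unsigned char *mask, const int *val)
{
  for (int l = 0; l < VL; l++)
    if (mask[l])
      dst[l] = val[l];
}

/* Current shape: masked loads of c and b, a select on the condition,
   then a store under the loop mask.  */
void
step_select (const char *a, const int *b, int *c,
             const unsigned char *loop_mask)
{
  int vc[VL] = { 0 }, vb[VL] = { 0 }, res[VL];
  for (int l = 0; l < VL; l++)
    if (loop_mask[l])
      {
        vc[l] = c[l];   /* zeroing masked load of c */
        vb[l] = b[l];   /* zeroing masked load of b */
      }
  for (int l = 0; l < VL; l++)
    {
      int cond = loop_mask[l] && a[l] != 0;
      res[l] = cond ? vb[l] : vc[l];   /* the VEC_COND_EXPR */
    }
  mask_store (c, loop_mask, res);
}

/* Proposed shape: the condition is folded into the store mask, so the
   load of c and the select both disappear.  */
void
step_masked (const char *a, const int *b, int *c,
             const unsigned char *loop_mask)
{
  int vb[VL] = { 0 };
  unsigned char m[VL];
  for (int l = 0; l < VL; l++)
    {
      m[l] = loop_mask[l] && a[l] != 0;
      if (loop_mask[l])
        vb[l] = b[l];
    }
  mask_store (c, m, vb);
}
```

In this model the b pointer is assumed to be already offset by stride, as x1 appears to be in the assembly above.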