https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531
            Bug ID: 115531
           Summary: vectorizer generates inefficient code for masked
                    conditional update loops
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following code:

void __attribute__((noipa))
foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
{
  if (stride <= 1)
    return;

  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i+stride];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

generates at -O3 -g0 -mcpu=generic+sve:

.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x2, x5, lsl 2]
        ld1w    z30.s, p7/z, [x1, x5, lsl 2]
        cmpne   p15.b, p6/z, z29.b, #0
        sel     z30.s, p15, z30.s, z31.s
        st1w    z30.s, p7, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:

and this makes vectorization unprofitable until n is very large. The reason is
that the vector code has more instructions than needed. Since the store is
already a masked store, whenever a value is being conditionally set we don't
need the intermediate VEC_COND_EXPR: the condition can be folded into the
store mask instead.

This loop can be vectorized as:

.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x2, x5, lsl 2]
        cmpne   p4.b, p6/z, z29.b, #0
        st1w    z31.s, p4, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:

I have prototyped a load-to-store forwarding optimization in forwprop, but I
am looking to move it into the vectorizer so that it can be costed properly.
However, I'm not entirely sure what the best way to do so is.

I can certainly fix it up during codegen, but to cost it I need to do so
during analysis. I could detect it during vectorizable_condition, but then the
dead load is still costed.

Or I could maybe use a pattern, but I am unsure how to represent the mask into
the load. Is it valid to produce a pattern with .IFN_MASK_STORE?
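
As a source-level illustration of the transformation described above, here is a minimal sketch of the two forms of the loop. It is not the vectorizer change itself. The names update_select, update_masked and forms_agree are hypothetical and do not appear in the report. The first function mirrors the reported loop (load c[i], select, store unconditionally). The second writes c[i] only where a[i] is nonzero, which corresponds to folding the condition into the store mask.

```c
#include <assert.h>
#include <string.h>

/* Form 1: the loop as written in the report. Every c[i] is loaded,
   blended with t, and stored back unconditionally. */
void update_select(const char *restrict a, const int *restrict b,
                   int *restrict c, int n, int stride)
{
    if (stride <= 1)
        return;
    for (int i = 0; i < n; i++) {
        int res = c[i];
        int t = b[i + stride];
        if (a[i] != 0)
            res = t;
        c[i] = res;
    }
}

/* Form 2: the store itself is conditional, so the old value of c[i]
   never needs to be loaded. b[i + stride] is still read on every
   iteration, as in the original loop. */
void update_masked(const char *restrict a, const int *restrict b,
                   int *restrict c, int n, int stride)
{
    if (stride <= 1)
        return;
    for (int i = 0; i < n; i++) {
        int t = b[i + stride];
        if (a[i] != 0)
            c[i] = t;
    }
}

/* Hypothetical helper: runs both forms on the same small input and
   returns 1 if the resulting c arrays are identical. */
int forms_agree(void)
{
    enum { N = 8, STRIDE = 2 };
    const char a[N] = {0, 1, 0, 0, 1, 1, 0, 1};
    int b[N + STRIDE];
    int c1[N], c2[N];

    assert(STRIDE > 1); /* both forms are no-ops otherwise */

    for (int i = 0; i < N + STRIDE; i++)
        b[i] = 100 + i;
    for (int i = 0; i < N; i++)
        c1[i] = c2[i] = -i;

    update_select(a, b, c1, N, STRIDE);
    update_masked(a, b, c2, N, STRIDE);
    return memcmp(c1, c2, sizeof c1) == 0;
}
```

The two forms leave c with the same contents because an element whose condition is false is rewritten with its own old value in the first form and simply left alone in the second.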