https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115531

            Bug ID: 115531
           Summary: vectorizer generates inefficient code for masked
                    conditional update loops
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
  Target Milestone: ---

The following code:

void __attribute__((noipa))
foo (char *restrict a, int *restrict b, int *restrict c, int n, int stride)
{
  if (stride <= 1)
    return;

  for (int i = 0; i < n; i++)
    {
      int res = c[i];
      int t = b[i+stride];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

generates at -O3 -g0 -mcpu=generic+sve:

.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x2, x5, lsl 2]
        ld1w    z30.s, p7/z, [x1, x5, lsl 2]
        cmpne   p15.b, p6/z, z29.b, #0
        sel     z30.s, p15, z30.s, z31.s
        st1w    z30.s, p7, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:

and makes vectorization unprofitable until very high iterations of n.
This is because the vector code has more instructions than needed.

Since it's a masked store, whenever a value is being conditionally set we don't
need the intermediate VEC_COND_EXPR.  This loop can be vectorized as:

.L3:
        ld1b    z29.s, p7/z, [x0, x5]
        ld1w    z31.s, p7/z, [x2, x5, lsl 2]
        cmpne   p4.b, p6/z, z29.b, #0
        st1w    z31.s, p4, [x2, x5, lsl 2]
        add     x5, x5, x4
        whilelo p7.s, w5, w3
        b.any   .L3
.L1:

I currently prototyped a load-to-store forward optimization in forwprop but
looking to move it into the vectorizer to cost it properly, however I'm not
entirely sure what the best way to do so is.

I can certainly fix it up during codegen but to cost it I need to do so during
analysis. I could detect it during vectorizable_condition but then the dead
load is still costed. Or I could maybe use a pattern, but unsure how to
represent the mask into the load.

Is it valid to produce a pattern with .IFN_MASK_STORE?

Reply via email to