https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119577

            Bug ID: 119577
           Summary: RISC-V: Redundant vector IV roundtrip.
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: middle-end
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rdapp at gcc dot gnu.org
  Target Milestone: ---
            Target: riscv

Given this function (basically vect-early-break_133_pfa1.c):

#define SZ 1020

char string[SZ];

__attribute__ ((noipa))
char * find(int n, char c)
{
    for (int i = 1; i < n; i++) {
        if (string[i] == c)
            return &string[i];
    }
    return 0;
}

we generate the following gimple:

  <bb 5> [local count: 913217421]:
  # vect_vec_iv_.7_47 = PHI <_51(8), { 1, 2, 3, ... }(4)>
  # vectp_string.8_57 = PHI <vectp_string.8_58(8), &MEM <char[1020]> [(void *)&string + 1B](4)>
  # ivtmp_65 = PHI <ivtmp_66(8), _64(4)>
  _67 = .SELECT_VL (ivtmp_65, POLY_INT_CST [16, 16]);
  vect__1.10_60 = .MASK_LEN_LOAD (vectp_string.8_57, 8B, { -1, ... }, _59(D), _67, 0);
  mask_patt_15.11_61 = _54 == vect__1.10_60;
  vec_len_mask_62 = .VCOND_MASK_LEN (mask_patt_15.11_61, { -1, ... }, { 0, ... }, _67, 0);
  if (vec_len_mask_62 != { 0, ... })
    goto <bb 6>; [5.50%]
  else
    goto <bb 8>; [94.50%]

  <bb 6> [local count: 97691432]:
  _53 = BIT_FIELD_REF <vect_vec_iv_.7_47, 32, 0>;
  goto <bb 9>; [100.00%]

  <bb 7> [local count: 55807731]:
  _2 = (sizetype) i_35;
  _8 = &string + _2;
  goto <bb 12>; [100.00%]

  <bb 8> [local count: 862990464]:
  _49 = (int) _67;
  _50 = [vec_duplicate_expr] _49;
  _51 = vect_vec_iv_.7_47 + _50;
  vectp_string.8_58 = vectp_string.8_57 + _67;
  ivtmp_66 = ivtmp_65 - _67;
  if (ivtmp_66 != 0)
    goto <bb 5>; [94.50%]
  else
    goto <bb 12>; [5.50%]

In itself that's reasonable.  However, since the function returns string + i we
need the induction variable's value and obtain it via
vectorizable_live_operation from vect_vec_iv_.7_47.

That results in the following annotated assembly:

        vid.v   v2        # v2 = {0, 1, ...}
        vadd.vi v2,v2,1   # v2 = {1, 2, ...}
.L4:
        [..]
        vadd.vv v2,v2,v3  # v2 += vector length in current iter.
        beq     a4,zero,.L8
.L5:
        vsetvli a5,a4,e8,mf4,ta,ma  # a5 = vector length in current iter
        [..]
        vmv.v.x v3,a5  # broadcast vector length to vector
        [..]
        beq a5,zero,.L4 # latch
        vmv.x.s a5,v2  # move last vector length to GPR.
.L3:
        add     a4,a7,a5 # use last vector length
        j       .L7
.L7:
        [..] # read string[a4] and return.

I could somehow imagine that for aarch64, i.e. without length control, we might
need to keep the vector IV in a vector register, but for riscv it seems pretty
pointless in this example.  On top of that we're paying the cost of
broadcasting the length to a vector, adding vectors in every iteration, and,
once, moving the result back to a scalar, when all of it could be done "for
free" on the scalar side.

What's a reasonable way to avoid this?  Can we special-case the vector IV
generation when it's only used in a simple way like here?  The above example is
simple but we're also seeing related behavior for non-early-break loops.
Sometimes we even create a new vec_series in every iteration just to add it to
the number of iterations.  As the vec_series is a VLA operation we can't hoist
it out of the loop.
