https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119577
Bug ID: 119577 Summary: RISC-V: Redundant vector IV roundtrip. Product: gcc Version: 15.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: rdapp at gcc dot gnu.org Target Milestone: --- Target: riscv Given this function (basically vect-early-break_133_pfa1.c): #define SZ 1020 char string[SZ]; __attribute__ ((noipa)) char * find(int n, char c) { for (int i = 1; i < n; i++) { if (string[i] == c) return &string[i]; } return 0; } we generate the following gimple: <bb 5> [local count: 913217421]: # vect_vec_iv_.7_47 = PHI <_51(8), { 1, 2, 3, ... }(4)> # vectp_string.8_57 = PHI <vectp_string.8_58(8), &MEM <char[1020]> [(void *)&string + 1B](4)> # ivtmp_65 = PHI <ivtmp_66(8), _64(4)> _67 = .SELECT_VL (ivtmp_65, POLY_INT_CST [16, 16]); vect__1.10_60 = .MASK_LEN_LOAD (vectp_string.8_57, 8B, { -1, ... }, _59(D), _67, 0); mask_patt_15.11_61 = _54 == vect__1.10_60; vec_len_mask_62 = .VCOND_MASK_LEN (mask_patt_15.11_61, { -1, ... }, { 0, ... }, _67, 0); if (vec_len_mask_62 != { 0, ... }) goto <bb 6>; [5.50%] else goto <bb 8>; [94.50%] <bb 6> [local count: 97691432]: _53 = BIT_FIELD_REF <vect_vec_iv_.7_47, 32, 0>; goto <bb 9>; [100.00%] <bb 7> [local count: 55807731]: _2 = (sizetype) i_35; _8 = &string + _2; goto <bb 12>; [100.00%] <bb 8> [local count: 862990464]: _49 = (int) _67; _50 = [vec_duplicate_expr] _49; _51 = vect_vec_iv_.7_47 + _50; vectp_string.8_58 = vectp_string.8_57 + _67; ivtmp_66 = ivtmp_65 - _67; if (ivtmp_66 != 0) goto <bb 5>; [94.50%] else goto <bb 12>; [5.50%] In itself that's reasonable. However, with the function returning string + i we need the induction variable and obtain it via vectorizable_live_operation from vect_vec_iv_.7_47. That results in the following annotated assembly: vid.v v2 # v2 = {0, 1, ...} vadd.vi v2,v2,1 # v2 = {1, 2, ...} .L4: [..] vadd.vv v2,v2,v3 # v2 += vector length in current iter. beq a4,zero,.L8 .L5: vsetvli a5,a4,e8,mf4,ta,ma # a5 = vector length in current iter [..] vmv.v.x v3,a5 # broadcast vector length to vector [..] beq a5,zero,.L4 # latch vmv.x.s a5,v2 # move last vector length to GPR. .L3: add a4,a7,a5 # use last vector length j .L7 .L7 [..] # read string[a4] and return. I could somehow imagine that for aarch64, i.e. without length control we need to keep the vector length in a vector register but for riscv it seems pretty pointless in this example. On top we're paying the cost of broadcasting to vector, adding vectors in every iteration, and, once, moving back to scalar when all of it could be done "for free" on the scalar side. What's a reasonable way to avoid this? Can we special case the vector IV generation when it's only used in a simple way like here? The above example is simple but we're also seeing related behavior for non early-break loops. Sometimes we even create a new vec_series in every iteration in order to just add it to the number of iterations. The vec_series being a VLA operation we can't hoist it out of the loop.