https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115120

            Bug ID: 115120
           Summary: Bad interaction between ivcanon and early break
                    vectorization
           Product: gcc
           Version: 14.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: acoplan at gcc dot gnu.org
  Target Milestone: ---

Consider the following testcase on aarch64:

int arr[1024];

int *f()
{
  int i;
  for (i = 0; i < 1024; i++)
    if (arr[i] == 42)
      break;
  return arr + i;
}

compiled with -O3 we get the following vector loop body:

.L2:
        cmp     x2, x1
        beq     .L9
.L6:
        ldr     q31, [x1]
        add     x1, x1, 16
        mov     v27.16b, v29.16b
        mov     v28.16b, v30.16b
        cmeq    v31.4s, v31.4s, v26.4s
        add     v29.4s, v29.4s, v24.4s
        add     v30.4s, v30.4s, v25.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L2

it's somewhat surprising that there are two vector adds, looking at the
optimized dump:

  <bb 3> [local count: 1063004408]:
  # vect_vec_iv_.6_28 = PHI <_29(10), { 0, 1, 2, 3 }(2)>
  # vect_vec_iv_.7_33 = PHI <_34(10), { 1024, 1023, 1022, 1021 }(2)>
  # ivtmp.18_19 = PHI <ivtmp.18_20(10), ivtmp.18_26(2)>
  _34 = vect_vec_iv_.7_33 + { 4294967292, 4294967292, 4294967292, 4294967292 };
  _29 = vect_vec_iv_.6_28 + { 4, 4, 4, 4 };
  _25 = (void *) ivtmp.18_19;
  vect__1.10_39 = MEM <vector(4) int> [(int *)_25];
  mask_patt_9.11_41 = vect__1.10_39 == { 42, 42, 42, 42 };
  if (mask_patt_9.11_41 != { 0, 0, 0, 0 })
    goto <bb 4>; [5.50%]
  else
    goto <bb 10>; [94.50%]

we can see that there are two IV updates that got vectorized.  It turns out
that one of these comes from the ivcanon pass.  If I add
-fno-tree-loop-ivcanon we instead get the following vector loop body:

.L2:
        cmp     x2, x1
        beq     .L9
.L6:
        ldr     q31, [x1]
        add     x1, x1, 16
        mov     v29.16b, v30.16b
        add     v30.4s, v30.4s, v27.4s
        cmeq    v31.4s, v31.4s, v28.4s
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x3, d31
        cbz     x3, .L2

which is much cleaner.
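
To illustrate why the second vector add in the -O3 body looks redundant, here is a rough lane-by-lane emulation of the two vector IVs from the optimized dump. This is only a sketch, not anything GCC emits: the names vec_ivs, vec_ivs_init, vec_ivs_step and vec_ivs_redundant are made up for illustration, and it assumes 32-bit unsigned lanes, so that adding 4294967292 is the same as subtracting 4 modulo 2^32.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two vector IVs in the optimized dump:
   fwd models vect_vec_iv_.6 and bwd models vect_vec_iv_.7.  */
struct vec_ivs
{
  uint32_t fwd[4];
  uint32_t bwd[4];
};

/* Initial values taken from the PHI nodes in <bb 3>.  */
void vec_ivs_init (struct vec_ivs *v)
{
  for (int k = 0; k < 4; k++)
    {
      v->fwd[k] = (uint32_t) k;          /* { 0, 1, 2, 3 } */
      v->bwd[k] = 1024u - (uint32_t) k;  /* { 1024, 1023, 1022, 1021 } */
    }
}

/* One vector iteration: the two adds seen in the dump.  */
void vec_ivs_step (struct vec_ivs *v)
{
  for (int k = 0; k < 4; k++)
    {
      v->fwd[k] = (uint32_t) (v->fwd[k] + 4u);
      /* 4294967292 == 2^32 - 4, i.e. -4 in 32-bit unsigned arithmetic.  */
      v->bwd[k] = (uint32_t) (v->bwd[k] + 4294967292u);
    }
}

/* Returns 1 if every backward lane equals 1024 minus the matching
   forward lane, i.e. if the second IV carries no extra information.  */
int vec_ivs_redundant (const struct vec_ivs *v)
{
  for (int k = 0; k < 4; k++)
    if (v->bwd[k] != (uint32_t) (1024u - v->fwd[k]))
      return 0;
  return 1;
}
```

Under these assumptions the relation bwd[k] == 1024 - fwd[k] holds after every step, which suggests that either vector IV could in principle be derived from the other.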

Looking at the tree dumps, the ivcanon pass makes the following
transformation:

--- cddce2.tree 2024-05-16 13:49:10.426703350 +0000
+++ ivcanon.tree        2024-05-16 13:49:17.678874925 +0000
@@ -4,6 +4,8 @@
   int i;
   int _1;
   int * _8;
+  unsigned int ivtmp_11;
+  unsigned int ivtmp_12;
   long unsigned int _13;
   long unsigned int _15;
   long unsigned int prephitmp_16;
@@ -12,6 +14,7 @@
 
   <bb 3> [local count: 1063004408]:
   # i_10 = PHI <i_7(7), 0(2)>
+  # ivtmp_12 = PHI <ivtmp_11(7), 1024(2)>
   _1 = arr[i_10];
   if (_1 == 42)
     goto <bb 5>; [5.50%]
@@ -20,7 +23,8 @@
 
   <bb 4> [local count: 1004539166]:
   i_7 = i_10 + 1;
-  if (i_7 != 1024)
+  ivtmp_11 = ivtmp_12 - 1;
+  if (ivtmp_11 != 0)
     goto <bb 7>; [98.93%]
   else
     goto <bb 8>; [1.07%]

i.e. it introduces the backwards-counting IV.  It seems in the general case
without vectorization ivopts then cleans this up and ensures we only have a
single IV.

In the vectorized case it seems this problem only shows up with early break
vectorization.  Looking at a simple reduction, such as:

int a[1024];
int g() {
  int sum = 0;
  for (int i = 0; i < 1024; i++)
    sum += a[i];
  return sum;
}

although we still have the backwards-counting IV in ifcvt:

  <bb 3> [local count: 1063004408]:
  # sum_9 = PHI <sum_5(5), 0(2)>
  # i_11 = PHI <i_6(5), 0(2)>
  # ivtmp_8 = PHI <ivtmp_7(5), 1024(2)>
  _1 = a[i_11];
  sum_5 = _1 + sum_9;
  i_6 = i_11 + 1;
  ivtmp_7 = ivtmp_8 - 1;
  if (ivtmp_7 != 0)
    goto <bb 5>; [98.99%]
  else
    goto <bb 4>; [1.01%]

we end up with only scalar IVs after vectorization, and the backwards scalar
IV ends up getting deleted by dce6:

Deleting : ivtmp_7 = ivtmp_8 - 1;

I'm not sure what the right solution is but we should avoid having duplicated
IVs with early break vectorization.
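
For reference, the ivcanon transformation shown in the diff above can be approximated at the source level as follows. This is a hand-written sketch, not compiler output: f_after_ivcanon is a hypothetical name, and the assert is added only to make the relationship between the two IVs explicit.

```c
#include <assert.h>

int arr[1024];

/* Original testcase from the report.  */
int *f (void)
{
  int i;
  for (i = 0; i < 1024; i++)
    if (arr[i] == 42)
      break;
  return arr + i;
}

/* Approximation of the loop after ivcanon: the latch exit now tests a new
   down-counting IV (ivtmp), while i is still live because it is used for
   the load and for the return value.  */
int *f_after_ivcanon (void)
{
  int i = 0;
  unsigned int ivtmp = 1024;
  for (;;)
    {
      /* The two IVs always satisfy ivtmp == 1024 - i.  */
      assert (ivtmp == 1024u - (unsigned int) i);
      if (arr[i] == 42)   /* early break, as in <bb 3> */
        break;
      i = i + 1;          /* i_7 = i_10 + 1 */
      ivtmp = ivtmp - 1;  /* ivtmp_11 = ivtmp_12 - 1 */
      if (ivtmp == 0)     /* replaces the old test i_7 != 1024 */
        break;
    }
  return arr + i;
}
```

Both functions should return the same pointer for any contents of arr. Because of the early break, i has a use after the loop, so both IVs stay live going into the vectorizer.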