https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121952
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
LLVM produces this for the loop (with -O2 -fno-vectorize):
.LBB0_2:
and w12, w9, #0x1
subs x11, x11, #1
sub x9, x9, #1
strb w12, [x10], #1
b.ne .LBB0_2
But really it should just be:
.LBB0_2:
subs x11, x11, #1
strb w9, [x10], #1 ;; post increment
eor w9, w9, #1
b.ne .LBB0_2
GCC produces this:
.L3:
add w1, w3, w0
and w1, w1, 1
strb w1, [x0, 1]! // pre increment
cmp x0, x2
bne .L3
But that is because the IV is the pointer, and an offset (w3) is added to it.
Also, LLVM's vectorized loop is worse:
```
.LBB0_6:
sub v20.2d, v0.2d, v18.2d
sub v24.2d, v0.2d, v4.2d
subs x12, x12, #16
sub v21.2d, v0.2d, v7.2d
sub v25.2d, v0.2d, v3.2d
add v7.2d, v7.2d, v17.2d
sub v22.2d, v0.2d, v6.2d
sub v26.2d, v0.2d, v2.2d
add v6.2d, v6.2d, v17.2d
sub v23.2d, v0.2d, v5.2d
sub v27.2d, v0.2d, v1.2d
add v18.2d, v18.2d, v17.2d
add v5.2d, v5.2d, v17.2d
add v4.2d, v4.2d, v17.2d
add v3.2d, v3.2d, v17.2d
add v2.2d, v2.2d, v17.2d
add v1.2d, v1.2d, v17.2d
tbl v20.16b, { v20.16b, v21.16b, v22.16b, v23.16b }, v19.16b
tbl v21.16b, { v24.16b, v25.16b, v26.16b, v27.16b }, v19.16b
mov v20.d[1], v21.d[0]
and v20.16b, v20.16b, v16.16b
str q20, [x11], #16
```
This is 99% due to the add/sub still being inside the loop: LLVM does not figure
out that, in the end, this is just `^1` applied to the induction variable.
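To illustrate the `^1` point, here is a minimal C sketch. The loop body is an assumption, since the original test case is not quoted in this comment, and `fill_and` and `fill_xor` are hypothetical names. The first function stores the low bit of a decrementing counter, which is roughly what the scalar LLVM loop computes. The second carries that bit as a one-bit recurrence and toggles it each iteration, which is the form the loop could be reduced to.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the loop: store the low bit of a counter that
   decreases by one per iteration (an and + sub in the loop body). */
void fill_and(uint8_t *dst, size_t n, uint64_t start)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t)((start - i) & 1);
}

/* Same bytes, but the parity is kept in a single bit and flipped with
   ^1, so no add/sub of the counter is needed to compute the value. */
void fill_xor(uint8_t *dst, size_t n, uint64_t start)
{
    uint8_t bit = (uint8_t)(start & 1);
    for (size_t i = 0; i < n; i++) {
        dst[i] = bit;
        bit ^= 1;
    }
}
```

The two agree because subtracting 1 always flips the low bit, so `(start - i) & 1` equals `(start & 1) ^ (i & 1)`. This still holds when the unsigned subtraction wraps around.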