https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518
--- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note the vectorized loop is pretty much the same on arm little-endian:
# vect_vec_iv_.6_33 = PHI <{ 0, 1, 2, 3 }(4), vect_vec_iv_.6_34(5)>
# ivtmp.12_14 = PHI <ivtmp.12_51(4), ivtmp.12_23(5)>
vectp_p.8_37 = (int[8] *) ivtmp.12_14;
vect_vec_iv_.6_34 = vect_vec_iv_.6_33 + { 4, 4, 4, 4 };
vect__4.7_36 = vect_vec_iv_.6_33 + { 1, 1, 1, 1 };
vect_array.10[0] = vect_vec_iv_.6_33;
vect_array.10[1] = vect__4.7_36;
MEM[(int *)vectp_p.8_37] = STORE_LANES (vect_array.10);
ivtmp.12_23 = ivtmp.12_14 + 32;
if (ivtmp.12_23 != _54)
  goto <bb 5>; [83.33%]
else
  goto <bb 6>; [16.67%]
for which we emit:
        vmov.i32 q12, #4 @ v4si
        vmov.i32 q9, #1 @ v4si
        ...
        vldr d16, .L13
        vldr d17, .L13+8
.L4:
        vmov q10, q8 @ v4si
        vadd.i32 q11, q8, q9
        vadd.i32 q8, q8, q12
        vst2.32 {d20-d23}, [r3]!
        cmp r3, r2
        bne .L4
vst2.32 seems to be documented to do 32-bit interleaving, so if qN registers
overlap d{2*N} and d{2*N+1} registers, I guess this does the right thing.
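
To make the expected result concrete, here is a minimal C sketch of what the loop above should store if STORE_LANES (and hence vst2.32) interleaves the two vectors lane by lane. It is only a model, not the testcase from this PR: the function names store_lanes2, vector_loop and matches_scalar are hypothetical, and the scalar form p[2*i] = i, p[2*i+1] = i + 1 is my reading of the GIMPLE rather than something quoted from the source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of a two-vector STORE_LANES / vst2.32:
   lane j of vector a goes to dst[2*j], lane j of vector b to dst[2*j+1]. */
void store_lanes2(int *dst, const int a[4], const int b[4])
{
    for (size_t lane = 0; lane < 4; lane++) {
        dst[2 * lane] = a[lane];
        dst[2 * lane + 1] = b[lane];
    }
}

/* Model of the vectorized loop: the induction vector starts at {0,1,2,3},
   each iteration stores 8 ints (32 bytes) and then advances the vector by 4. */
void vector_loop(int *p, size_t iters)
{
    int iv[4] = {0, 1, 2, 3};

    for (size_t it = 0; it < iters; it++) {
        int iv_plus_1[4];

        for (size_t lane = 0; lane < 4; lane++)
            iv_plus_1[lane] = iv[lane] + 1;

        store_lanes2(p + 8 * it, iv, iv_plus_1);

        for (size_t lane = 0; lane < 4; lane++)
            iv[lane] += 4;
    }
}

/* Returns 1 if the model agrees with the assumed scalar loop
   p[2*i] = i; p[2*i+1] = i + 1; and 0 otherwise. */
int matches_scalar(size_t iters)
{
    enum { MAX_ITERS = 16 };
    int p[8 * MAX_ITERS];

    assert(iters <= MAX_ITERS);
    vector_loop(p, iters);

    for (size_t i = 0; i < 4 * iters; i++)
        if (p[2 * i] != (int) i || p[2 * i + 1] != (int) i + 1)
            return 0;
    return 1;
}
```

Under this model one iteration writes 0 1 1 2 2 3 3 4, which is what the little-endian code should produce as long as the register overlap assumption holds.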