https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518
--- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note the vectorized loop is pretty much the same on arm little-endian:

  # vect_vec_iv_.6_33 = PHI <{ 0, 1, 2, 3 }(4), vect_vec_iv_.6_34(5)>
  # ivtmp.12_14 = PHI <ivtmp.12_51(4), ivtmp.12_23(5)>
  vectp_p.8_37 = (int[8] *) ivtmp.12_14;
  vect_vec_iv_.6_34 = vect_vec_iv_.6_33 + { 4, 4, 4, 4 };
  vect__4.7_36 = vect_vec_iv_.6_33 + { 1, 1, 1, 1 };
  vect_array.10[0] = vect_vec_iv_.6_33;
  vect_array.10[1] = vect__4.7_36;
  MEM[(int *)vectp_p.8_37] = STORE_LANES (vect_array.10);
  ivtmp.12_23 = ivtmp.12_14 + 32;
  if (ivtmp.12_23 != _54)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]

for which we emit:

        vmov.i32        q12, #4  @ v4si
        vmov.i32        q9, #1  @ v4si
...
        vldr    d16, .L13
        vldr    d17, .L13+8
.L4:
        vmov    q10, q8  @ v4si
        vadd.i32        q11, q8, q9
        vadd.i32        q8, q8, q12
        vst2.32 {d20-d23}, [r3]!
        cmp     r3, r2
        bne     .L4

vst2.32 seems to be documented to do 32-bit interleaving, so if the qN
registers overlap the d{2*N} and d{2*N+1} registers, I guess this does the
right thing.
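
To make the interleaving argument concrete, here is a rough scalar model of the .L4 loop in C. It is only a sketch and rests on two assumptions: that the four-register form of vst2.32 pairs element e of the first q register (here q10 = d20:d21) with element e of the second (q11 = d22:d23), and that lanes 0-1 of qN live in d{2*N} and lanes 2-3 in d{2*N+1}. The names q_reg, model_vst2_32 and model_loop are made up for illustration and do not come from GCC or from any ARM header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One 128-bit q register viewed as four 32-bit lanes.
   Assumption: lanes 0-1 are d{2N}, lanes 2-3 are d{2N+1}. */
typedef struct { int32_t lane[4]; } q_reg;

/* Hypothetical model of "vst2.32 {dA-dA+3}, [mem]" where
   first = dA:dA+1 and second = dA+2:dA+3.
   Assumption: element e of `first` is stored next to element e
   of `second`, giving 2-way 32-bit interleaving over 8 words. */
void model_vst2_32(const q_reg *first, const q_reg *second, int32_t *mem)
{
    assert(first != NULL && second != NULL && mem != NULL);
    for (size_t e = 0; e < 4; e++) {
        mem[2 * e]     = first->lane[e];
        mem[2 * e + 1] = second->lane[e];
    }
}

/* Scalar model of the .L4 loop: q8 is the vector IV, q9 = {1,1,1,1},
   q12 = {4,4,4,4}; each iteration writes 8 ints (32 bytes) to p. */
void model_loop(int32_t *p, size_t iterations)
{
    q_reg q8 = { { 0, 1, 2, 3 } };   /* initial value loaded from .L13 */

    for (size_t it = 0; it < iterations; it++) {
        q_reg q10 = q8;              /* vmov     q10, q8      */
        q_reg q11;
        for (size_t e = 0; e < 4; e++)
            q11.lane[e] = q8.lane[e] + 1;   /* vadd.i32 q11, q8, q9  */
        for (size_t e = 0; e < 4; e++)
            q8.lane[e] += 4;                /* vadd.i32 q8, q8, q12  */
        model_vst2_32(&q10, &q11, p);       /* vst2.32 {d20-d23}, [r3]! */
        p += 8;                             /* post-increment by 32 bytes */
    }
}
```

Under those assumptions, calling model_loop on an 8-element buffer for one iteration stores 0, 1, 1, 2, 2, 3, 3, 4, which is what STORE_LANES of { 0, 1, 2, 3 } and { 1, 2, 3, 4 } should produce.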