https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518

--- Comment #45 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Note that the vectorized loop is pretty much the same on little-endian ARM:
  # vect_vec_iv_.6_33 = PHI <{ 0, 1, 2, 3 }(4), vect_vec_iv_.6_34(5)>
  # ivtmp.12_14 = PHI <ivtmp.12_51(4), ivtmp.12_23(5)>
  vectp_p.8_37 = (int[8] *) ivtmp.12_14;
  vect_vec_iv_.6_34 = vect_vec_iv_.6_33 + { 4, 4, 4, 4 };
  vect__4.7_36 = vect_vec_iv_.6_33 + { 1, 1, 1, 1 };
  vect_array.10[0] = vect_vec_iv_.6_33;
  vect_array.10[1] = vect__4.7_36;
  MEM[(int *)vectp_p.8_37] = STORE_LANES (vect_array.10);
  ivtmp.12_23 = ivtmp.12_14 + 32;
  if (ivtmp.12_23 != _54)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]
for which we emit:
        vmov.i32        q12, #4  @ v4si
        vmov.i32        q9, #1  @ v4si
...
        vldr    d16, .L13
        vldr    d17, .L13+8
.L4:
        vmov    q10, q8  @ v4si
        vadd.i32        q11, q8, q9
        vadd.i32        q8, q8, q12
        vst2.32 {d20-d23}, [r3]!
        cmp     r3, r2
        bne     .L4

vst2.32 seems to be documented to do 32-bit interleaving, so if qN registers
overlap d{2*N} and d{2*N+1} registers, I guess this does the right thing.
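
To make that reasoning concrete, here is a rough scalar model in C of one iteration of the .L4 loop. It is only a sketch under stated assumptions: the function names are made up for illustration, the four-register form of vst2.32 is assumed to pair d20 with d22 and d21 with d23 (my reading of the architecture manual's pseudo-code), and q10/q11 are assumed to alias d20:d21 and d22:d23 with the low half in the even-numbered register.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of "vst2.32 {dN-dN+3}, [addr]".
   Assumption: the first structure member comes from dN and dN+1,
   the second from dN+2 and dN+3.  d[k] stands for register dN+k,
   holding two 32-bit lanes.  */
static void
model_vst2_32 (int32_t d[4][2], int32_t *mem)
{
  for (int r = 0; r < 2; r++)
    for (int e = 0; e < 2; e++)
      {
        *mem++ = d[r][e];      /* lane e of dN+r (first member) */
        *mem++ = d[r + 2][e];  /* lane e of dN+2+r (second member) */
      }
}

/* Hypothetical model of the stores done by one trip through .L4,
   given the value of q8 on entry to the iteration.  */
static void
model_iteration (const int32_t q8[4], int32_t out[8])
{
  int32_t q10[4], q11[4], d[4][2];

  for (int i = 0; i < 4; i++)
    {
      q10[i] = q8[i];      /* vmov     q10, q8 */
      q11[i] = q8[i] + 1;  /* vadd.i32 q11, q8, q9 (q9 is all ones) */
    }

  /* Assumed aliasing: qN overlaps d{2*N} (low) and d{2*N+1} (high).  */
  memcpy (d[0], &q10[0], sizeof d[0]);  /* d20 */
  memcpy (d[1], &q10[2], sizeof d[1]);  /* d21 */
  memcpy (d[2], &q11[0], sizeof d[2]);  /* d22 */
  memcpy (d[3], &q11[2], sizeof d[3]);  /* d23 */

  model_vst2_32 (d, out);
}
```

Under those assumptions, starting from q8 = { 0, 1, 2, 3 } the model stores 0, 1, 1, 2, 2, 3, 3, 4, i.e. the lanes of vect_vec_iv_.6_33 and vect__4.7_36 interleaved, which is what the STORE_LANES in the GIMPLE asks for.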
