https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518
--- Comment #44 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Maybe -O3 -mcpu=cortex-a9 -mfpu=neon-fp16 -mfloat-abi=hard is needed. With that I certainly see the #c42 loop vectorized.

On x86_64 we get in *.optimized:

  <bb 5> [local count: 567644349]:
  # vect_vec_iv_.4_33 = PHI <{ 0, 1, 2, 3, 4, 5, 6, 7 }(4), vect_vec_iv_.4_34(5)>
  # ivtmp.10_14 = PHI <ivtmp.10_85(4), ivtmp.10_23(5)>
  vect_vec_iv_.4_34 = vect_vec_iv_.4_33 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  vect__4.5_36 = vect_vec_iv_.4_33 + { 1, 1, 1, 1, 1, 1, 1, 1 };
  vect_inter_high_39 = VEC_PERM_EXPR <vect_vec_iv_.4_33, vect__4.5_36, { 0, 8, 1, 9, 2, 10, 3, 11 }>;
  vect_inter_low_40 = VEC_PERM_EXPR <vect_vec_iv_.4_33, vect__4.5_36, { 4, 12, 5, 13, 6, 14, 7, 15 }>;
  _86 = (void *) ivtmp.10_14;
  MEM[base: _86, offset: 0B] = vect_inter_high_39;
  MEM[base: _86, offset: 32B] = vect_inter_low_40;
  ivtmp.10_23 = ivtmp.10_14 + 64;
  if (ivtmp.10_23 != _90)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]

which doesn't look optimal either, in this case I'd say better would be to have two IVs bumped by { 8, ... 8 } in each iteration, one starting with { 0, 1, 1, 2, 2, 3, 3, 4 } and another with { 4, 5, 5, 6, 6, 7, 7, 8 } or just one and add { 4, ... 4 } to it for the second store and avoid both VEC_PERM_EXPRs in that case.
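
To make the two-IV suggestion concrete, here is a rough scalar model in C of the x86_64 loop body. It compares the current scheme (one IV plus two VEC_PERM_EXPRs) against the suggested one (two pre-interleaved IVs, no permutes). The vec8 type and all function names are made up for illustration; this only models lane values, not what the vectorizer would actually emit.

```c
#include <assert.h>
#include <string.h>

#define VF 8  /* lanes per vector, as in the x86_64 dump */

typedef struct { int lane[VF]; } vec8;  /* hypothetical stand-in for a vector register */

/* Model of VEC_PERM_EXPR <a, b, sel>: index i selects a.lane[i] when
   i < VF and b.lane[i - VF] otherwise.  */
static vec8 vec_perm(vec8 a, vec8 b, const int sel[VF])
{
    vec8 r;
    for (int i = 0; i < VF; i++) {
        assert(sel[i] >= 0 && sel[i] < 2 * VF);
        r.lane[i] = sel[i] < VF ? a.lane[sel[i]] : b.lane[sel[i] - VF];
    }
    return r;
}

/* Add the same constant to every lane.  */
static vec8 vec_add_all(vec8 a, int k)
{
    for (int i = 0; i < VF; i++)
        a.lane[i] += k;
    return a;
}

/* Current scheme: one IV, an add of { 1, ... 1 } and two permutes per iteration.  */
static void current_scheme(int *out, int iterations)
{
    static const int sel_high[VF] = { 0, 8, 1, 9, 2, 10, 3, 11 };
    static const int sel_low[VF]  = { 4, 12, 5, 13, 6, 14, 7, 15 };
    vec8 iv = { { 0, 1, 2, 3, 4, 5, 6, 7 } };
    for (int it = 0; it < iterations; it++) {
        vec8 plus1 = vec_add_all(iv, 1);
        vec8 high = vec_perm(iv, plus1, sel_high);
        vec8 low  = vec_perm(iv, plus1, sel_low);
        memcpy(out, high.lane, sizeof high.lane);
        memcpy(out + VF, low.lane, sizeof low.lane);
        out += 2 * VF;
        iv = vec_add_all(iv, 8);
    }
}

/* Suggested scheme: two pre-interleaved IVs, each bumped by { 8, ... 8 }.  */
static void two_iv_scheme(int *out, int iterations)
{
    vec8 iv_a = { { 0, 1, 1, 2, 2, 3, 3, 4 } };
    vec8 iv_b = { { 4, 5, 5, 6, 6, 7, 7, 8 } };
    for (int it = 0; it < iterations; it++) {
        memcpy(out, iv_a.lane, sizeof iv_a.lane);
        memcpy(out + VF, iv_b.lane, sizeof iv_b.lane);
        out += 2 * VF;
        iv_a = vec_add_all(iv_a, 8);
        iv_b = vec_add_all(iv_b, 8);
    }
}

/* Returns 1 if both schemes store identical data for the given trip count.  */
int schemes_agree(int iterations)
{
    enum { MAX_ITERS = 16 };
    int a[MAX_ITERS * 2 * VF], b[MAX_ITERS * 2 * VF];
    assert(iterations >= 0 && iterations <= MAX_ITERS);
    current_scheme(a, iterations);
    two_iv_scheme(b, iterations);
    return memcmp(a, b, (size_t) iterations * 2 * VF * sizeof(int)) == 0;
}
```

Calling schemes_agree (n) for any n up to 16 compares the two stored streams. The single-IV variant would simply replace iv_b with iv_a + { 4, ... 4 } at the second store.
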

On armeb with the above options I see:

  <bb 5> [local count: 504572758]:
  # vect_vec_iv_.7_45 = PHI <{ 0, 1, 2, 3 }(4), vect_vec_iv_.7_46(5)>
  # ivtmp.31_128 = PHI <ivtmp.31_130(4), ivtmp.31_129(5)>
  vectp_p.9_49 = (int[8] *) ivtmp.31_128;
  vect_vec_iv_.7_46 = vect_vec_iv_.7_45 + { 4, 4, 4, 4 };
  vect__4.8_48 = vect_vec_iv_.7_45 + { 1, 1, 1, 1 };
  vect_array.11[0] = vect_vec_iv_.7_45;
  vect_array.11[1] = vect__4.8_48;
  MEM[(int *)vectp_p.9_49] = STORE_LANES (vect_array.11);
  ivtmp.31_129 = ivtmp.31_128 + 32;
  if (ivtmp.31_129 != _133)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]

which looks wrong to me (because vect_vec_iv_.7_45 and vect__4.8_48 really should be interleaved when stored into MEM[(int *)vectp_p.9_49]), but I really don't know what exactly the STORE_LANES does.
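
For reference, a small C model of what STORE_LANES would do if it follows the vec_store_lanes optab description in the GCC internals manual (mem[j * c + i] = vec[i][j] for c vectors). That it applies unchanged here is an assumption, and the function names are invented for the sketch. It only models the abstract element order and says nothing about how lanes are numbered on a big-endian target such as armeb.

```c
#include <assert.h>

#define NVECS  2  /* vectors in vect_array.11 */
#define NLANES 4  /* lanes per vector in the armeb dump */

/* Assumed semantics, taken from the vec_store_lanes optab documentation:
   element j of vector i is stored at mem[j * NVECS + i], so the vectors
   are interleaved by the store itself.  */
void store_lanes(int *mem, int vecs[NVECS][NLANES])
{
    assert(mem != 0);
    for (int j = 0; j < NLANES; j++)
        for (int i = 0; i < NVECS; i++)
            mem[j * NVECS + i] = vecs[i][j];
}

/* First iteration of the armeb loop: vect_array[0] = { 0, 1, 2, 3 },
   vect_array[1] = vect_array[0] + { 1, 1, 1, 1 }.  */
void first_iteration(int mem[NVECS * NLANES])
{
    int vect_array[NVECS][NLANES];
    for (int j = 0; j < NLANES; j++) {
        vect_array[0][j] = j;
        vect_array[1][j] = j + 1;
    }
    store_lanes(mem, vect_array);
}
```

Under that assumption the first iteration would write 0, 1, 1, 2, 2, 3, 3, 4, i.e. the same order the x86_64 permutes produce, so the missing explicit interleave in the GIMPLE would not by itself be a bug. Whether the lane order is handled correctly for big-endian is a separate question this model does not answer.
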