https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82518

--- Comment #44 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Maybe -O3 -mcpu=cortex-a9 -mfpu=neon-fp16 -mfloat-abi=hard is needed.
With that I certainly see the #c42 loop vectorized.

On x86_64 we get in *.optimized:
  <bb 5> [local count: 567644349]:
  # vect_vec_iv_.4_33 = PHI <{ 0, 1, 2, 3, 4, 5, 6, 7 }(4), vect_vec_iv_.4_34(5)>
  # ivtmp.10_14 = PHI <ivtmp.10_85(4), ivtmp.10_23(5)>
  vect_vec_iv_.4_34 = vect_vec_iv_.4_33 + { 8, 8, 8, 8, 8, 8, 8, 8 };
  vect__4.5_36 = vect_vec_iv_.4_33 + { 1, 1, 1, 1, 1, 1, 1, 1 };
  vect_inter_high_39 = VEC_PERM_EXPR <vect_vec_iv_.4_33, vect__4.5_36, { 0, 8, 1, 9, 2, 10, 3, 11 }>;
  vect_inter_low_40 = VEC_PERM_EXPR <vect_vec_iv_.4_33, vect__4.5_36, { 4, 12, 5, 13, 6, 14, 7, 15 }>;
  _86 = (void *) ivtmp.10_14;
  MEM[base: _86, offset: 0B] = vect_inter_high_39;
  MEM[base: _86, offset: 32B] = vect_inter_low_40;
  ivtmp.10_23 = ivtmp.10_14 + 64;
  if (ivtmp.10_23 != _90)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]
which doesn't look optimal either. In this case I'd say it would be better to
have two IVs bumped by { 8, ... 8 } in each iteration, one starting with
{ 0, 1, 1, 2, 2, 3, 3, 4 } and another with
{ 4, 5, 5, 6, 6, 7, 7, 8 }, or just one IV with { 4, ... 4 } added to it for
the second store, which avoids both VEC_PERM_EXPRs.
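
To make the suggestion concrete, here is a minimal scalar sketch in C that models the x86_64 loop body above and checks that the two VEC_PERM_EXPR results equal the two proposed IVs in every iteration. The helper names (vec_perm, interleave_matches_two_ivs) are made up for illustration, and the model assumes VEC_PERM_EXPR selects element sel[i] from the concatenation of its two inputs; it is not GCC code.

```c
#include <stdint.h>
#include <string.h>

enum { NUNITS = 8 };

/* Scalar model of VEC_PERM_EXPR <a, b, sel>: result element i is taken
   from the concatenation of a and b at index sel[i].  */
static void vec_perm(const int32_t *a, const int32_t *b,
                     const int *sel, int32_t *out)
{
    for (int i = 0; i < NUNITS; i++)
        out[i] = sel[i] < NUNITS ? a[sel[i]] : b[sel[i] - NUNITS];
}

/* Returns 1 if, for `iters` iterations, the permuted vectors match the
   two proposed IVs (and the second is the first plus 4), else 0.  */
int interleave_matches_two_ivs(int iters)
{
    static const int sel_high[NUNITS] = { 0, 8, 1, 9, 2, 10, 3, 11 };
    static const int sel_low[NUNITS]  = { 4, 12, 5, 13, 6, 14, 7, 15 };
    int32_t iv[NUNITS]      = { 0, 1, 2, 3, 4, 5, 6, 7 };  /* current IV */
    int32_t iv_high[NUNITS] = { 0, 1, 1, 2, 2, 3, 3, 4 };  /* proposed IV 1 */
    int32_t iv_low[NUNITS]  = { 4, 5, 5, 6, 6, 7, 7, 8 };  /* proposed IV 2 */

    for (int n = 0; n < iters; n++) {
        int32_t plus1[NUNITS], high[NUNITS], low[NUNITS];

        for (int i = 0; i < NUNITS; i++)
            plus1[i] = iv[i] + 1;
        vec_perm(iv, plus1, sel_high, high);
        vec_perm(iv, plus1, sel_low, low);

        if (memcmp(high, iv_high, sizeof high) != 0
            || memcmp(low, iv_low, sizeof low) != 0)
            return 0;

        /* Single-IV variant: the second store is the first plus 4.  */
        for (int i = 0; i < NUNITS; i++)
            if (low[i] != iv_high[i] + 4)
                return 0;

        /* Bump all IVs by { 8, ... 8 }.  */
        for (int i = 0; i < NUNITS; i++) {
            iv[i] += 8;
            iv_high[i] += 8;
            iv_low[i] += 8;
        }
    }
    return 1;
}
```

Calling interleave_matches_two_ivs with any small iteration count should return 1, i.e. the stored values are the same with or without the permutations.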

On armeb with the above options I see:
  <bb 5> [local count: 504572758]:
  # vect_vec_iv_.7_45 = PHI <{ 0, 1, 2, 3 }(4), vect_vec_iv_.7_46(5)>
  # ivtmp.31_128 = PHI <ivtmp.31_130(4), ivtmp.31_129(5)>
  vectp_p.9_49 = (int[8] *) ivtmp.31_128;
  vect_vec_iv_.7_46 = vect_vec_iv_.7_45 + { 4, 4, 4, 4 };
  vect__4.8_48 = vect_vec_iv_.7_45 + { 1, 1, 1, 1 };
  vect_array.11[0] = vect_vec_iv_.7_45;
  vect_array.11[1] = vect__4.8_48;
  MEM[(int *)vectp_p.9_49] = STORE_LANES (vect_array.11);
  ivtmp.31_129 = ivtmp.31_128 + 32;
  if (ivtmp.31_129 != _133)
    goto <bb 5>; [83.33%]
  else
    goto <bb 6>; [16.67%]
which looks wrong to me (because vect_vec_iv_.7_45 and vect__4.8_48 really
should be interleaved when stored into MEM[(int *)vectp_p.9_49]), but I really
don't know what exactly the STORE_LANES does.
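
For reference, my reading of the GCC internals manual is that the vec_store_lanes pattern stores an array of vectors with element j of vector i going to memory position j * c + i, where c is the number of vectors. The following is a minimal C model under that assumption; the function name model_store_lanes2 is hypothetical, and big-endian lane numbering is deliberately not modeled, so this does not settle whether the armeb output above is correct.

```c
#include <stdint.h>

/* Hypothetical scalar model of STORE_LANES for two 4-element vectors,
   assuming the semantics mem[j * c + i] = vecs[i][j] with c == 2.
   Target-specific (e.g. big-endian) lane numbering is not modeled.  */
void model_store_lanes2(int32_t *mem, const int32_t vec0[4],
                        const int32_t vec1[4])
{
    const int32_t *vecs[2] = { vec0, vec1 };

    for (int j = 0; j < 4; j++)        /* lane within each vector */
        for (int i = 0; i < 2; i++)    /* which vector of the array */
            mem[j * 2 + i] = vecs[i][j];
}
```

Under this model, storing { 0, 1, 2, 3 } and { 1, 2, 3, 4 } would write 0, 1, 1, 2, 2, 3, 3, 4 to memory, i.e. the interleaving would be done by the store itself.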
