https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121744

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another inefficiency in this testcase is that, of course, DR analysis of the
std::bitset load fails, as { 0, +, 1 } >> 6 isn't affine.  So we get an
(emulated) gather that we fail to fully untangle into sensible code:

  # vect_vec_iv_.22_88 = PHI <_89(5), { 0, 1, 2, 3, 4, 5, 6, 7 }(2)>
  vect__1.23_90 = [vec_unpack_lo_expr] vect_vec_iv_.22_88;
  vect__1.23_91 = [vec_unpack_hi_expr] vect_vec_iv_.22_88;
  vect__15.24_92 = vect__1.23_90 >> 6;
  vect__15.24_93 = vect__1.23_91 >> 6;
  _95 = BIT_FIELD_REF <vect__15.24_92, 64, 0>;
  _96 = _95 * 8;
  _97 = _94 + _96;
  _98 = (void *) _97;
  _99 = MEM[(long unsigned int *)_98];
  _100 = BIT_FIELD_REF <vect__15.24_92, 64, 64>;
  _101 = _100 * 8;
  _102 = _94 + _101;
  _103 = (void *) _102;
  _104 = MEM[(long unsigned int *)_103];
  _105 = BIT_FIELD_REF <vect__15.24_92, 64, 128>;
  _106 = _105 * 8;
  _107 = _94 + _106;
  _108 = (void *) _107;
  _109 = MEM[(long unsigned int *)_108];
  _110 = BIT_FIELD_REF <vect__15.24_92, 64, 192>;
  _111 = _110 * 8;
  _112 = _94 + _111;
  _113 = (void *) _112;
  _114 = MEM[(long unsigned int *)_113];
  vect__10.25_115 = {_99, _104, _109, _114};
...

etc., which is the literal (unsigned long)i >> 6 gather.
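
To illustrate why the index is not affine, here is a minimal sketch (the
helper name word_index_step is hypothetical, not anything in GCC): an affine
access advances by the same step on every iteration, whereas i >> 6 advances
by 0 for 63 iterations and then by 1.

```cpp
#include <cstddef>

// Step between consecutive values of the word index i >> 6.
// (Hypothetical helper, for illustration only.)
constexpr std::size_t word_index_step(std::size_t i) {
    return ((i + 1) >> 6) - (i >> 6);
}

// An affine index would have one constant step; this one has two.
static_assert(word_index_step(0) == 0, "no advance inside a 64-bit word");
static_assert(word_index_step(63) == 1, "advance by one at a word boundary");
```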

Directly pattern-matching arr[i >> 6] to require a VF of 1 << 6, thus
making the vector IV affine, and then realizing we can directly load
a 64-bit AVX512 mask register would be perfect here.
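
A rough scalar model of that idea, assuming the loop tests bit i of the
bitset (the testcase is not quoted here, so that is a guess) and that the
storage is an array of 64-bit words as in the dump above.  With VF = 64,
vector iteration k covers exactly word k, and the per-lane bit tests
reassemble that word, so the word itself can serve as the 64-lane mask.  The
function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Per-lane view: test bit i for each of the 64 lanes of vector iteration k
// and collect the results into a 64-bit lane mask.
std::uint64_t lane_mask_scalar(const std::uint64_t *words, std::size_t k) {
    std::uint64_t mask = 0;
    for (std::size_t lane = 0; lane < 64; ++lane) {
        std::size_t i = 64 * k + lane;
        std::uint64_t bit = (words[i >> 6] >> (i & 63)) & 1u;
        mask |= bit << lane;
    }
    return mask;
}

// Direct view: with VF = 64 the mask is just word k, a single 64-bit load.
std::uint64_t lane_mask_direct(const std::uint64_t *words, std::size_t k) {
    return words[k];
}
```

Both functions return the same value for any k, which is what would let the
vectorizer replace the gather plus bit tests by one load.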

As a half-way step we could detect a set of "derived inductions": in this
case (unsigned long)i >> 6 should be known to be uniform through the vector
loop, and thus the gather should degenerate to a splat of a single load.
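
A small sketch of why the splat is equivalent, under the assumptions that
VF = 8 (matching the 8-lane IV in the dump) and that each vector iteration
starts at a multiple of VF.  Then all lanes base .. base + 7 fall in the same
64-bit word, so one load broadcast to every lane gives the same vector as the
emulated gather.  Names here are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t VF = 8;  // assumed vectorization factor

// Emulated gather: one scalar load per lane, as in the dump.
std::array<std::uint64_t, VF> gather_words(const std::uint64_t *words,
                                           std::size_t base) {
    std::array<std::uint64_t, VF> out{};
    for (std::size_t lane = 0; lane < VF; ++lane)
        out[lane] = words[(base + lane) >> 6];
    return out;
}

// Degenerate form: base >> 6 is uniform across lanes, so load once and splat.
std::array<std::uint64_t, VF> splat_word(const std::uint64_t *words,
                                         std::size_t base) {
    assert(base % VF == 0);  // assumption: iterations start VF-aligned
    std::array<std::uint64_t, VF> out{};
    out.fill(words[base >> 6]);
    return out;
}
```

The equivalence relies on VF dividing 64; for a larger or unaligned VF the
lanes could straddle a word boundary and the index would no longer be uniform.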
