https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121744
--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
Another inefficiency in this testcase is that, of course, DR analysis of the
std::bitset load fails, as { 0, +, 1 } >> 6 isn't affine. So we get an
(emulated) gather that we fail to fully untangle into sensible code:
# vect_vec_iv_.22_88 = PHI <_89(5), { 0, 1, 2, 3, 4, 5, 6, 7 }(2)>
vect__1.23_90 = [vec_unpack_lo_expr] vect_vec_iv_.22_88;
vect__1.23_91 = [vec_unpack_hi_expr] vect_vec_iv_.22_88;
vect__15.24_92 = vect__1.23_90 >> 6;
vect__15.24_93 = vect__1.23_91 >> 6;
_95 = BIT_FIELD_REF <vect__15.24_92, 64, 0>;
_96 = _95 * 8;
_97 = _94 + _96;
_98 = (void *) _97;
_99 = MEM[(long unsigned int *)_98];
_100 = BIT_FIELD_REF <vect__15.24_92, 64, 64>;
_101 = _100 * 8;
_102 = _94 + _101;
_103 = (void *) _102;
_104 = MEM[(long unsigned int *)_103];
_105 = BIT_FIELD_REF <vect__15.24_92, 64, 128>;
_106 = _105 * 8;
_107 = _94 + _106;
_108 = (void *) _107;
_109 = MEM[(long unsigned int *)_108];
_110 = BIT_FIELD_REF <vect__15.24_92, 64, 192>;
_111 = _110 * 8;
_112 = _94 + _111;
_113 = (void *) _112;
_114 = MEM[(long unsigned int *)_113];
vect__10.25_115 = {_99, _104, _109, _114};
...
etc., which is the literal (unsigned long)i >> 6 gather.
Directly pattern-matching arr[i >> 6] to require a VF of 1 << 6, thus
making the vector IV affine, and then realizing we can directly load
a 64-bit AVX512 mask register would be perfect here.
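
To illustrate the VF = 1 << 6 idea in source terms, here is a rough sketch.
It is not the testcase from this PR: sum_scalar, sum_blocked and the
"words" array (standing in for the 64-bit words backing the bitset, with
bit i of the bitset at bit i & 63 of words[i >> 6]) are hypothetical names
made up for the example. With 64 iterations per block the word index
advances by exactly one per block, and the loaded word is the lane mask.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar reference: one bit test per iteration, i.e. words[i >> 6] per i.
template <std::size_t N>
long sum_scalar(const std::bitset<N>& bits, const int* data) {
    long sum = 0;
    for (std::size_t i = 0; i < N; ++i)
        if (bits[i])
            sum += data[i];
    return sum;
}

// Blocked form with 64 iterations per block. "words" is a hypothetical
// stand-in for the bitset's backing storage (assumed 64-bit words).
long sum_blocked(const std::uint64_t* words, const int* data, std::size_t n) {
    assert(n % 64 == 0);  // assumption: no epilogue handling in this sketch
    long sum = 0;
    for (std::size_t i = 0; i < n; i += 64) {
        // One scalar load per block; i >> 6 is affine in the block counter.
        const std::uint64_t mask = words[i >> 6];
        // Inner loop models the 64 lanes; bit "lane" of mask selects the lane.
        for (std::size_t lane = 0; lane < 64; ++lane)
            if ((mask >> lane) & 1)
                sum += data[i + lane];
    }
    return sum;
}
```

The inner lane loop is what a masked vector operation would cover; the
sketch only shows that a single 64-bit load per block carries all the
predicate bits.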
But as a half-way step we could detect a set of "derived inductions": in
this case (unsigned long)i >> 6 should be known to be uniform across the
lanes of each vector iteration, and thus the gather should degenerate to a
splat of a single load.
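
A small sketch of that uniformity argument, assuming VF = 8 as in the dump
above and a vector IV that starts at a multiple of VF. The function names
and the "words" array are hypothetical and only model the emulated gather
and its degenerate splat form, not actual vectorizer output.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t VF = 8;  // vectorization factor taken from the dump

// Emulated gather: each lane computes its own index (i + lane) >> 6.
std::array<std::uint64_t, VF> gather_words(const std::uint64_t* words,
                                           std::size_t i) {
    std::array<std::uint64_t, VF> out{};
    for (std::size_t lane = 0; lane < VF; ++lane)
        out[lane] = words[(i + lane) >> 6];
    return out;
}

// Degenerate form: since 64 is a multiple of VF and i is a multiple of VF,
// i .. i + VF - 1 never cross a 64-element boundary, so every lane has the
// same index i >> 6 and one load can be splat to all lanes.
std::array<std::uint64_t, VF> splat_word(const std::uint64_t* words,
                                         std::size_t i) {
    assert(i % VF == 0);  // assumption: the vector IV starts lane-aligned
    std::array<std::uint64_t, VF> out{};
    out.fill(words[i >> 6]);
    return out;
}
```

Both functions return the same vector for every aligned i, which is the
property a "derived induction" analysis would need to establish.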