https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122074
--- Comment #5 from rockeet <rockeet at gmail dot com> ---
(In reply to Andrew Pinski from comment #4)
> > Suffix "_u" in __m256i_u emphasizes we are using an unaligned vector
> > which should be processed specially
>
> No it does not mean that. It does mean it is unaligned.
> And gcc uses an unaligned load even:
> vmovdqu ymm1, YMMWORD PTR [rdi]
>
> And which is why at -O0, the loads are via bytes.
>
>
> Now there is a missed optimization of not fusing the load into the compare.
Fusing the load into the compare would be excellent; I think GCC should also
fuse a masked load into the compare:
```
#include <immintrin.h>
#include <stddef.h>
typedef unsigned char byte_t;

size_t avx512_search_byte_max32(const byte_t* data, size_t len, byte_t key) {
    __mmask32 k = _bzhi_u32(-1u, len); /* mask for the first len lanes (len <= 32) */
    __m256i d = _mm256_maskz_loadu_epi8(k, data);
    return _tzcnt_u32(_mm256_mask_cmpeq_epi8_mask(k, d, _mm256_set1_epi8(key)));
}
```
The masked load and the compare use the same mask register, so they should be
fused as well.
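
For reference, here is a portable scalar sketch (my addition, not part of the
bug report) of what the intrinsic version above computes: the index of the
first byte equal to key within data[0..len), or 32 when nothing matches
(since _tzcnt_u32 of a zero mask is 32). The name search_byte_max32_ref is
hypothetical.
```
#include <stddef.h>

/* Scalar equivalent of avx512_search_byte_max32 for len <= 32:
 * returns the index of the first occurrence of key, or 32 if absent
 * (mirroring _tzcnt_u32(0) == 32 in the vector version). */
static size_t search_byte_max32_ref(const unsigned char *data, size_t len,
                                    unsigned char key) {
    for (size_t i = 0; i < len; i++)
        if (data[i] == key)
            return i;
    return 32;
}
```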