https://gcc.gnu.org/bugzilla/show_bug.cgi?id=122074

Hongtao Liu <liuhongt at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |liuhongt at gcc dot gnu.org

--- Comment #7 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
(In reply to rockeet from comment #6)
> It is interesting that GCC fused the load into cmp if change the code a
> little:
> 
> size_t avx512_search_byte_max32_2(const byte_t* data, size_t len, byte_t
> key) {
>   __mmask32 k = _bzhi_u32(-1, len);
>   return _tzcnt_u32(_mm256_mask_cmpeq_epi8_mask(k,
>                      *(__m256i_u*)data, _mm256_set1_epi8(key)));
> }
> 
> see https://godbolt.org/z/W8MKTbKPv , it still generated an extra `mov eax,
> eax`

        vpcmpeqb        k0{k1}, ymm0, YMMWORD PTR [rdi] # 99      [c=25 l=6]  *avx512vl_eqv32qi3_mask_1/0
        kmovd   eax, k0       # 122 [c=4 l=3]  *movsi_internal/16
        tzcnt   eax, eax      # 107       [c=4 l=4]  tzcnt_si
        mov     eax, eax  # 110       [c=4 l=2]  *zero_extendsidi2/3

The `mov eax, eax` is a zero_extend from 32-bit to 64-bit, and yes, it looks
redundant, since the upper 32 bits of the tzcnt result must already be zero.
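
As a rough illustration of where the zero_extend comes from, here is a
minimal sketch in portable C. The helper names `tzcnt32` and
`tzcnt32_as_size` are hypothetical; `tzcnt32` is a software stand-in for
`_tzcnt_u32`, used here only so the snippet builds without BMI support. The
32-bit count is widened to `size_t` on return, and since the count is never
larger than 32, that widening cannot change the value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical software stand-in for _tzcnt_u32: counts trailing zero
   bits and returns 32 when the input is zero. */
static uint32_t tzcnt32(uint32_t x)
{
    uint32_t n = 0;

    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Mirrors the return path of the reported function: a 32-bit count is
   converted to a 64-bit size_t, which is the zero_extend in question. */
static size_t tzcnt32_as_size(uint32_t mask)
{
    uint32_t n = tzcnt32(mask);

    /* The count always lies in [0, 32], so its upper bits are zero. */
    assert(n <= 32);
    return (size_t)n;
}
```

On x86-64, a write to a 32-bit register already clears the upper 32 bits of
the corresponding 64-bit register, so the value in rax after `tzcnt eax, eax`
is already the zero-extended result.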
