https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88276

            Bug ID: 88276
           Summary: AVX512: reorder bit ops to get free and operation
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

[code]
#include <immintrin.h>
#include <stdint.h>

int test1(const __m128i* src, int mask)
{
    __m128i v = _mm_load_si128(src);
    /* 8-bit mask: one bit per 16-bit element */
    int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
    /* shift the compare result, then AND with mask */
    return (cmp << 1) & mask;
}

int test2(const __m128i* src, int mask)
{
    __m128i v = _mm_load_si128(src);
    int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
    /* shift mask right instead, AND, then shift back; equivalent to test1 */
    return (cmp & (mask >> 1)) << 1;
}
[/code]

test1() shifts the result of _mm_cmpeq_epi16_mask() first and then ANDs it with
mask. In test2() mask is shifted right first, ANDed with the compare result, and
the result is shifted left again. Since the result of _mm_cmpeq_epi16_mask()
uses only 8 bits, both versions are equivalent.
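To illustrate the equivalence (a standalone sketch, not part of the original
report): because cmp fits in the low 8 bits, shifting cmp left or shifting mask
right selects the same bits.

[code]
#include <stdint.h>

/* Returns 1 for every 8-bit cmp value: bit i+1 of the result is set in both
   forms exactly when bit i of cmp and bit i+1 of mask are set, and neither
   form can produce any bits above bit 8. */
int forms_agree(uint8_t cmp, int mask)
{
    return ((cmp << 1) & mask) == ((cmp & (mask >> 1)) << 1);
}
[/code]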

This compiles to the following asm code with gcc 8.2 and -O3
-march=skylake-avx512:

[asm]
test1(long long __vector(2) const*, int):
        vpxor   xmm0, xmm0, xmm0
        vpcmpeqw        k1, xmm0, XMMWORD PTR [rdi]
        kmovb   edx, k1
        lea     eax, [rdx+rdx]
        and     eax, esi
        ret
test2(long long __vector(2) const*, int):
        mov     eax, esi
        sar     eax
        vpxor   xmm0, xmm0, xmm0
        kmovb   k2, eax
        vpcmpeqw        k1{k2}, xmm0, XMMWORD PTR [rdi]
        kmovb   eax, k1
        add     eax, eax
        ret
[/asm]

Such a reordering can lead to more efficient code, because with AVX512 the AND
can be merged into the vpcmpeqw instruction as a write mask. In my case this was
part of a bigger function that performed a series of such calculations on an
array, and after this change it ran faster.
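For comparison, here is a hand-written version of the merged form (a sketch
using the standard masked-compare intrinsic; test3 is not part of the original
report). It corresponds to the asm gcc already emits for test2() above:

[code]
#include <immintrin.h>

int test3(const __m128i* src, int mask)
{
    __m128i v = _mm_load_si128(src);
    /* Use (mask >> 1) as a write mask so the AND is folded into the
       compare itself (vpcmpeqw with a {k} mask register). */
    __mmask8 k = (__mmask8)(mask >> 1);
    int cmp = _mm_mask_cmpeq_epi16_mask(k, v, _mm_setzero_si128());
    return cmp << 1;
}
[/code]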
