https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88276
Bug ID: 88276
Summary: AVX512: reorder bit ops to get free and operation
Product: gcc
Version: 8.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: bugzi...@poradnik-webmastera.com
Target Milestone: ---

[code]
#include <immintrin.h>
#include <stdint.h>

int test1(const __m128i* src, int mask)
{
    __m128i v = _mm_load_si128(src);
    int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
    return (cmp << 1) & mask;
}

int test2(const __m128i* src, int mask)
{
    __m128i v = _mm_load_si128(src);
    int cmp = _mm_cmpeq_epi16_mask(v, _mm_setzero_si128());
    return (cmp & (mask >> 1)) << 1;
}
[/code]

test1() shifts the result of _mm_cmpeq_epi16_mask() first and then ands it with mask. In test2() the mask is shifted right first, and-ed with the cmp result, and the combined value is shifted left again. Because the result of _mm_cmpeq_epi16_mask() uses only 8 bits, both versions are equivalent. They compile to the following asm, using gcc 8.2 with -O3 -march=skylake-avx512:

[asm]
test1(long long __vector(2) const*, int):
        vpxor   xmm0, xmm0, xmm0
        vpcmpeqw        k1, xmm0, XMMWORD PTR [rdi]
        kmovb   edx, k1
        lea     eax, [rdx+rdx]
        and     eax, esi
        ret
test2(long long __vector(2) const*, int):
        mov     eax, esi
        sar     eax
        vpxor   xmm0, xmm0, xmm0
        kmovb   k2, eax
        vpcmpeqw        k1{k2}, xmm0, XMMWORD PTR [rdi]
        kmovb   eax, k1
        add     eax, eax
        ret
[/asm]

Performing this reordering automatically could lead to more efficient code, because with AVX512 the and operation can be folded into the vpcmpeqw instruction as a write mask. In my case this was part of a larger function performing a series of such calculations over an array, and after making this change by hand it ran faster.
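For reference, a minimal sketch (not part of the original report) of how the same folding can be expressed directly with intrinsics, assuming AVX512BW/VL; the function name test2_explicit is hypothetical:

[code]
#include <immintrin.h>

/* Hand-written form of what test2() compiles to: the and with (mask >> 1)
   becomes the write mask of the compare, so no separate and instruction
   is needed afterwards. */
int test2_explicit(const __m128i* src, int mask)
{
    __m128i v = _mm_load_si128(src);
    /* Only lanes whose bit is set in (mask >> 1) can produce a 1 in cmp. */
    __mmask8 cmp = _mm_mask_cmpeq_epi16_mask((__mmask8)(mask >> 1),
                                             v, _mm_setzero_si128());
    return (int)cmp << 1;
}
[/code]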