https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82855
Bug ID: 82855
Summary: AVX512: replace OP+movemask with OP_mask+ktest
Product: gcc
Version: 7.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: bugzi...@poradnik-webmastera.com
Target Milestone: ---

Before AVX512, AVX/SSE code was written as in the test1 function below: some operation(s) created a mask in a vector register, it was then converted to an int with a movemask instruction, and the resulting int value was used in another expression, e.g. compared with some constant. AVX512 added the new k0..k7 mask registers and a set of instructions (exposed as intrinsics with a _mask suffix) that write to them instead of creating the mask in a vector register. So the test2 function is a simple attempt to rewrite test1 with the new instructions:

#include "immintrin.h"

bool test1(void* ptr)
{
    __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
    v = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
    return 0 == _mm256_movemask_epi8(v);
}

bool test2(void* ptr)
{
    __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
    __mmask8 m = _mm256_cmpeq_epi32_mask(v, _mm256_setzero_si256());
    return 0 == m;
}

I have tried to compile this using Compiler Explorer at https://godbolt.org/ with the following options:

-O3 -mavx -ftree-vectorize -mbmi -mpopcnt -mbmi2 -mavx2 -mavx512f -mavx512cd -mavx512vl -mavx512bw -mavx512dq

gcc 7.2 and gcc trunk created the following code:

test1(void*):
        vmovdqu8 xmm0, XMMWORD PTR [rdi]
        vinserti128 ymm1, ymm0, XMMWORD PTR [rdi+16], 0x1
        vpxord xmm0, xmm0, xmm0
        vpcmpeqd ymm0, ymm0, ymm1
        vpmovmskb eax, ymm0
        test eax, eax
        sete al
        ret
test2(void*):
        vmovdqu8 xmm0, XMMWORD PTR [rdi]
        vpxord xmm1, xmm1, xmm1
        vinserti128 ymm0, ymm0, XMMWORD PTR [rdi+16], 0x1
        vpcmpeqd k1, ymm0, ymm1
        kmovb eax, k1
        test al, al
        sete al
        ret

clang 5.0.0 created this:

test1(void*): # @test1(void*)
        vpxor ymm0, ymm0, ymm0
        vpcmpeqd k0, ymm0, ymmword ptr [rdi]
        vpmovm2d ymm0, k0
        vpmovmskb eax, ymm0
        test eax, eax
        sete al
        vzeroupper
        ret
test2(void*): # @test2(void*)
        vpxor ymm0, ymm0, ymm0
        vpcmpeqd k0, ymm0, ymmword ptr [rdi]
        ktestb k0, k0
        sete al
        vzeroupper
        ret

The gcc output does not look optimal. The clang output for test2 is better: it uses ktestb instead of kmovb+test. gcc should be able to do this too.

There is also one more possible optimization which can be applied to test1: automatically replace an OP and movemask instruction pair with a single OP_mask instruction. Something like this is already performed for FMA3, where gcc is able to replace a mul/add instruction pair with one FMA instruction.

I do not have access to any machine with AVX512, so I cannot perform any benchmarks. However, this kind of optimization looks promising, so it is worth exploring.
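
For reference, here is a small scalar sketch of why replacing the OP+movemask pair in test1 with OP_mask should be safe when the result is only compared with zero. It does not use any intrinsics, so it runs without AVX512 hardware. The Lanes type and all function names are hypothetical and only model the instructions: the movemask form yields four identical bits per 32-bit lane, the mask form yields one bit per lane, so one is zero exactly when the other is.

```cpp
#include <array>
#include <cstdint>

// Hypothetical model of one 256-bit vector holding eight 32-bit lanes.
using Lanes = std::array<std::uint32_t, 8>;

// Models vpcmpeqd (vector result) followed by vpmovmskb:
// a lane equal to zero becomes all-ones, and the top bit of each of its
// four bytes is gathered, giving four identical bits per lane.
std::uint32_t movemask_of_cmpeq_zero(const Lanes& v) {
    std::uint32_t bits = 0;
    for (int lane = 0; lane < 8; ++lane) {
        if (v[lane] == 0) bits |= 0xFu << (4 * lane);
    }
    return bits;
}

// Models vpcmpeqd with a k register destination: one bit per lane.
std::uint8_t mask_of_cmpeq_zero(const Lanes& v) {
    std::uint8_t bits = 0;
    for (int lane = 0; lane < 8; ++lane) {
        if (v[lane] == 0) bits |= static_cast<std::uint8_t>(1u << lane);
    }
    return bits;
}

// Scalar counterparts of test1 and test2 from the report.
bool test1_model(const Lanes& v) { return 0 == movemask_of_cmpeq_zero(v); }
bool test2_model(const Lanes& v) { return 0 == mask_of_cmpeq_zero(v); }
```

Note that the two mask values themselves differ (32 bits versus 8 bits), so the replacement is only this direct when the movemask result is tested against zero. Comparisons with other constants would need the constant to be translated as well.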