https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82855

            Bug ID: 82855
           Summary: AVX512: replace OP+movemask with OP_mask+ktest
           Product: gcc
           Version: 7.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bugzi...@poradnik-webmastera.com
  Target Milestone: ---

Before AVX512, AVX/SSE code was written as in the test1 function below: some
operation(s) created a mask in a vector register, the mask was then converted
to an int with a movemask instruction, and the resulting int value was used in
another expression - e.g. compared with some constant. AVX512 added the new
k0..k7 mask registers and a set of instructions with the _mask suffix which
write to them instead of creating the mask in a vector register. The test2
function is a simple attempt to rewrite test1 with the new instructions:

#include "immintrin.h"

bool test1(void* ptr)
{
  __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
  v = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
  return 0 == _mm256_movemask_epi8(v);
}

bool test2(void* ptr)
{
  __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
  __mmask8 m = _mm256_cmpeq_epi32_mask(v, _mm256_setzero_si256());
  return 0 == m;
}
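
For completeness, here is a minimal driver to sanity-check that both variants
compute the same result. It is only a sketch (not part of the reproducer),
since actually running test2 requires an AVX512VL-capable CPU, which I do not
have.

#include <cstdio>

int main()
{
  int zeros[8]   = {0, 0, 0, 0, 0, 0, 0, 0};   // contains zero elements
  int nonzero[8] = {1, 2, 3, 4, 5, 6, 7, 8};   // no zero elements

  // Expected output: "test1: 0 1" and "test2: 0 1"
  // (false when any element is zero, true otherwise).
  std::printf("test1: %d %d\n", test1(zeros), test1(nonzero));
  std::printf("test2: %d %d\n", test2(zeros), test2(nonzero));
  return 0;
}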

I have tried to compile test1 and test2 using Compiler Explorer at
https://godbolt.org/ with the following options:
-O3 -mavx -ftree-vectorize -mbmi -mpopcnt -mbmi2 -mavx2 -mavx512f -mavx512cd
-mavx512vl -mavx512bw -mavx512dq

gcc 7.2 and gcc trunk generated the following code:
test1(void*):
  vmovdqu8 xmm0, XMMWORD PTR [rdi]
  vinserti128 ymm1, ymm0, XMMWORD PTR [rdi+16], 0x1
  vpxord xmm0, xmm0, xmm0
  vpcmpeqd ymm0, ymm0, ymm1
  vpmovmskb eax, ymm0
  test eax, eax
  sete al
  ret
test2(void*):
  vmovdqu8 xmm0, XMMWORD PTR [rdi]
  vpxord xmm1, xmm1, xmm1
  vinserti128 ymm0, ymm0, XMMWORD PTR [rdi+16], 0x1
  vpcmpeqd k1, ymm0, ymm1
  kmovb eax, k1
  test al, al
  sete al
  ret

clang 5.0.0 generated this:

test1(void*): # @test1(void*)
  vpxor ymm0, ymm0, ymm0
  vpcmpeqd k0, ymm0, ymmword ptr [rdi]
  vpmovm2d ymm0, k0
  vpmovmskb eax, ymm0
  test eax, eax
  sete al
  vzeroupper
  ret
test2(void*): # @test2(void*)
  vpxor ymm0, ymm0, ymm0
  vpcmpeqd k0, ymm0, ymmword ptr [rdi]
  ktestb k0, k0
  sete al
  vzeroupper
  ret

The gcc output does not look very optimal. The clang output for test2 is
better: it uses ktestb instead of the kmovb+test pair. gcc should be able to
do this too.

There is also one more possible optimization which can be applied to test1:
automatically replace an OP+movemask instruction pair with the corresponding
OP_mask instruction. Something similar is already done for FMA3: gcc is able
to replace a mul/add instruction pair with a single FMA instruction (see the
sketch below). I do not have access to any machine with AVX512, so I cannot
run any benchmarks. However, this kind of optimization looks promising, so it
is worth exploring.
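
For illustration, here is a small sketch of the FMA3 contraction mentioned
above (my own example, not from the reproducer; exact flags may differ).
Compiled with something like -O2 -mfma, gcc combines the separate multiply
and add into a single vfmadd instruction; the requested optimization would do
the analogous thing to the test1 pattern.

#include "immintrin.h"

// With e.g. -O2 -mfma, gcc contracts the vmulps + vaddps pair below into one
// vfmadd instruction - the same kind of peephole that could fuse
// OP + vpmovmskb into OP_mask + ktest.
__m256 fma_example(__m256 x, __m256 y, __m256 z)
{
  __m256 t = _mm256_mul_ps(x, y);  // multiply
  return _mm256_add_ps(t, z);      // add; fused with the multiply into an FMA
}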
