https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82855
Bug ID: 82855
Summary: AVX512: replace OP+movemask with OP_mask+ktest
Product: gcc
Version: 7.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: [email protected]
Target Milestone: ---
Before AVX512, AVX/SSE code was typically written like the test1 function
below: one or more operations created a mask in a vector register, a movemask
instruction converted it to an int, and the resulting int value was used in
another expression, e.g. compared with a constant. AVX512 added the new mask
registers k0..k7 and a set of instructions (exposed as intrinsics with the
_mask suffix) that write to them instead of creating the mask in a vector
register. The test2 function is a simple attempt to rewrite test1 with the new
instructions:
#include "immintrin.h"
bool test1(void* ptr)
{
    __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
    v = _mm256_cmpeq_epi32(v, _mm256_setzero_si256());
    return 0 == _mm256_movemask_epi8(v);
}
bool test2(void* ptr)
{
    __m256i v = _mm256_loadu_si256((const __m256i*)ptr);
    __mmask8 m = _mm256_cmpeq_epi32_mask(v, _mm256_setzero_si256());
    return 0 == m;
}
I compiled this using Compiler Explorer at https://godbolt.org/ with the
following options:
-O3 -mavx -ftree-vectorize -mbmi -mpopcnt -mbmi2 -mavx2 -mavx512f -mavx512cd
-mavx512vl -mavx512bw -mavx512dq
gcc 7.2 and gcc trunk generated the following code:
test1(void*):
vmovdqu8 xmm0, XMMWORD PTR [rdi]
vinserti128 ymm1, ymm0, XMMWORD PTR [rdi+16], 0x1
vpxord xmm0, xmm0, xmm0
vpcmpeqd ymm0, ymm0, ymm1
vpmovmskb eax, ymm0
test eax, eax
sete al
ret
test2(void*):
vmovdqu8 xmm0, XMMWORD PTR [rdi]
vpxord xmm1, xmm1, xmm1
vinserti128 ymm0, ymm0, XMMWORD PTR [rdi+16], 0x1
vpcmpeqd k1, ymm0, ymm1
kmovb eax, k1
test al, al
sete al
ret
clang 5.0.0 created this:
test1(void*): # @test1(void*)
vpxor ymm0, ymm0, ymm0
vpcmpeqd k0, ymm0, ymmword ptr [rdi]
vpmovm2d ymm0, k0
vpmovmskb eax, ymm0
test eax, eax
sete al
vzeroupper
ret
test2(void*): # @test2(void*)
vpxor ymm0, ymm0, ymm0
vpcmpeqd k0, ymm0, ymmword ptr [rdi]
ktestb k0, k0
sete al
vzeroupper
ret
The gcc output does not look optimal. The clang output for test2 is better: it
uses ktestb instead of kmovb+test. gcc should be able to do this too.
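To illustrate why a single ktestb can stand in for kmovb+test here, below is a small scalar model of the flags ktestb sets, as I understand them from the instruction description. The names KtestFlags, ktestb_model and mask_is_zero are hypothetical and exist only for this sketch; it models the flag logic and is not compiler output.

```cpp
#include <cstdint>

// Hypothetical scalar model of the flags KTESTB sets for two 8-bit masks.
struct KtestFlags {
    bool zf;  // set when (a AND b) has no bits set
    bool cf;  // set when ((NOT a) AND b) has no bits set
};

inline KtestFlags ktestb_model(std::uint8_t a, std::uint8_t b) {
    KtestFlags f;
    f.zf = (static_cast<std::uint8_t>(a & b) == 0);
    f.cf = (static_cast<std::uint8_t>(~a & b) == 0);
    return f;
}

// Models "ktestb k0, k0" followed by "sete al": with both operands equal,
// ZF is set exactly when the mask is all zeros, which is what test2 returns.
inline bool mask_is_zero(std::uint8_t m) {
    return ktestb_model(m, m).zf;
}
```

With both operands equal, the AND is the mask itself, so the zero flag directly answers "0 == m" without first moving the mask into a general-purpose register.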
There is one more possible optimization, which applies to test1: automatically
replace an OP + movemask instruction pair with the corresponding OP_mask
instruction. Something similar is already done for FMA3, where gcc replaces a
mul/add instruction pair with a single FMA instruction. I do not have access
to any machine with AVX512, so I cannot run benchmarks. However, this kind of
optimization looks promising and is worth exploring.
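The rewrite is only valid when the two forms agree on the value that is actually used. Below is a portable scalar sketch of that check for the compare-with-zero case in test1/test2. It does not use the intrinsics; Lanes, cmpeq_then_movemask and cmpeq_mask are hypothetical names that model their behaviour on eight 32-bit lanes.

```cpp
#include <array>
#include <cstdint>

using Lanes = std::array<std::uint32_t, 8>;  // eight 32-bit lanes of a 256-bit vector

// Models _mm256_cmpeq_epi32 followed by _mm256_movemask_epi8:
// an equal lane becomes four 0xFF bytes, so it contributes four mask bits.
inline std::uint32_t cmpeq_then_movemask(const Lanes& a, const Lanes& b) {
    std::uint32_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        if (a[i] == b[i]) {
            bits |= 0xFu << (4 * i);
        }
    }
    return bits;
}

// Models _mm256_cmpeq_epi32_mask: one mask bit per 32-bit lane.
inline std::uint8_t cmpeq_mask(const Lanes& a, const Lanes& b) {
    std::uint8_t bits = 0;
    for (int i = 0; i < 8; ++i) {
        if (a[i] == b[i]) {
            bits = static_cast<std::uint8_t>(bits | (1u << i));
        }
    }
    return bits;
}
```

The two masks have different widths (32 bits versus 8 bits), so they are equal as integers only when both are zero. A comparison against zero carries over unchanged, while a comparison against any other constant would need the constant translated as well.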