https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80517
            Bug ID: 80517
           Summary: [missed optimization] constant propagation through Intel
                    intrinsics
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---

Related: #55894

Testcase:

#include <x86intrin.h>
int f() {
  __m128i x{};
  x = _mm_cmpeq_epi16(x, x);
  return _pext_u32(_mm_movemask_epi8(x), 0xaaaa);
}

(compile with `-mbmi2 -O3 -std=c++14`; see also https://godbolt.org/g/n92wEc)

This should compile to

f():
  movl $0xff, %eax
  ret

Clang already implements constant propagation for this testcase, except for
`pext` (see the godbolt link).

This is just a precursor to the following testcase:

#include <x86intrin.h>
auto g(__m128i x, __m128i y) {
  __m128i mask0 = _mm_cmpeq_epi16(x, y);
  auto bits = _pext_u32(_mm_movemask_epi8(mask0), 0xaaaa);
  __m128i mask = _mm_set1_epi16(bits);
  mask = _mm_and_si128(mask, _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128));
  mask = _mm_cmpeq_epi16(mask, _mm_setzero_si128());
  mask = _mm_xor_si128(mask, _mm_cmpeq_epi16(mask, mask));
  return mask;
}

This should compile to

g():
  vpcmpeqw %xmm0, %xmm1, %xmm0
  ret

I.e., the xmm mask `mask0` is converted to a bitmask and back to an xmm mask,
and the round trip should fold away completely. Similar patterns exist for all
arithmetic types for SSE and AVX. If you like, I can produce a list of
testcases covering all vector element types for SSE and AVX.

Motivation: An ABI-stable mask type for x86 requires a storage format that is
independent of the ISA extensions available on the specific x86 CPU. In light
of AVX512, the most sensible choice for such mask storage is std::bitset<N>.
This is a natural fit for AVX512 masks, but it requires frequent conversion
to/from xmm and ymm masks whenever AVX/SSE registers are used. If the patterns
above are optimized, it would go a long way toward reducing the cost of using
the ABI-stable types.

Reference: https://wg21.link/p0214
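
For reference, the expected constant in the first testcase can be recomputed in
plain C++. The following small program is my illustration, not part of the
report; `pext_ref` is a hypothetical scalar stand-in for `_pext_u32`:

#include <cstdint>
#include <cstdio>

// Scalar stand-in for _pext_u32: for each set bit of mask (from LSB upward),
// copy the corresponding bit of src into the next result position.
static std::uint32_t pext_ref(std::uint32_t src, std::uint32_t mask)
{
    std::uint32_t result = 0;
    for (std::uint32_t out = 1; mask != 0; mask &= mask - 1, out <<= 1) {
        std::uint32_t lowest = mask & (0u - mask);  // lowest set bit of mask
        if (src & lowest)
            result |= out;
    }
    return result;
}

int main()
{
    // _mm_cmpeq_epi16(0, 0) sets every 16-bit lane to 0xFFFF, so every byte has
    // its sign bit set and _mm_movemask_epi8 returns 0xFFFF.
    std::uint32_t movemask = 0xFFFFu;
    // 0xAAAA keeps one bit per 16-bit lane (the odd bit positions).
    std::printf("0x%x\n", static_cast<unsigned>(pext_ref(movemask, 0xAAAAu)));
    // prints 0xff, matching the expected "movl $0xff, %eax".
}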
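
Why the second testcase should fold to a single compare (my summary, not text
from the report): `_mm_movemask_epi8` duplicates each 16-bit lane's compare
result into two sign bits, and `pext` with 0xaaaa keeps one of them per lane,
so bit i of `bits` is lane i's compare result. The set1/and/cmpeq-with-zero
sequence rebuilds the complement of that lane mask, and the final xor with the
all-ones vector undoes the complement. The round trip therefore reproduces
`mask0` exactly, so g() is equivalent to the sketch below (the name g_expected
is mine):

#include <x86intrin.h>

// Hand-simplified equivalent of g(), spelling out the expected folding as
// source code.
auto g_expected(__m128i x, __m128i y)
{
    // should assemble to: vpcmpeqw %xmm0, %xmm1, %xmm0; ret
    return _mm_cmpeq_epi16(x, y);
}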