https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80517
            Bug ID: 80517
           Summary: [missed optimization] constant propagation through Intel
                    intrinsics
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: kretz at kde dot org
  Target Milestone: ---

Related: #55894

Testcase:

#include <x86intrin.h>
int f() {
  __m128i x{};
  x = _mm_cmpeq_epi16(x, x);
  return _pext_u32(_mm_movemask_epi8(x), 0xaaaa);
}

(compile with `-mbmi2 -O3 -std=c++14`; see also https://godbolt.org/g/n92wEc)

This should compile to

f():
  movl $0xff, %eax
  ret

Clang already implements constant propagation for this testcase, except for
`pext` (see the godbolt link).

This is just a precursor to the following testcase:

#include <x86intrin.h>
auto g(__m128i x, __m128i y) {
  __m128i mask0 = _mm_cmpeq_epi16(x, y);
  auto bits = _pext_u32(_mm_movemask_epi8(mask0), 0xaaaa);
  __m128i mask = _mm_set1_epi16(bits);
  mask = _mm_and_si128(mask, _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128));
  mask = _mm_cmpeq_epi16(mask, _mm_setzero_si128());
  mask = _mm_xor_si128(mask, _mm_cmpeq_epi16(mask, mask));
  return mask;
}

This should compile to

g():
  vpcmpeqw %xmm0, %xmm1, %xmm0
  ret

I.e., the xmm mask `mask0` is converted to a bitmask and back to an xmm mask,
and the round trip should fold away completely. Similar patterns exist for all
arithmetic types for SSE and AVX. If you like, I can produce a list of
testcases covering all vector element types for SSE and AVX.

Motivation: An ABI-stable mask type for x86 requires a storage format that is
independent of the ISA extensions available on the specific x86 CPU. In light
of AVX512, the most sensible choice for such mask storage is std::bitset<N>.
This is a natural fit for AVX512 masks, but it requires frequent conversion
to/from xmm and ymm masks whenever AVX/SSE registers are used. If the patterns
above are optimized, it would go a long way toward reducing the cost of using
the ABI-stable types.

Reference: https://wg21.link/p0214
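
For reference, the expected constant in the first testcase can be recomputed in
plain C++. The following small program is my illustration, not part of the
report; `pext_ref` is a hypothetical scalar stand-in for `_pext_u32`:

#include <cstdint>
#include <cstdio>

// Scalar stand-in for _pext_u32: for each set bit of mask (from LSB upward),
// copy the corresponding bit of src into the next result position.
static std::uint32_t pext_ref(std::uint32_t src, std::uint32_t mask)
{
    std::uint32_t result = 0;
    for (std::uint32_t out = 1; mask != 0; mask &= mask - 1, out <<= 1) {
        std::uint32_t lowest = mask & (0u - mask);  // lowest set bit of mask
        if (src & lowest)
            result |= out;
    }
    return result;
}

int main()
{
    // _mm_cmpeq_epi16(0, 0) sets every 16-bit lane to 0xFFFF, so every byte has
    // its sign bit set and _mm_movemask_epi8 returns 0xFFFF.
    std::uint32_t movemask = 0xFFFFu;
    // 0xAAAA keeps one bit per 16-bit lane (the odd bit positions).
    std::printf("0x%x\n", static_cast<unsigned>(pext_ref(movemask, 0xAAAAu)));
    // prints 0xff, matching the expected "movl $0xff, %eax".
}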
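
Why the second testcase should fold to a single compare (my summary, not text
from the report): `_mm_movemask_epi8` duplicates each 16-bit lane's compare
result into two sign bits, and `pext` with 0xaaaa keeps one of them per lane,
so bit i of `bits` is lane i's compare result. The set1/and/cmpeq-with-zero
sequence rebuilds the complement of that lane mask, and the final xor with the
all-ones vector undoes the complement. The round trip therefore reproduces
`mask0` exactly, so g() is equivalent to the sketch below (the name g_expected
is mine):

#include <x86intrin.h>

// Hand-simplified equivalent of g(), spelling out the expected folding as
// source code.
auto g_expected(__m128i x, __m128i y)
{
    // should assemble to: vpcmpeqw %xmm0, %xmm1, %xmm0; ret
    return _mm_cmpeq_epi16(x, y);
}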