[Bug c++/71420] New: "‘type’ is not a class type" error for address-of operator overloaded for enum type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71420

            Bug ID: 71420
           Summary: "‘type’ is not a class type" error for address-of
                    operator overloaded for enum type
           Product: gcc
           Version: 5.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

I was not using the gcc compiler directly (this happened in a Travis build), so sorry if the information is misleading or inaccurate. Code example (C++11):

    #include <cstdlib> // for std::abort

    namespace itype
    {
        enum type
        {
            unk = 0,
            add,
            sub,
            //...
        };

        // Address-of operator for decoder<>
        constexpr type operator &(type value)
        {
            return value;
        }
    }

    // Irrelevant, only shows the meaning of the below
    struct interpreter
    {
        static void unk() { std::abort(); };
        static void add();
        static void sub();
        //...
    };

    template <typename D>
    struct decoder
    {
        // Implementation omitted, low-relevant usage example:
        static constexpr auto add = &D::add;
        static constexpr auto sub = &D::sub;
    };

    decoder<interpreter> test1; // OK
    decoder<itype::type> test2; // OK in Clang and MSVC
    // Expected error: ‘itype::type’ is not a class type

    int main()
    {
        return 0;
    }
[Bug tree-optimization/103897] New: x86: Missing optimizations with _mm_undefined_si128 and PMOVSX*
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103897

            Bug ID: 103897
           Summary: x86: Missing optimizations with _mm_undefined_si128 and
                    PMOVSX*
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I was trying to use VPMOVSXWD and other PMOVSX* intrinsics, and also to emulate them for SSE2 targets. I noticed (at least) two distinct problems:

1) (V)PMOVSX** can take a memory operand, but gcc emits a separate load instruction.
2) _mm_undefined_si128() always generates an additional zeroing instruction. Using it in combination with unpack and arithmetic-shift instructions looks like an optimal way to emulate PMOVSX for an SSE2 target.

The Godbolt example includes clang output for comparison: https://godbolt.org/z/KE8q9v6qG

    #include <immintrin.h>

    __attribute__((__target__("avx")))
    void test0(__m128i* dst, __m128i* src)
    {
        // Emits VPMOVSXWD: could fold the load from memory, but emits 2 instructions
        // Looks like gcc 8.5 was doing better
        *dst = _mm_cvtepi16_epi32(*src);
    }

    void test1(__m128i* dst, __m128i* src)
    {
        // Emulate VPMOVSXWD: zeroes a register specifically for _mm_undefined_si128
        *dst = _mm_srai_epi32(_mm_unpacklo_epi16(_mm_undefined_si128(), *src), 16);
    }

    void test2(__m128i* dst, __m128i* src)
    {
        // Zeroes a register, but absolutely could reuse the PSLLW result
        *dst = _mm_srai_epi32(_mm_unpacklo_epi16(_mm_undefined_si128(), _mm_slli_epi16(*src, 1)), 16);
    }

    void test3(__m128i* dst, __m128i* src)
    {
        // Similar to test1, but emulates the "high" VPMOVSXWD
        *dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), *src), 16);
    }

    __attribute__((__target__("avx")))
    void test4(__m128i* dst, __m128i* src)
    {
        // Bonus (not sure what the idiomatic way to MOVSX the high part is)
        *dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), *src), 16);
    }

    __attribute__((__target__("avx")))
    void test5(__m128i* dst, __m128i* src)
    {
        // Emits two zeroing instructions
        *dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), _mm_packs_epi16(_mm_undefined_si128(), *src)), 16);
    }
[Bug c++/103932] New: x86: strange unoptimized code generated (multiple negations of _mm_testz_si128 result)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103932

            Bug ID: 103932
           Summary: x86: strange unoptimized code generated (multiple
                    negations of _mm_testz_si128 result)
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

GCC generates a seemingly unoptimized sequence of instructions in certain cases (I can't tell exactly what triggers it; example code is below):

    xor     eax, eax
    vptest  xmm0, xmm0
    sete    al
    test    eax, eax
    sete    al
    movzx   eax, al

This should be something like:

    xor     eax, eax
    vptest  xmm0, xmm0
    setne   al

https://godbolt.org/z/sTaG65Ksc

Code (-O3 -std=c++20 -march=skylake):

    #include <bit>
    #include <concepts>
    #include <cstdint>
    #include <immintrin.h>

    template <typename T>
    concept Vector128 = (sizeof(T) == 16);

    using u64 = std::uint64_t;
    using u32 = std::uint32_t;

    union alignas(16) v128
    {
        u64 _u64[2];

        v128() = default;

        constexpr v128(const v128&) noexcept = default;

        template <Vector128 T>
        constexpr v128(const T& rhs) noexcept
            : v128(std::bit_cast<v128>(rhs))
        {
        }

        constexpr v128& operator=(const v128&) noexcept = default;

        template <Vector128 T>
        constexpr operator T() const noexcept
        {
            return std::bit_cast<T>(*this);
        }
    };

    // Test if vector is zero
    inline bool gv_testz(const v128& arg)
    {
    #if defined(__SSE4_1__)
        return _mm_testz_si128(arg, arg);
    #else
        return !(arg._u64[0] | arg._u64[1]);
    #endif
    }

    struct alignas(16) context_t
    {
        v128 vec[32];
        v128 sat;
    };

    void test1(context_t& ctx, u32 n)
    {
        const u64 bit = !gv_testz(ctx.sat);
        v128 r;
        r._u64[0] = 0;
        r._u64[1] = bit;
        ctx.vec[n] = r;
    }

    void test2(context_t& ctx, u32 n)
    {
        ctx.vec[n]._u64[1] = !gv_testz(ctx.sat);
    }
[Bug target/103967] New: x86-64: bitfields make inefficient indexing for array with 16 byte+ objects
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103967

            Bug ID: 103967
           Summary: x86-64: bitfields make inefficient indexing for array
                    with 16 byte+ objects
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, this problem is seemingly not specific to GCC and is probably well known. Loading or storing a 16-byte (or larger) vector from an array, using a bitfield as the index, generates code that could in theory be noticeably smaller:

    shr  esi, 12    ; shift bitfield
    and  esi, 31    ; mask bitfield
    sal  rsi, 4     ; unnecessary, also could drop REX prefix for size
    pxor xmm0, XMMWORD PTR [rsi+1024+rdi] ; index + offset addressing

1) The second shift can be fused with the bitfield load.
2) The bitfield load can then be adjusted for scaled indexing (rsi*8).
3) Optionally, the array offset can be precomputed if it is used twice or more, which can result in smaller and potentially faster code.

    shr  esi, 12 - 1          ; adjusted shift
    and  esi, 31 << 1         ; adjusted mask, which fits in an 8-bit immediate
    pxor xmm0, [rdi + rsi * 8] ; precomputed array offset

https://godbolt.org/z/7aa7oaMhn

    #include <immintrin.h>

    struct bitfields
    {
        unsigned dummy : 7;
        unsigned a : 5;
        unsigned b : 5;
        unsigned c : 5;
    };

    struct context
    {
        unsigned dummy[256];
        __m128i data[32];
    };

    void xor_data(context& ctx, bitfields op)
    {
        ctx.data[op.c] = _mm_xor_si128(ctx.data[op.a], ctx.data[op.b]);
    }
[Bug target/103973] New: x86: 4-way comparison of floats/doubles with spaceship operator possibly suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103973

            Bug ID: 103973
           Summary: x86: 4-way comparison of floats/doubles with spaceship
                    operator possibly suboptimal
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I may be missing something here, but the generated code seems strange and suboptimal. It looks like all 4 possible paths could use the flags from a single UCOMISD instruction, instead of executing it up to 3 times in the worst case:

    cmp4way(double, double):
        ucomisd xmm0, xmm1
        jp      .L8
        mov     eax, 0
        jne     .L8
    .L2:
        ret
    .L8:
        comisd  xmm1, xmm0
        mov     eax, -1
        ja      .L2
        ucomisd xmm0, xmm1
        setbe   al
        add     eax, 1
        ret

https://godbolt.org/z/j1j7G1MYP

    #include <compare>

    auto cmp4way(double a, double b)
    {
        return a <=> b;
    }
[Bug target/103973] x86: 4-way comparison of floats/doubles with spaceship operator possibly suboptimal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103973

--- Comment #4 from Ivan ---
So there is nothing to improve here? That's good to know; I suppose it can be closed then.
[Bug target/104151] New: x86: excessive code generated for 128-bit byteswap
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

            Bug ID: 104151
           Summary: x86: excessive code generated for 128-bit byteswap
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I noticed that gcc generates a redundant sequence of instructions for code that does a 128-bit byteswap implemented with two 64-bit byteswap intrinsics. I narrowed it down to something like this:

    __uint128_t bswap(__uint128_t a)
    {
        std::uint64_t x[2];
        memcpy(x, &a, 16);
        std::uint64_t y[2];
        y[0] = __builtin_bswap64(x[1]);
        y[1] = __builtin_bswap64(x[0]);
        memcpy(&a, y, 16);
        return a;
    }

Produces: https://godbolt.org/z/hEsPqvhv3

    mov     QWORD PTR [rsp-24], rdi
    mov     QWORD PTR [rsp-16], rsi
    movdqa  xmm0, XMMWORD PTR [rsp-24]
    palignr xmm0, xmm0, 8
    movdqa  xmm1, xmm0
    pshufb  xmm1, XMMWORD PTR .LC0[rip]
    movaps  XMMWORD PTR [rsp-24], xmm1
    mov     rax, QWORD PTR [rsp-24]
    mov     rdx, QWORD PTR [rsp-16]
    ret

Expected (alternatively, for SIMD types, a single pshufb; clang can do it):

    mov     rdx, rdi
    mov     rax, rsi
    bswap   rdx
    bswap   rax
    ret