[Bug c++/71420] New: "‘type’ is not a class type" error for address-of operator overloaded for enum type

2016-06-05 Thread nekotekina at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71420

Bug ID: 71420
   Summary: "‘type’ is not a class type" error for address-of
operator overloaded for enum type
   Product: gcc
   Version: 5.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nekotekina at gmail dot com
  Target Milestone: ---

I was not using the GCC compiler directly (this happened in a Travis CI build), so I
apologize if the information is misleading or inaccurate.

Code example (C++11):

#include <cstdlib>

namespace itype
{
enum type
{
unk = 0,

add,
sub,
//...
};

// Address-of operator for decoder<>
constexpr type operator &(type value)
{
return value;
}
}

// Not directly relevant; only shows the intent of the template below
struct interpreter
{
static void unk() { std::abort(); }

static void add();
static void sub();
//...
};

template <typename D>
struct decoder
{
// Implementation omitted; marginally relevant usage example:
static constexpr auto add = &D::add;
static constexpr auto sub = &D::sub;
};

decoder<interpreter> test1; // OK
decoder<itype::type> test2; // OK in Clang and MSVC

// GCC rejects test2 with: ‘itype::type’ is not a class type

int main()
{
return 0;
}
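
If it helps, the issue seems to reduce to this minimal form (my own reduction,
not part of the original report):

enum E { e };
constexpr E operator &(E v) { return v; }

// GCC rejects the next line with "‘E’ is not a class type", while Clang and
// MSVC resolve the unary & to the overload above and accept it.
constexpr E p = &E::e;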

[Bug tree-optimization/103897] New: x86: Missing optimizations with _mm_undefined_si128 and PMOVSX*

2022-01-03 Thread nekotekina at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103897

Bug ID: 103897
   Summary: x86: Missing optimizations with _mm_undefined_si128
and PMOVSX*
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I was trying to use VPMOVSXWD and other PMOVSX* intrinsics, and also to
emulate them for SSE2 targets. I noticed (at least) two distinct problems:
1) (V)PMOVSX* can take a memory operand, but GCC emits a separate load instruction.
2) _mm_undefined_si128() always generates an additional zeroing instruction.
Using it in combination with unpack and arithmetic-shift instructions looks
like an optimal way to emulate PMOVSX for an SSE2 target.

Godbolt example includes clang output for comparison.

https://godbolt.org/z/KE8q9v6qG

#include <emmintrin.h>
#include <smmintrin.h>

__attribute__((__target__("avx"))) void test0(__m128i* dst, __m128i* src)
{
// Emits VPMOVSXWD: the load from memory could be folded into it, yet 2 instructions are emitted
// It looks like GCC 8.5 was doing better here
*dst = _mm_cvtepi16_epi32(*src);
}
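
// For reference, the output I would expect for test0 is roughly this
// (my own sketch, not taken from any compiler's actual output):
//   vpmovsxwd xmm0, QWORD PTR [rsi]
//   vmovdqa   XMMWORD PTR [rdi], xmm0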

void test1(__m128i* dst, __m128i* src)
{
// Emulates VPMOVSXWD: a register is zeroed specifically for _mm_undefined_si128
*dst = _mm_srai_epi32(_mm_unpacklo_epi16(_mm_undefined_si128(), *src), 16);
}
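
// For test1, a sketch of what I would expect (my own guess; reusing the source
// operand for both inputs of PUNPCKLWD is fine since the other half of the
// result is shifted out by PSRAD anyway):
//   movdqa    xmm0, XMMWORD PTR [rsi]
//   punpcklwd xmm0, xmm0
//   psrad     xmm0, 16
//   movaps    XMMWORD PTR [rdi], xmm0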

void test2(__m128i* dst, __m128i* src)
{
// Zeroes a register even though the PSLLW result could simply be reused
*dst = _mm_srai_epi32(_mm_unpacklo_epi16(_mm_undefined_si128(),
_mm_slli_epi16(*src, 1)), 16);
}

void test3(__m128i* dst, __m128i* src)
{
// Similar to test1, but emulate "high" VPMOVSXWD
*dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), *src), 16);
}

__attribute__((__target__("avx"))) void test4(__m128i* dst, __m128i* src)
{
// Bonus (not sure what the idiomatic way to MOVSX the high part is)
*dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(), *src), 16);
}

__attribute__((__target__("avx"))) void test5(__m128i* dst, __m128i* src)
{
// Emits two zeroing instructions
*dst = _mm_srai_epi32(_mm_unpackhi_epi16(_mm_undefined_si128(),
_mm_packs_epi16(_mm_undefined_si128(), *src)), 16);
}

[Bug c++/103932] New: x86: strange unoptimized code generated (multiple negations of _mm_testz_si128 result)

2022-01-06 Thread nekotekina at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103932

Bug ID: 103932
   Summary: x86: strange unoptimized code generated (multiple
negations of _mm_testz_si128 result)
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nekotekina at gmail dot com
  Target Milestone: ---

GCC generates a seemingly unoptimized sequence of instructions in certain cases
(I can't tell exactly what triggers it; example code is below):

xor     eax, eax
vptest  xmm0, xmm0
sete    al
test    eax, eax
sete    al
movzx   eax, al

Instead, it should be something like this:
xor eax, eax
vptest xmm0, xmm0
setne al


https://godbolt.org/z/sTaG65Ksc
Code (-O3 -std=c++20 -march=skylake):

#include <bit>
#include <concepts>
#include <cstdint>
#include <immintrin.h>

template <typename T>
concept Vector128 = (sizeof(T) == 16);

using u64 = std::uint64_t;
using u32 = std::uint32_t;

union alignas(16) v128
{
u64 _u64[2];

v128() = default;

constexpr v128(const v128&) noexcept = default;

template <Vector128 T>
constexpr v128(const T& rhs) noexcept
: v128(std::bit_cast<v128>(rhs))
{
}

constexpr v128& operator=(const v128&) noexcept = default;

template <Vector128 T>
constexpr operator T() const noexcept
{
return std::bit_cast<T>(*this);
}
};

// Test if vector is zero
inline bool gv_testz(const v128& arg)
{
#if defined(__SSE4_1__)
return _mm_testz_si128(arg, arg);
#else
return !(arg._u64[0] | arg._u64[1]);
#endif
}

struct alignas(16) context_t
{
v128 vec[32];
v128 sat;
};

void test1(context_t& ctx, u32 n)
{
const u64 bit = !gv_testz(ctx.sat);
v128 r;
r._u64[0] = 0;
r._u64[1] = bit;
ctx.vec[n] = r;
}

void test2(context_t& ctx, u32 n)
{
ctx.vec[n]._u64[1] = !gv_testz(ctx.sat);
}

[Bug target/103967] New: x86-64: bitfields make inefficient indexing for array with 16 byte+ objects

2022-01-10 Thread nekotekina at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103967

Bug ID: 103967
   Summary: x86-64: bitfields make inefficient indexing for array
with 16 byte+ objects
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, this problem is seemingly not specific to GCC and is probably well
known. Loading or storing a 16-byte (or larger) vector from an array, using a
bitfield as the index, generates code that could in theory be noticeably smaller.

shr  esi, 12 ; shift bitfield
and  esi, 31 ; mask bitfield
sal  rsi, 4 ; unnecessary; could also drop the REX prefix for size
pxor xmm0, XMMWORD PTR [rsi+1024+rdi] ; index + offset addressing

1) The second shift can be fused with the bitfield load.
2) The bitfield load can then be adjusted for scaled indexing (rsi*8).
3) Optionally, the array offset can be precomputed if it's used twice or more,
which can result in smaller and potentially faster code.

shr esi, 12 - 1 ; adjusted shift
and esi, 31 << 1 ; adjusted mask which fits in 8-bit immediate
pxor xmm0, [rdi + rsi * 8] ; precomputed array offset

https://godbolt.org/z/7aa7oaMhn

#include <emmintrin.h>
struct bitfields
{
unsigned dummy : 7;
unsigned a : 5;
unsigned b : 5;
unsigned c : 5;
};
struct context
{
unsigned dummy[256];
__m128i data[32];
};

void xor_data(context& ctx, bitfields op)
{
ctx.data[op.c] = _mm_xor_si128(ctx.data[op.a], ctx.data[op.b]);
}
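
For reference, a source-level sketch of the suggested addressing (my own
illustration, not from the original report; it assumes the struct layout above
and reinterprets the bitfield word as a plain integer):

#include <cstring>

__m128i load_b(const context& ctx, bitfields op)
{
    unsigned raw;
    std::memcpy(&raw, &op, sizeof(raw)); // view the bitfield word as an integer
    // Field b occupies bits [12..17). Shifting by 11 instead of 12 and masking
    // with (31 << 1) yields an index pre-scaled by 2, so [base + idx2 * 8]
    // addresses the 16-byte elements directly.
    unsigned idx2 = (raw >> 11) & (31u << 1);
    return *reinterpret_cast<const __m128i*>(
        reinterpret_cast<const char*>(ctx.data) + idx2 * 8);
}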

[Bug target/103973] New: x86: 4-way comparison of floats/doubles with spaceship operator possibly suboptimal

2022-01-10 Thread nekotekina at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103973

Bug ID: 103973
   Summary: x86: 4-way comparison of floats/doubles with spaceship
operator possibly suboptimal
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I may be missing something here, but the generated code seems strange and
suboptimal. It looks like all 4 possible paths could use the flags from a single
UCOMISD instruction, instead of calling it up to 3 times in the worst case.

cmp4way(double, double):
ucomisd xmm0, xmm1
jp  .L8
mov eax, 0
jne .L8
.L2:
ret
.L8:
comisd  xmm1, xmm0
mov eax, -1
ja  .L2
ucomisd xmm0, xmm1
setbe   al
add eax, 1
ret

https://godbolt.org/z/j1j7G1MYP

#include <compare>

auto cmp4way(double a, double b)
{
return a <=> b;
}
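
For what it's worth, a branchy source-level equivalent that spells out the four
outcomes (my own sketch; the comments note how the flags of a single
"UCOMISD a, b" distinguish all cases):

std::partial_ordering cmp4way_manual(double a, double b)
{
    if (a < b)  return std::partial_ordering::less;        // CF=1, ZF=0, PF=0
    if (a > b)  return std::partial_ordering::greater;     // CF=0, ZF=0, PF=0
    if (a == b) return std::partial_ordering::equivalent;  // CF=0, ZF=1, PF=0
    return std::partial_ordering::unordered;               // CF=1, ZF=1, PF=1
}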

[Bug target/103973] x86: 4-way comparison of floats/doubles with spaceship operator possibly suboptimal

2022-01-10 Thread nekotekina at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103973

--- Comment #4 from Ivan  ---
So there is nothing to improve here? That's good to know; I suppose it can be
closed then.

[Bug target/104151] New: x86: excessive code generated for 128-bit byteswap

2022-01-20 Thread nekotekina at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104151

Bug ID: 104151
   Summary: x86: excessive code generated for 128-bit byteswap
   Product: gcc
   Version: 12.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nekotekina at gmail dot com
  Target Milestone: ---

Hello, I noticed that GCC generates a redundant sequence of instructions for
code that does a 128-bit byteswap implemented with two 64-bit byteswap
intrinsics. I narrowed it down to something like this:

#include <cstdint>
#include <cstring>

__uint128_t bswap(__uint128_t a)
{
std::uint64_t x[2];
memcpy(x, &a, 16);
std::uint64_t y[2];
y[0] = __builtin_bswap64(x[1]);
y[1] = __builtin_bswap64(x[0]);
memcpy(&a, y, 16);
return a;
}

Produces:
https://godbolt.org/z/hEsPqvhv3

mov QWORD PTR [rsp-24], rdi
mov QWORD PTR [rsp-16], rsi
movdqa  xmm0, XMMWORD PTR [rsp-24]
palignr xmm0, xmm0, 8
movdqa  xmm1, xmm0
pshufb  xmm1, XMMWORD PTR .LC0[rip]
movaps  XMMWORD PTR [rsp-24], xmm1
mov rax, QWORD PTR [rsp-24]
mov rdx, QWORD PTR [rsp-16]
ret

Expected (alternatively, for SIMD types a single PSHUFB suffices and clang can
do it; see the sketch after the listing):

mov rdx, rdi
mov rax, rsi
bswap   rdx
bswap   rax
ret
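
For reference, the single-PSHUFB form for SIMD types might look like this (my
own sketch, assuming SSSE3; the mask simply reverses all 16 bytes):

#include <tmmintrin.h>

__m128i bswap128(__m128i a)
{
    // Byte i of the mask selects source byte (15 - i), reversing the vector.
    const __m128i mask = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                      8, 9, 10, 11, 12, 13, 14, 15);
    return _mm_shuffle_epi8(a, mask);
}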