[Bug c++/87105] New: Autovectorization [X86, SSE2, AVX2, DoublePrecision]

2018-08-25 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

Bug ID: 87105
   Summary: Autovectorization [X86, SSE2, AVX2, DoublePrecision]
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
  Target Milestone: ---

GCC is unable to autovectorize the following code. It seems that it doesn't
like min/max, but I'm not entirely sure. I stripped the code from my project,
so it's a bit longer; I hope that's fine. I have also attached the code
compiled by Clang, which is perfectly vectorized and is what I would like to
get from GCC.


The demonstration code
--

#include 
#include 
#include 

// Point structure [x, y]
struct Point {
  double x, y;

  inline Point() noexcept = default;
  constexpr Point(const Point&) noexcept = default;

  constexpr Point(double x, double y) noexcept
: x(x), y(y) {}
};

// Box structure [x0, y0, x1, y1]
struct Box {
  double x0, y0, x1, y1;

  inline void reset(double x0, double y0, double x1, double y1) noexcept {
this->x0 = x0;
this->y0 = y0;
this->x1 = x1;
this->y1 = y1;
  }
};

// Overloads to make vector processing simpler.
static constexpr Point operator-(const Point& a) noexcept
{ return Point(-a.x, -a.y); }

static constexpr Point operator+(const Point& a, double b) noexcept
{ return Point(a.x + b, a.y + b); }
static constexpr Point operator-(const Point& a, double b) noexcept
{ return Point(a.x - b, a.y - b); }
static constexpr Point operator*(const Point& a, double b) noexcept
{ return Point(a.x * b, a.y * b); }
static constexpr Point operator/(const Point& a, double b) noexcept
{ return Point(a.x / b, a.y / b); }

static constexpr Point operator+(const Point& a, const Point& b) noexcept
{ return Point(a.x + b.x, a.y + b.y); }
static constexpr Point operator-(const Point& a, const Point& b) noexcept
{ return Point(a.x - b.x, a.y - b.y); }
static constexpr Point operator*(const Point& a, const Point& b) noexcept
{ return Point(a.x * b.x, a.y * b.y); }
static constexpr Point operator/(const Point& a, const Point& b) noexcept
{ return Point(a.x / b.x, a.y / b.y); }

static constexpr Point operator+(double a, const Point& b) noexcept
{ return Point(a + b.x, a + b.y); }
static constexpr Point operator-(double a, const Point& b) noexcept
{ return Point(a - b.x, a - b.y); }
static constexpr Point operator*(double a, const Point& b) noexcept
{ return Point(a * b.x, a * b.y); }
static constexpr Point operator/(double a, const Point& b) noexcept
{ return Point(a / b.x, a / b.y); }

// Min/Max - different semantics compared to std.
template<typename T> constexpr T myMin(const T& a, const T& b) noexcept
{ return b < a ? b : a; }
template<typename T> constexpr T myMax(const T& a, const T& b) noexcept
{ return a < b ? b : a; }

// Linear interpolation, works with points as well.
template<typename V, typename T>
inline V lerp(const V& a, const V& b, const T& t) noexcept {
  return (a * (1.0 - t)) + (b * t);
}

// Merge a point into a box by possibly increasing its bounds.
inline void boxMergePoint(Box& box, const Point& p) noexcept {
  box.x0 = myMin(box.x0, p.x);
  box.y0 = myMin(box.y0, p.y);
  box.x1 = myMax(box.x1, p.x);
  box.y1 = myMax(box.y1, p.y);
}

void quadBoundingBoxA(const Point bez[3], Box& bBox) noexcept {
  // Bounding box of start and end points.
  bBox.reset(myMin(bez[0].x, bez[2].x), myMin(bez[0].y, bez[2].y),
             myMax(bez[0].x, bez[2].x), myMax(bez[0].y, bez[2].y));

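  // Extremum of the quadratic Bezier per component: solving B'(t) = 0, with
  // B'(t) = 2(1 - t)(P1 - P0) + 2t(P2 - P1), gives
  // t = (P0 - P1) / (P0 - 2*P1 + P2).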
  Point t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);

  t.x = myMax(t.x, 0.0);
  t.y = myMax(t.y, 0.0);
  t.x = myMin(t.x, 1.0);
  t.y = myMin(t.y, 1.0);

  boxMergePoint(bBox, lerp(lerp(bez[0], bez[1], t),
                           lerp(bez[1], bez[2], t), t));
}




GCC Output [-std=c++17 -O3 -mavx2 -fno-math-errno]
--

quadBoundingBoxA(Point const*, Box&):
push    rbp
mov rbp, rsp
and rsp, -32
vmovsd  xmm1, QWORD PTR [rdi+8]
vmovsd  xmm0, QWORD PTR [rdi]
vmovsd  xmm5, QWORD PTR [rdi+40]
vmovsd  xmm6, QWORD PTR [rdi+32]
vmaxsd  xmm13, xmm5, xmm1
vmaxsd  xmm12, xmm6, xmm0
vminsd  xmm5, xmm5, xmm1
vminsd  xmm6, xmm6, xmm0
vunpcklpd   xmm0, xmm12, xmm13
vunpcklpd   xmm1, xmm6, xmm5
vmovups XMMWORD PTR [rsi+16], xmm0
vmovups XMMWORD PTR [rsi], xmm1
vmovsd  xmm2, QWORD PTR [rdi+24]
vmovsd  xmm10, QWORD PTR [rdi+8]
vmovsd  xmm1, QWORD PTR [rdi+40]
vmovsd  xmm7, QWORD PTR [rdi+16]
vaddsd  xmm4, xmm2, xmm2
vsubsd  xmm9, xmm10, xmm2
vmovsd  xmm3, QWORD PTR [rdi]
vmovsd  xmm0, QWORD PTR [r

[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]

2018-08-26 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

--- Comment #4 from Petr  ---
I think this code is vectorizable without -ffast-math. However, it seems that
once a min/max (or something else) is kept scalar, it poisons the rest of the
code.

The following code works perfectly (scalar):

```
#include <algorithm>

template<typename T> constexpr T altMinT(const T& a, const T& b) noexcept
{ return b < a ? b : a; }
template<typename T> constexpr T altMaxT(const T& a, const T& b) noexcept
{ return a < b ? b : a; }

double std_min(double a, double b) { return std::min(a, b); }
double std_max(double a, double b) { return std::max(a, b); }

double alt_min(double a, double b) { return altMinT(a, b); }
double alt_max(double a, double b) { return altMaxT(a, b); }

```

I think that's the main problem - small samples are optimized well, complex
code isn't.
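
For contrast, a sketch (my illustration, not part of the original report) of
the same min applied to two lanes at once, which one would expect to map to a
single minpd:

```
void alt_min2(double* out, const double* a, const double* b) {
  // Two independent lanes of the same ordered-select min; an SLP vectorizer
  // should be able to turn this pair into one packed minpd.
  out[0] = altMinT(a[0], b[0]);
  out[1] = altMinT(a[1], b[1]);
}
```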

[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]

2018-08-26 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

--- Comment #6 from Petr  ---
I think the test-case can even be simplified to something like this:

#include <algorithm>
#include <cmath>

struct Point {
  double x, y;

  void reset(double x, double y) {
this->x = x;
this->y = y;
  }
};

void f1(Point* p, Point* a) {
  p->reset(std::max(std::sqrt(p->x), a->x),
   std::max(std::sqrt(p->y), a->y));
}

GCC is unable to vectorize it:

  [-std=c++17 -O3 -mavx2 -fno-math-errno]
  f1(Point*, Point*):
vsqrtsd xmm0, xmm0, QWORD PTR [rdi+8]
vmovsd  xmm1, QWORD PTR [rsi+8]
vsqrtsd xmm2, xmm2, QWORD PTR [rdi]
vmaxsd  xmm1, xmm1, xmm0
vmovsd  xmm0, QWORD PTR [rsi]
vmaxsd  xmm0, xmm0, xmm2
vunpcklpd   xmm0, xmm0, xmm1
vmovups XMMWORD PTR [rdi], xmm0
ret

whereas clang can:

  [-std=c++17 -O3 -mavx2 -fno-math-errno]
  f1(Point*, Point*):
vsqrtpd xmm0, xmmword ptr [rdi]
vmovupd xmm1, xmmword ptr [rsi]
vmaxpd  xmm0, xmm1, xmm0
vmovupd xmmword ptr [rdi], xmm0
ret

I think this is a much simpler test-case to start with.

[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]

2018-10-26 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

--- Comment #16 from Petr  ---
Thanks a lot! I hope much more code will benefit from this change.

[Bug c++/70708] New: Suboptimal code generated when using _mm_set_sd (X64)

2016-04-17 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70708

Bug ID: 70708
   Summary: Suboptimal code generated when using _mm_set_sd (X64)
   Product: gcc
   Version: 6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
  Target Milestone: ---

The ABI already uses XMM registers for floating point operations. Compare the
following two snippets:

  double MyMinV1(double a, double b) {
return a < b ? a : b;
  }

  double MyMinV2(double a, double b) {
__m128d x = _mm_set_sd(a);
__m128d y = _mm_set_sd(b);
return _mm_cvtsd_f64(_mm_min_sd(x, y));
  }

And the code generated:

  MyMinV1(double, double):
minsd   xmm0, xmm1
ret

  MyMinV2(double, double):
movsd   QWORD PTR [rsp-24], xmm0
movsd   QWORD PTR [rsp-16], xmm1
movsd   xmm0, QWORD PTR [rsp-24]
movsd   xmm1, QWORD PTR [rsp-16]
minsd   xmm0, xmm1
ret

The problem is obvious: the _mm_set_sd() intrinsic really generates movsd even
if the value is already in an XMM register in the right place. I also checked
Clang, and it generates optimal code for both functions.

You can reproduce the test-case here:

 
https://gcc.godbolt.org/#compilers:!((compiler:g6,options:'-O2+-Wall+',source:'%23include+%3Cxmmintrin.h%3E%0A%0Adouble+MyMinV1(double+a,+double+b)+%7B%0A++return+a+%3C+b+%3F+a+:+b%3B%0A%7D%0A%0Adouble+MyMinV2(double+a,+double+b)+%7B%0A++__m128d+x+%3D+_mm_set_sd(a)%3B%0A++__m128d+y+%3D+_mm_set_sd(b)%3B%0A++return+_mm_cvtsd_f64(_mm_min_sd(x,+y))%3B%0A%7D%0A')),filterAsm:(commentOnly:!t,directives:!t,intel:!t,labels:!t),version:3

It looks like all GCC versions are affected.

[Bug target/70708] Suboptimal code generated when using _mm_set_sd (X64)

2016-04-18 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70708

--- Comment #3 from Petr  ---
Is there any workaround, guys?

I was looking for some built-in that would allow me to just cast `double` to
`__m128d` without going through `_mm_set_sd()`, leaving the high part
undefined.
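
The closest I have is a sketch using GCC's vector initializer syntax (the
helper name is mine, and it zeroes the high lane rather than leaving it
undefined, so it's only an approximation of what I want):

  #include <emmintrin.h>

  static inline __m128d double_to_m128d(double v) {
    __m128d x = { v, 0.0 };  // GCC vector-extension initializer; high lane zeroed
    return x;
  }

  double MyMinV3(double a, double b) {
    return _mm_cvtsd_f64(_mm_min_sd(double_to_m128d(a), double_to_m128d(b)));
  }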

[Bug sanitizer/81870] New: -fsanitize=undefined doesn't pay attention to __builtin_assume_aligned()

2017-08-16 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81870

Bug ID: 81870
   Summary: -fsanitize=undefined doesn't pay attention to
__builtin_assume_aligned()
   Product: gcc
   Version: 7.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: sanitizer
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org,
jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at 
gcc dot gnu.org
  Target Milestone: ---

I'm having a problem with GCC's -fsanitize=undefined and the
__builtin_assume_aligned() builtin.

The following code `sanitizer-test.cpp`:

  #include <stdint.h>

  static __attribute((__noinline__)) uint32_t readu32(const void* p) {
    p = __builtin_assume_aligned(p, 1);
    return static_cast<const uint32_t*>(p)[0];
  }

  static __attribute((__noinline__)) void writeu32(void* p, uint32_t x) {
    p = __builtin_assume_aligned(p, 1);
    static_cast<uint32_t*>(p)[0] = x;
  }

  int main(int argc, char* argv[]) {
char buf[] = { 0, 1, 2, 3, 4, 5, 6 };
writeu32(buf + 1, 0x44332211);
uint32_t ret = readu32(buf + 1);
    return static_cast<int>(ret);
  }

Compiled as:

  gcc-7 -fsanitize=undefined sanitizer-test.cpp -o sanitizer-test

Outputs the following when executed:

$ ./sanitizer-test
sanitizer-test.cpp:10:32: runtime error: store to misaligned address
0x7ffd643f6ab6 for type 'uint32_t', which requires 4 byte alignment
0x7ffd643f6ab6: note: pointer points here
 3f 64 fd 00 01 02  03 04 05 06 00 00 00 00  60 b8 a8 09 b3 55 00 00  b1 f2 ab
be 80 7f 00 00  01 00
 ^ 
sanitizer-test.cpp:5:43: runtime error: load of misaligned address
0x7ffd643f6ab6 for type 'const uint32_t', which requires 4 byte alignment
0x7ffd643f6ab6: note: pointer points here
 3f 64 fd 00 11 22  33 44 05 06 00 00 00 00  60 b8 a8 09 b3 55 00 00  b1 f2 ab
be 80 7f 00 00  01 00

I think that in this case the sanitizer should not report the runtime error,
as the pointer was marked as being aligned to 1 byte.

[Bug sanitizer/81870] -fsanitize=undefined doesn't pay attention to __builtin_assume_aligned()

2017-08-16 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81870

--- Comment #2 from Petr  ---
I see, so if I understand it correctly then:

1. `__builtin_assume_aligned()` should be used to promise a higher than
natural alignment, for example 16 bytes for easier auto-vectorization.

2. `__attribute__((aligned(N)))` should be used to relax the alignment of
native types to lower than natural alignment, as sketched below.
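
A minimal sketch of the two mechanisms as I understand them (helper names are
mine):

  #include <stdint.h>

  // (1) Promise that a pointer is MORE aligned than its type requires:
  double sum2(const double* p) {
    const double* q = static_cast<const double*>(__builtin_assume_aligned(p, 16));
    return q[0] + q[1];
  }

  // (2) Declare a type with LOWER than natural alignment:
  typedef uint32_t __attribute__((__aligned__(1))) UnalignedU32;

  uint32_t loadu32(const void* p) {
    return static_cast<const UnalignedU32*>(p)[0];
  }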

It's interesting that with `__builtin_assume_aligned()` I achieved basically
the same effect as with `__attribute__((aligned(N)))`; it's just that the
sanitizer is not happy.

Interestingly, I thought that __builtin_assume_aligned() is basically
equivalent to the `__assume_aligned()` provided by the Intel and MS compilers.

Anyway, thanks for your answer; I need to fix my code a bit.

[Bug c++/79830] New: GCC generates counterproductive code surrounding very simple loops (improvement request)

2017-03-03 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

Bug ID: 79830
   Summary: GCC generates counterproductive code surrounding very
simple loops (improvement request)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
  Target Milestone: ---

It seems that GCC tries very hard to optimize loops, but in my case it's
counterproductive. I have illustrated the problem in the following C++ code and
disassembly.

Loops constructed this way need only one variable (`i`) as a loop counter and
use the sign flag to check whether the loop is done. Typically such a loop
requires a simple check at the beginning (`sub` and `js`) and at the end. The
purpose of such a loop is to save registers and to require only minimal code
surrounding the loop.

However, it seems that GCC tries to convert such a loop into something else,
and requires a lot of operations to do so, resulting in bigger and slower code.
When using `-Os` GCC produces the code I would expect; however, I don't want to
optimize for size globally.

It's not a compiler bug, but I think that in this case this optimization
doesn't make any sense and only adds to the executable/library size. I doubt it
leads to any improvement, and it would be nice if GCC could somehow learn not
to do this for these kinds of loops.

Also, here is a compiler explorer URL, for people wanting to compare:

  https://godbolt.org/g/oeDGmy



Consider the following C++ code
---


#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

void transform(double* dst, const double* src, const double* matrix, size_t
length) {
  __m256d m_00_11 = _mm256_castpd128_pd256(_mm_set_pd(matrix[3], matrix[0]));
  __m256d m_10_01 = _mm256_castpd128_pd256(_mm_set_pd(matrix[1], matrix[2]));
  __m256d m_20_21 = _mm256_broadcast_pd(reinterpret_cast<const __m128d*>(matrix + 4));

  m_00_11 = _mm256_permute2f128_pd(m_00_11, m_00_11, 0);
  m_10_01 = _mm256_permute2f128_pd(m_10_01, m_10_01, 0);

  intptr_t i = static_cast<intptr_t>(length);
  while ((i -= 8) >= 0) {
__m256d s0 = _mm256_loadu_pd(src +  0);
__m256d s1 = _mm256_loadu_pd(src +  4);
__m256d s2 = _mm256_loadu_pd(src +  8);
__m256d s3 = _mm256_loadu_pd(src + 12);

__m256d a0 = _mm256_add_pd(_mm256_mul_pd(s0, m_00_11), m_20_21);
__m256d a1 = _mm256_add_pd(_mm256_mul_pd(s1, m_00_11), m_20_21);
__m256d a2 = _mm256_add_pd(_mm256_mul_pd(s2, m_00_11), m_20_21);
__m256d a3 = _mm256_add_pd(_mm256_mul_pd(s3, m_00_11), m_20_21);

__m256d b0 = _mm256_mul_pd(_mm256_shuffle_pd(s0, s0, 0x1), m_10_01);
__m256d b1 = _mm256_mul_pd(_mm256_shuffle_pd(s1, s1, 0x1), m_10_01);
__m256d b2 = _mm256_mul_pd(_mm256_shuffle_pd(s2, s2, 0x1), m_10_01);
__m256d b3 = _mm256_mul_pd(_mm256_shuffle_pd(s3, s3, 0x1), m_10_01);

_mm256_storeu_pd(dst +  0, _mm256_add_pd(a0, b0));
_mm256_storeu_pd(dst +  4, _mm256_add_pd(a1, b1));
_mm256_storeu_pd(dst +  8, _mm256_add_pd(a2, b2));
_mm256_storeu_pd(dst + 12, _mm256_add_pd(a3, b3));

dst += 16;
src += 16;
  }
  i += 8;

  while ((i -= 2) >= 0) {
__m256d s0 = _mm256_loadu_pd(src);

__m256d a0 = _mm256_add_pd(_mm256_mul_pd(s0, m_00_11), m_20_21);
__m256d b0 = _mm256_mul_pd(_mm256_shuffle_pd(s0, s0, 0x1), m_10_01);

_mm256_storeu_pd(dst, _mm256_add_pd(a0, b0));

dst += 4;
src += 4;
  }

  if (i & 1) {
__m128d s0 = _mm_loadu_pd(src +  0);

__m128d a0 = _mm_add_pd(_mm_mul_pd(s0, _mm256_castpd256_pd128(m_00_11)),
_mm256_castpd256_pd128(m_20_21));
__m128d b0 = _mm_mul_pd(_mm_shuffle_pd(s0, s0, 0x1),
_mm256_castpd256_pd128(m_10_01));

_mm_storeu_pd(dst +  0, _mm_add_pd(a0, b0));
  }
}



Which is compiled to the following
--

(-O2 -mavx -fno-exceptions -fno-tree-vectorize)

See comments describing what I didn't like.

transform(double*, double const*, double const*, unsigned long):
vmovsd  xmm4, QWORD PTR [rdx]
mov r9, rcx
vmovsd  xmm5, QWORD PTR [rdx+16]
sub r9, 8
vmovhpd xmm4, xmm4, QWORD PTR [rdx+24]
vbroadcastf128  ymm6, XMMWORD PTR [rdx+32]
mov r8, rcx
vmovhpd xmm5, xmm5, QWORD PTR [rdx+8]
vperm2f128  ymm4, ymm4, ymm4, 0
vperm2f128  ymm5, ymm5, ymm5, 0
js  .L6

;; <--- Weird
mov rax, r9
sub rcx, 16
mov r8, r9
and rax, -8
mov rdx, rsi
sub rcx, rax
mov rax, rdi
;; <--- Weird
.L5:
vmovupd xmm3, XMMWORD PTR [rdx]
sub r8, 8
sub rax, -128
sub rdx, -128
vinsertf128 ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1

[Bug tree-optimization/79830] GCC generates counterproductive code surrounding very simple loops (improvement request)

2017-03-03 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

--- Comment #2 from Petr  ---
I'm not sure I follow regarding the exit test. I mean, the code should be
correct, as each point has an x and y coordinate, which is two doubles, so
length 8 means 16 doubles (I converted my production code into a simpler form
that uses only native types).

However, I think that the problem is also that if this code were handwritten it
would only use 3 registers (dst, src, and i), but GCC uses:

  rax|rcx|rdx|rsi|rdi|r8|r9

which is a lot, and the same code in 32-bit mode contains one short spill of a
GP register. Basically, if I needed more GP registers inside the function, the
problem would be much bigger (though I have no clue whether GCC would use a
different approach in that case).

[Bug tree-optimization/79830] GCC generates counterproductive code surrounding very simple loops (improvement request)

2017-03-03 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

--- Comment #3 from Petr  ---
Sorry for the misunderstanding; I initially read that you had replaced the exit
condition in the sample code :)

[Bug tree-optimization/79830] GCC generates counterproductive code surrounding very simple loops (improvement request)

2017-03-04 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

--- Comment #4 from Petr  ---
I think the test-case can be simplified to the following code. It still suffers
from the same issues as mentioned above.

#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

void transform(double* dst, const double* src, const double* matrix, size_t
length) {
  intptr_t i = static_cast<intptr_t>(length);
  while ((i -= 2) >= 0) {
__m256d s0 = _mm256_loadu_pd(src);
_mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));

dst += 4;
src += 4;
  }

  if (i & 1) {
__m128d s0 = _mm_loadu_pd(src);
_mm_storeu_pd(dst, _mm_add_pd(s0, s0));
  }
}

[Bug inline-asm/79880] New: Gcc refuses to encode vpgatherdd instruction (x86-64)

2017-03-05 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79880

Bug ID: 79880
   Summary: Gcc refuses to encode vpgatherdd instruction (x86-64)
   Product: gcc
   Version: 7.0.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: inline-asm
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
  Target Milestone: ---

I'm unable to encode the `vpgatherdd xmm, mem, xmm` instruction in inline asm:

void test() {
  __asm(".intel_syntax\n"
  "vpgatherdd xmm4, [r13 + xmm3], xmm4\n"
  ".att_syntax\n");
}

It seems that ICC refuses this construct as well, while clang is fine with it.
I'm not sure whether it's a bug or whether this form of the instruction is
incorrect, but according to the X86 Architecture Reference Manual it's
encodable.
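
As a side note, the same gather can be expressed through the AVX2 intrinsic
instead of inline asm (my illustration; it sidesteps the asm syntax issue
rather than fixing it, and needs #include <immintrin.h> and -mavx2):

void gather_demo(const int* base, __m128i idx, __m128i* out) {
  // vpgatherdd via intrinsic: gathers four dwords from base + idx*4,
  // with all lanes enabled.
  *out = _mm_i32gather_epi32(base, idx, 4);
}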

[Bug inline-asm/79880] Gcc refuses to encode vpgatherdd instruction (x86-64)

2017-03-06 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79880

--- Comment #4 from Petr  ---
In this case, DWORD PTR is redundant; nasm and yasm are fine with the syntax I
posted as well.

It's a simplified test just to show that it won't pass.

Try:

  __asm(".intel_syntax\n"
  "vpgatherdd xmm4, dword ptr [r13 + xmm3], xmm4\n"
  ".att_syntax\n");

It will end up reporting the same error.

[Bug inline-asm/79880] Gcc refuses to encode vpgatherdd instruction (x86-64)

2017-03-06 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79880

--- Comment #6 from Petr  ---
Ok, that's fair enough. I didn't know GCC needs an additional option to switch
to a fully compatible Intel syntax. The code that I posted works fine in clang,
so sorry about that.

And yes, the instruction will #UD, but that was the point, initially :)

[Bug c++/77287] New: Much worse code generated compared to clang (stack alignment and spills)

2016-08-18 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

Bug ID: 77287
   Summary: Much worse code generated compared to clang (stack
alignment and spills)
   Product: gcc
   Version: 6.1.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
  Target Milestone: ---

A simple function (artificial code):

#include <immintrin.h>

int fn(
  const int* px, const int* py,
  const int* pz, const int* pw,
  const int* pa, const int* pb,
  const int* pc, const int* pd) {

  __m256i a0 = _mm256_loadu_si256((__m256i*)px);
  __m256i a1 = _mm256_loadu_si256((__m256i*)py);
  __m256i a2 = _mm256_loadu_si256((__m256i*)pz);
  __m256i a3 = _mm256_loadu_si256((__m256i*)pw);
  __m256i a4 = _mm256_loadu_si256((__m256i*)pa);
  __m256i b0 = _mm256_loadu_si256((__m256i*)pb);
  __m256i b1 = _mm256_loadu_si256((__m256i*)pc);
  __m256i b2 = _mm256_loadu_si256((__m256i*)pd);
  __m256i b3 = _mm256_loadu_si256((__m256i*)pc + 1);
  __m256i b4 = _mm256_loadu_si256((__m256i*)pd + 1);

  __m256i x0 = _mm256_packus_epi16(a0, b0);
  __m256i x1 = _mm256_packus_epi16(a1, b1);
  __m256i x2 = _mm256_packus_epi16(a2, b2);
  __m256i x3 = _mm256_packus_epi16(a3, b3);
  __m256i x4 = _mm256_packus_epi16(a4, b4);

  x0 = _mm256_add_epi16(x0, a0);
  x1 = _mm256_add_epi16(x1, a1);
  x2 = _mm256_add_epi16(x2, a2);
  x3 = _mm256_add_epi16(x3, a3);
  x4 = _mm256_add_epi16(x4, a4);

  x0 = _mm256_sub_epi16(x0, b0);
  x1 = _mm256_sub_epi16(x1, b1);
  x2 = _mm256_sub_epi16(x2, b2);
  x3 = _mm256_sub_epi16(x3, b3);
  x4 = _mm256_sub_epi16(x4, b4);

  x0 = _mm256_packus_epi16(x0, x1);
  x0 = _mm256_packus_epi16(x0, x2);
  x0 = _mm256_packus_epi16(x0, x3);
  x0 = _mm256_packus_epi16(x0, x4);
  return _mm256_extract_epi32(x0, 1);
}

Produces the following asm when compiled by GCC (annotated by me):

  ; GCC 6.1 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
  lea   ecx, [esp+4]  ; Return address
  and   esp, -32  ; Align the stack to 32 bytes
  push  DWORD PTR [ecx-4] ; Push return address
  push  ebp   ; Save frame pointer even if I told GCC not to
  mov   ebp, esp
  push  edi   ; Save GP regs
  push  esi
  push  ebx
  push  ecx
  sub   esp, 296  ; Reserve stack for YMM spills
  mov   eax, DWORD PTR [ecx+16]   ; LOAD 'pa'
  mov   esi, DWORD PTR [ecx+4]; LOAD 'py'
  mov   edi, DWORD PTR [ecx]  ; LOAD 'px'
  mov   ebx, DWORD PTR [ecx+8]; LOAD 'pz'
  mov   edx, DWORD PTR [ecx+12]   ; LOAD 'pw'
  mov   DWORD PTR [ebp-120], eax  ; SPILL 'pa'
  mov   eax, DWORD PTR [ecx+20]   ; LOAD 'pb'
  mov   DWORD PTR [ebp-152], eax  ; SPILL 'pb'
  mov   eax, DWORD PTR [ecx+24]   ; LOAD 'pc'
  vmovdqu   ymm4, YMMWORD PTR [esi]
  mov   ecx, DWORD PTR [ecx+28]   ; LOAD 'pd'
  vmovdqu   ymm7, YMMWORD PTR [edi]
  vmovdqa   YMMWORD PTR [ebp-56], ymm4; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [ebx]
  mov   ebx, DWORD PTR [ebp-152]  ; LOAD 'pb'
  vmovdqa   YMMWORD PTR [ebp-88], ymm4; SPILL VEC
  vmovdqu   ymm4, YMMWORD PTR [edx]
  mov   edx, DWORD PTR [ebp-120]  ; LOAD 'pa'
  vmovdqu   ymm6, YMMWORD PTR [edx]
  vmovdqa   YMMWORD PTR [ebp-120], ymm6   ; SPILL VEC
  vmovdqu   ymm0, YMMWORD PTR [ecx]
  vmovdqu   ymm6, YMMWORD PTR [ebx]
  vmovdqa   ymm5, ymm0; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [eax+32]
  vmovdqu   ymm2, YMMWORD PTR [eax]
  vmovdqa   ymm1, ymm0; Why move anything when using AVX?
  vmovdqu   ymm0, YMMWORD PTR [ecx+32]
  vmovdqa   YMMWORD PTR [ebp-152], ymm2
  vmovdqa   ymm3, ymm0; Why move anything when using AVX?
  vpackuswb ymm0, ymm7, ymm6
  vmovdqa   YMMWORD PTR [ebp-184], ymm5   ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-248], ymm3   ; SPILL VEC
  vmovdqa   YMMWORD PTR [ebp-280], ymm0   ; SPILL VEC
  vmovdqa   ymm0, YMMWORD PTR [ebp-56]; ALLOC VEC
  vmovdqa   YMMWORD PTR [ebp-216], ymm1   ; SPILL VEC
  vpackuswb ymm2, ymm0, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-88]; ALLOC VEC
  vpackuswb ymm1, ymm4, YMMWORD PTR [ebp-216] ; Uses SPILL slot
  vpackuswb ymm5, ymm0, YMMWORD PTR [ebp-184] ; Uses SPILL slot
  vmovdqa   ymm0, YMMWORD PTR [ebp-120]   ; ALLOC VEC
  vpaddw    ymm2, ymm2, YMMWORD PTR [ebp-56]  ; Uses SPILL slot
  vpsubw    ymm2, ymm2, YMMWORD PTR [ebp-152] ; Uses SPILL slot
  vpackuswb ymm3, ymm0, YMMWORD PTR [ebp-248] ; Uses 

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2016-08-18 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #2 from Petr  ---
With '-mtune=intel' the push/pop sequence is gone, but the YMM register
management remains the same - 24 more memory accesses than clang.

[Bug rtl-optimization/77287] Much worse code generated compared to clang (stack alignment and spills)

2016-08-20 Thread kobalicek.petr at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #4 from Petr  ---
Adding -fschedule-insns is definitely a huge improvement in this case. I wonder
why this doesn't happen by default at -O2 and -Os, as it really improves things
and produces shorter output - or is it just this particular case?

Here is the assembly produced by gcc with -fschedule-insns:
  push  ebp
  mov   ebp, esp
  and   esp, -32
  lea   esp, [esp-32]
  mov   ecx, DWORD PTR [ebp+8]
  mov   edx, DWORD PTR [ebp+32]
  mov   eax, DWORD PTR [ebp+36]
  vmovdqu   ymm5, YMMWORD PTR [ecx]
  mov   ecx, DWORD PTR [ebp+12]
  vmovdqu   ymm3, YMMWORD PTR [edx]
  vmovdqu   ymm6, YMMWORD PTR [eax]
  vmovdqu   ymm2, YMMWORD PTR [ecx]
  mov   ecx, DWORD PTR [ebp+28]
  vpackuswb ymm7, ymm2, ymm3
  vpaddw    ymm7, ymm7, ymm2
  vpsubw    ymm7, ymm7, ymm3
  vmovdqu   ymm4, YMMWORD PTR [ecx]
  mov   ecx, DWORD PTR [ebp+16]
  vpackuswb ymm0, ymm5, ymm4
  vpaddw    ymm0, ymm0, ymm5
  vpsubw    ymm0, ymm0, ymm4
  vmovdqu   ymm1, YMMWORD PTR [ecx]
  vpackuswb ymm0, ymm0, ymm7
  mov   ecx, DWORD PTR [ebp+20]
  vpackuswb ymm2, ymm1, ymm6
  vmovdqu   ymm4, YMMWORD PTR [edx+32]
  vpaddw    ymm1, ymm2, ymm1
  mov   edx, DWORD PTR [ebp+24]
  vpsubw    ymm1, ymm1, ymm6
  vmovdqu   ymm5, YMMWORD PTR [ecx]
  vpackuswb ymm0, ymm0, ymm1
  vpackuswb ymm3, ymm5, ymm4
  vmovdqa   YMMWORD PTR [esp], ymm3
  vmovdqu   ymm2, YMMWORD PTR [eax+32]; LOOK HERE
  vpaddw    ymm5, ymm5, YMMWORD PTR [esp]
  vmovdqu   ymm3, YMMWORD PTR [edx]   ; AND HERE
  vpsubw    ymm4, ymm5, ymm4
  vpackuswb ymm7, ymm3, ymm2
  vpackuswb ymm0, ymm0, ymm4
  vpaddw    ymm3, ymm7, ymm3
  vpsubw    ymm2, ymm3, ymm2
  vpackuswb ymm2, ymm0, ymm2
  vpextrd   eax, xmm2, 1
  vzeroupper
  leave
  ret

Which is pretty close to clang already, however, look at this part:

  vmovdqa   YMMWORD PTR [esp], ymm3   ; Spill YMM3
  vmovdqu   ymm2, YMMWORD PTR [eax+32]
  vpaddw    ymm5, ymm5, YMMWORD PTR [esp] ; Mem instead of YMM3?
  vmovdqu   ymm3, YMMWORD PTR [edx]   ; Old YMM3 becomes dead here

The spill is completely unnecessary in our case, and it's the only reason why
the prolog/epilog requires code to perform dynamic stack alignment. I mean, if
this one thing were eliminated, GCC would basically generate code comparable to
clang.

But thanks for -fschedule-insns hint, I didn't know about it.

[Bug c++/103699] New: Reading or writing unaligned integers is wrongly optimized (GCC-11 and up)

2021-12-13 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

Bug ID: 103699
   Summary: Reading or writing unaligned integers is wrongly
optimized (GCC-11 and up)
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
  Target Milestone: ---

I have found a strange issue. When I use __attribute__((aligned(1))) on a type
to essentially annotate its lower alignment for memory load/store purposes, the
compiler optimizes such loads/stores out without any warnings. I have never had
a problem with this in GCC or Clang; only GCC 11+ generates wrong code for me.

This is best illustrated in the code below [compile with -O2 -std=c++17]:

#include <stdint.h>

typedef uint32_t __attribute__((__aligned__(1))) UnalignedUInt32;
typedef uint64_t __attribute__((__aligned__(1))) UnalignedUInt64;

uint32_t byteswap32(uint32_t x) noexcept {
  return (x << 24) | (x >> 24) | ((x << 8) & 0x00FF0000u) | ((x >> 8) & 0x0000FF00u);
}

uint64_t byteswap64(uint64_t x) noexcept {
  return ((x << 56) & 0xff00000000000000u) |
         ((x << 40) & 0x00ff000000000000u) |
         ((x << 24) & 0x0000ff0000000000u) |
         ((x <<  8) & 0x000000ff00000000u) |
         ((x >>  8) & 0x00000000ff000000u) |
         ((x >> 24) & 0x0000000000ff0000u) |
         ((x >> 40) & 0x000000000000ff00u) |
         ((x >> 56) & 0x00000000000000ffu);
}

static inline void writeU64be(void* p, uint64_t val) {
  static_cast<UnalignedUInt64*>(p)[0] = byteswap64(val);
}

static inline uint32_t readU32be(const void* p) noexcept {
  uint32_t x = static_cast<const UnalignedUInt32*>(p)[0];
  return byteswap32(x);
}

// Returns 0xBB
uint32_t test_1() {
uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return array[7];
}

// Returns 0xCC
uint32_t test_2() {
uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return array[8];
}

// Returns 0xDD
uint32_t test_3() {
uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return array[9];
}

// Returns 0xEE
uint32_t test_4() {
uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return array[10];
}

// Returns 0708090A - the write has no effect when read with readU32be()
uint32_t test_u32() {
uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return readU32be(array + 7);
}

So I'm wondering: is this correct behavior?

It seems like a bug in the optimizer to me, because when the code is dynamic
(the data is not constant) it works as expected. I found it via a failing unit
test (GCC 11 is the only compiler that fails).

Compiler Explorer: https://godbolt.org/z/9G9cx83oq

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-13 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #2 from Petr  ---
If you compile this with clang, the function test_u32() will correctly return
the expected 0xBBCCDDEE and not 0x0708090A.

If you compile with an older GCC, like GCC 10, the test also returns
0xBBCCDDEE. Only GCC 11 and upward return 0x0708090A.

Based on the documentation here:

https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes

It's stated that "When used as part of a typedef, the aligned attribute can
both increase and decrease alignment, and specifying the packed attribute
generates a warning." It explicitly allows creating a type with lower
alignment, so I don't consider this undefined behavior; in general, UBSAN is
fine with this construct.

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-13 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #3 from Petr  ---
BTW, this almost seems like an optimizer bug, because if you compile the code
with GCC 11 without optimizations (or with -O1), it also returns the expected
value; only optimized compilation with GCC 11 returns the wrong one.

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-13 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #4 from Petr  ---
Additional test case:

#include <stdint.h>
#include <stdio.h>

typedef uint32_t __attribute__((__aligned__(1))) UnalignedUInt32;
typedef uint64_t __attribute__((__aligned__(1))) UnalignedUInt64;

uint32_t byteswap32(uint32_t x) noexcept {
  return (x << 24) | (x >> 24) | ((x << 8) & 0x00FF0000u) | ((x >> 8) & 0x0000FF00u);
}

uint64_t byteswap64(uint64_t x) noexcept {
  return ((x << 56) & 0xff00000000000000u) |
         ((x << 40) & 0x00ff000000000000u) |
         ((x << 24) & 0x0000ff0000000000u) |
         ((x <<  8) & 0x000000ff00000000u) |
         ((x >>  8) & 0x00000000ff000000u) |
         ((x >> 24) & 0x0000000000ff0000u) |
         ((x >> 40) & 0x000000000000ff00u) |
         ((x >> 56) & 0x00000000000000ffu);
}

static inline void writeU64be(void* p, uint64_t val) {
  static_cast<UnalignedUInt64*>(p)[0] = byteswap64(val);
}

static inline uint32_t readU32be(const void* p) noexcept {
  uint32_t x = static_cast<const UnalignedUInt32*>(p)[0];
  return byteswap32(x);
}

// Returns 0708090A
uint32_t test_u32() {
uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return readU32be(array + 7);
}

static uint8_t array_static[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};

int main() {
printf("%08X\n", test_u32());

writeU64be(array_static + 6, 0xAABBCCDDEEFF1213);
printf("%08X\n", readU32be(array_static + 7));
return 0;
}

It prints:
0708090A
BBCCDDEE

Clang prints:
BBCCDDEE
BBCCDDEE

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-13 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #6 from Petr  ---
For now I have disabled unaligned load/store optimizations in my projects when
dealing with GCC 11 and upwards.

I still think that GCC is wrong in this case regardless of strict aliasing. The
code in test_u32() is essentially creating a constant, and GCC 11+ is the only
compiler returning it incorrectly, and also inconsistently between optimization
levels.

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-13 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #8 from Petr  ---
My only problem is that A returns a different value compared to B, C, and D:

uint32_t test_u32_a() {
char array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return readU32be(array + 7);
}

uint32_t test_u32_b() {
static char array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return readU32be(array + 7);
}

uint32_t test_u32_c() {
thread_local char array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return readU32be(array + 7);
}

uint32_t test_u32_d(char array[16]) {
writeU64be(array + 6, 0xAABBCCDDEEFF1213);
return readU32be(array + 7);
}

And when you compile this, you can actually see that ALL functions evaluate to
a constant (because the output is known at compile time), but only in case A is
the constant different (of course, because B, C, and D have side effects):

test_u32_a():
mov eax, 117967114
ret
test_u32_b():
movabs  rax, 1374442237689904042
mov QWORD PTR test_u32_b()::array[rip+6], rax
mov eax, -1144201746
ret
test_u32_c():
movabs  rax, 1374442237689904042
mov QWORD PTR fs:test_u32_c()::array@tpoff+6, rax
mov eax, -1144201746
ret
test_u32_d(char*):
movabs  rax, 1374442237689904042
mov QWORD PTR [rdi+6], rax
mov eax, -1144201746
ret

So yeah, we can talk about breaking strict aliasing here, but it's just
inconsistent. I would expect all functions to return the same value.

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-14 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #10 from Petr  ---
Well, the problem is that when you compile it with "-fsanitize=undefined", it
won't report any undefined behavior, and the function returns the expected
value.

I even tried to make everything constexpr - and constexpr by definition should
never involve undefined behavior, right? But GCC 11 compiles the code with
constexpr, doesn't complain about undefined behavior, and, again, returns the
wrong value.

What I miss here, from a user perspective, is some kind of diagnostics. If you
remove code that provably has an effect on the code following it, why not
complain, at least in the constexpr case?

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-14 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #12 from Petr  ---
Is there a way to diagnose this? To tell GCC to report transformations that
basically cause wrong results to be returned?

In my code base, I have unaligned memory loads/stores abstracted, so I can
implement whatever compiler-specific construct I need to make such accesses
optimized or friendly to the compiler - and I didn't invent the method I have
shown here. I would even say that I only see this error in unit tests, which
can basically be constant-evaluated as shown in my previous examples.

But it was kinda surprising that GCC 11+ is the only compiler that started
failing my tests, and that basically no analyzer complains about this (not even
clang static analysis, for example).

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-14 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #15 from Petr  ---
Unfortunately, GCC doesn't report any issues even with `-Wstrict-aliasing=1`.

BTW, now I know I must use the may_alias attribute, and this is what I'm going
to do; however, from a user perspective I'm not really happy, as GCC does
something very silently. Personally, I would be happier if my code didn't
compile at all than have it compile with bugs.
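
For the record, a sketch of the typedef I ended up with (assuming this
combination is what the may_alias suggestion means):

typedef uint32_t __attribute__((__may_alias__, __aligned__(1))) UnalignedAliasUInt32;

static inline uint32_t readU32(const void* p) noexcept {
  return static_cast<const UnalignedAliasUInt32*>(p)[0];
}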

[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)

2021-12-14 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #17 from Petr  ---
Guys, thanks a lot for your feedback.

Is the may_alias annotation guaranteed to behave as expected in future versions
of GCC too, or is it just too much UB and it's better to do unaligned reads
with memcpy?

What I like about my existing solution is that I can specify the alignment, so
for example I can say that this 64-bit load is 4-byte aligned, and that hints
the compiler; I'm not sure how to give that hint with memcpy. So far,
annotating the code with may_alias works for me and passes the tests, but I'm
kinda unsure whether it is a good way to express unaligned memory access
considering the issue I have faced.
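
For reference, a sketch of the memcpy route with an alignment hint (the helper
name is mine):

#include <string.h>
#include <stdint.h>

static inline uint64_t loadU64_align4(const void* p) noexcept {
  p = __builtin_assume_aligned(p, 4);  // promise: at least 4-byte aligned
  uint64_t x;
  memcpy(&x, p, sizeof(x));            // aliasing-safe unaligned access
  return x;
}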

[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)

2021-08-24 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #6 from Petr  ---
Yes, the code is not really doing anything useful; I only wrote it to
demonstrate the spills problem. Clang actually outsmarted me by removing half
of the code :)

I think this issue can be closed; I cannot reproduce it with the newest GCC.

[Bug c++/116738] New: Constant folding of _mm_min_ss and _mm_max_ss is wrong

2024-09-16 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116738

Bug ID: 116738
   Summary: Constant folding of _mm_min_ss and _mm_max_ss is wrong
   Product: gcc
   Version: 14.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: kobalicek.petr at gmail dot com
  Target Milestone: ---

GCC incorrectly optimizes x86 intrinsics, which have defined behavior at the
ISA level. It seems the problem happens when a value is known at compile time;
constant folding then uses a different operation than the CPU does when
executing the instruction.

Here is the definition of [V]MINSS:

```
MIN(SRC1, SRC2)
{
IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
ELSE IF (SRC1 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC2 = NaN) THEN DEST := SRC2; FI;
ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
ELSE DEST := SRC2;
FI;
}
```
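
In scalar C++ terms, the selection rule above is exactly this (a sketch of
mine, mirroring the quoted pseudocode):

```
static inline float minss_model(float src1, float src2) {
  // Any NaN operand, and also the (-0.0, +0.0) pair, makes the ordered '<'
  // comparison false, so all those cases select SRC2 - as in the pseudocode.
  return src1 < src2 ? src1 : src2;
}
```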

So, it's clear that SRC1 is only selected when the ordered comparison
`SRC1 < SRC2` is true. However, GCC doesn't seem to respect this detail. Here
is a test case that I was able to craft:

```
// Demonstration of a GCC bug in constant folding of SIMD intrinsics:
#include <immintrin.h>
#include <limits>
#include <stdint.h>
#include <stdio.h>

float clamp(float f) {
__m128 v = _mm_set_ss(f);
__m128 zero = _mm_setzero_ps();
    __m128 greatest = _mm_set_ss(std::numeric_limits<float>::max());

v = _mm_min_ss(v, greatest);
v = _mm_max_ss(v, zero);

return _mm_cvtss_f32(v);
}

int main() {
printf("clamp(-0) -> %f\n", clamp(-0.0f));
printf("clamp(nan) -> %f\n",
clamp(std::numeric_limits::quiet_NaN()));
return 0;
}
```

GCC results (wrong):

clamp(-0) -> -0.000000
clamp(nan) -> nan

Clang results (expected):

clamp(-0) -> 0.000000
clamp(nan) -> 340282346638528859811704183484516925440.000000

Here is a compiler explorer link:

- https://godbolt.org/z/6afjoaj86

I'm aware this is a possible duplicate of an [UNCONFIRMED] bug:

- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99497

However, fast-math is mentioned in that bug report, and I'm not interested in
fast-math at all; I'm not using that option.

This bug makes it impossible to create test cases for the implementation of
some of my optimized functions, as the tests fail with GCC while other
compilers produce correct results.

A possible workaround is to use the _ps variants of the intrinsics instead of
_ss, but that's also something I would like to avoid, as in some cases I really
work with a scalar value only.
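
For completeness, a sketch of that _ps workaround (my illustration):

```
float clamp_ps(float f) {
    __m128 v = _mm_set_ss(f);
    v = _mm_min_ps(v, _mm_set_ss(std::numeric_limits<float>::max()));
    v = _mm_max_ps(v, _mm_setzero_ps());
    return _mm_cvtss_f32(v);
}
```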

Also, interestingly, when the code is compiled by GCC in debug mode (without
optimizations), GCC behaves correctly, so this bug is related to the
optimization pipeline as well.

I'm not aware of any UB in this test case.

[Bug target/116738] Constant folding of _mm_min_ss and _mm_max_ss is wrong

2024-09-17 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116738

--- Comment #3 from Petr  ---
Maybe marking it as confirmed would be appropriate then?

I think as a workaround it would be better to not constant fold code that GCC
cannot compute properly - that would mean properly calculating the values at
runtime. I have no idea what else this would impact though.

[Bug target/116738] Constant folding of _mm_min_ss and _mm_max_ss is wrong

2024-09-17 Thread kobalicek.petr at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116738

--- Comment #7 from Petr  ---
The simplified test case looks good except for a missing return :)