[Bug c++/87105] New: Autovectorization [X86, SSE2, AVX2, DoublePrecision]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

Bug ID: 87105
Summary: Autovectorization [X86, SSE2, AVX2, DoublePrecision]
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

GCC is unable to autovectorize the following code. It seems that it doesn't like min/max, but I'm not entirely sure. I stripped the code out of my project, so it's a bit longer; I hope that's fine. I also attached the code compiled by clang, which is perfectly vectorized and is what I would like to get from GCC.

The demonstration code
--

// Point structure [x, y]
struct Point {
  double x, y;

  inline Point() noexcept = default;
  constexpr Point(const Point&) noexcept = default;
  constexpr Point(double x, double y) noexcept : x(x), y(y) {}
};

// Box structure [x0, y0, x1, y1]
struct Box {
  double x0, y0, x1, y1;

  inline void reset(double x0, double y0, double x1, double y1) noexcept {
    this->x0 = x0;
    this->y0 = y0;
    this->x1 = x1;
    this->y1 = y1;
  }
};

// Overloads to make vector processing simpler.
static constexpr Point operator-(const Point& a) noexcept { return Point(-a.x, -a.y); }

static constexpr Point operator+(const Point& a, double b) noexcept { return Point(a.x + b, a.y + b); }
static constexpr Point operator-(const Point& a, double b) noexcept { return Point(a.x - b, a.y - b); }
static constexpr Point operator*(const Point& a, double b) noexcept { return Point(a.x * b, a.y * b); }
static constexpr Point operator/(const Point& a, double b) noexcept { return Point(a.x / b, a.y / b); }

static constexpr Point operator+(const Point& a, const Point& b) noexcept { return Point(a.x + b.x, a.y + b.y); }
static constexpr Point operator-(const Point& a, const Point& b) noexcept { return Point(a.x - b.x, a.y - b.y); }
static constexpr Point operator*(const Point& a, const Point& b) noexcept { return Point(a.x * b.x, a.y * b.y); }
static constexpr Point operator/(const Point& a, const Point& b) noexcept { return Point(a.x / b.x, a.y / b.y); }

static constexpr Point operator+(double a, const Point& b) noexcept { return Point(a + b.x, a + b.y); }
static constexpr Point operator-(double a, const Point& b) noexcept { return Point(a - b.x, a - b.y); }
static constexpr Point operator*(double a, const Point& b) noexcept { return Point(a * b.x, a * b.y); }
static constexpr Point operator/(double a, const Point& b) noexcept { return Point(a / b.x, a / b.y); }

// Min/Max - different semantics compared to std.
template<typename T> constexpr T myMin(const T& a, const T& b) noexcept { return b < a ? b : a; }
template<typename T> constexpr T myMax(const T& a, const T& b) noexcept { return a < b ? b : a; }

// Linear interpolation, works with points as well.
template<typename V, typename T>
inline V lerp(const V& a, const V& b, const T& t) noexcept {
  return (a * (1.0 - t)) + (b * t);
}

// Merge a point into a box by possibly increasing its bounds.
inline void boxMergePoint(Box& box, const Point& p) noexcept {
  box.x0 = myMin(box.x0, p.x);
  box.y0 = myMin(box.y0, p.y);
  box.x1 = myMax(box.x1, p.x);
  box.y1 = myMax(box.y1, p.y);
}

void quadBoundingBoxA(const Point bez[3], Box& bBox) noexcept {
  // Bounding box of start and end points.
  bBox.reset(myMin(bez[0].x, bez[2].x), myMin(bez[0].y, bez[2].y),
             myMax(bez[0].x, bez[2].x), myMax(bez[0].y, bez[2].y));

  Point t = (bez[0] - bez[1]) / (bez[0] - bez[1] * 2.0 + bez[2]);

  t.x = myMax(t.x, 0.0);
  t.y = myMax(t.y, 0.0);
  t.x = myMin(t.x, 1.0);
  t.y = myMin(t.y, 1.0);

  boxMergePoint(bBox, lerp(lerp(bez[0], bez[1], t), lerp(bez[1], bez[2], t), t));
}

GCC Output [-std=c++17 -O3 -mavx2 -fno-math-errno]
--
quadBoundingBoxA(Point const*, Box&):
        push rbp
        mov rbp, rsp
        and rsp, -32
        vmovsd xmm1, QWORD PTR [rdi+8]
        vmovsd xmm0, QWORD PTR [rdi]
        vmovsd xmm5, QWORD PTR [rdi+40]
        vmovsd xmm6, QWORD PTR [rdi+32]
        vmaxsd xmm13, xmm5, xmm1
        vmaxsd xmm12, xmm6, xmm0
        vminsd xmm5, xmm5, xmm1
        vminsd xmm6, xmm6, xmm0
        vunpcklpd xmm0, xmm12, xmm13
        vunpcklpd xmm1, xmm6, xmm5
        vmovups XMMWORD PTR [rsi+16], xmm0
        vmovups XMMWORD PTR [rsi], xmm1
        vmovsd xmm2, QWORD PTR [rdi+24]
        vmovsd xmm10, QWORD PTR [rdi+8]
        vmovsd xmm1, QWORD PTR [rdi+40]
        vmovsd xmm7, QWORD PTR [rdi+16]
        vaddsd xmm4, xmm2, xmm2
        vsubsd xmm9, xmm10, xmm2
        vmovsd xmm3, QWORD PTR [rdi]
        vmovsd xmm0, QWORD PTR [r
[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

--- Comment #4 from Petr ---
I think this code is vectorizable without -ffast-math. However, it seems that once a min/max (or something else) is kept scalar, it poisons the rest of the code. The following (scalar) code is optimized perfectly:

```
#include <algorithm>

template<typename T> constexpr T altMinT(const T& a, const T& b) noexcept { return b < a ? b : a; }
template<typename T> constexpr T altMaxT(const T& a, const T& b) noexcept { return a < b ? b : a; }

double std_min(double a, double b) { return std::min(a, b); }
double std_max(double a, double b) { return std::max(a, b); }

double alt_min(double a, double b) { return altMinT(a, b); }
double alt_max(double a, double b) { return altMaxT(a, b); }
```

I think that's the main problem - small samples are optimized well, complex code isn't.
[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

--- Comment #6 from Petr ---
I think the test-case can even be simplified to something like this:

#include <algorithm>
#include <cmath>

struct Point {
  double x, y;

  void reset(double x, double y) {
    this->x = x;
    this->y = y;
  }
};

void f1(Point* p, Point* a) {
  p->reset(std::max(std::sqrt(p->x), a->x),
           std::max(std::sqrt(p->y), a->y));
}

GCC is unable to vectorize it: [-std=c++17 -O3 -mavx2 -fno-math-errno]

f1(Point*, Point*):
        vsqrtsd xmm0, xmm0, QWORD PTR [rdi+8]
        vmovsd xmm1, QWORD PTR [rsi+8]
        vsqrtsd xmm2, xmm2, QWORD PTR [rdi]
        vmaxsd xmm1, xmm1, xmm0
        vmovsd xmm0, QWORD PTR [rsi]
        vmaxsd xmm0, xmm0, xmm2
        vunpcklpd xmm0, xmm0, xmm1
        vmovups XMMWORD PTR [rdi], xmm0
        ret

whereas clang can: [-std=c++17 -O3 -mavx2 -fno-math-errno]

f1(Point*, Point*):
        vsqrtpd xmm0, xmmword ptr [rdi]
        vmovupd xmm1, xmmword ptr [rsi]
        vmaxpd xmm0, xmm1, xmm0
        vmovupd xmmword ptr [rdi], xmm0
        ret

I think this is a much simpler test-case to start with.
[Bug tree-optimization/87105] Autovectorization [X86, SSE2, AVX2, DoublePrecision]
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=87105

--- Comment #16 from Petr ---
Thanks a lot! I hope much more code will benefit from this change.
[Bug c++/70708] New: Suboptimal code generated when using _mm_set_sd (X64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70708

Bug ID: 70708
Summary: Suboptimal code generated when using _mm_set_sd (X64)
Product: gcc
Version: 6.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

The ABI already uses XMM registers for floating point operations. Compare the following two snippets:

#include <xmmintrin.h>

double MyMinV1(double a, double b) {
  return a < b ? a : b;
}

double MyMinV2(double a, double b) {
  __m128d x = _mm_set_sd(a);
  __m128d y = _mm_set_sd(b);
  return _mm_cvtsd_f64(_mm_min_sd(x, y));
}

And the code generated:

MyMinV1(double, double):
        minsd xmm0, xmm1
        ret

MyMinV2(double, double):
        movsd QWORD PTR [rsp-24], xmm0
        movsd QWORD PTR [rsp-16], xmm1
        movsd xmm0, QWORD PTR [rsp-24]
        movsd xmm1, QWORD PTR [rsp-16]
        minsd xmm0, xmm1
        ret

The problem is obvious: the _mm_set_sd() intrinsic generates movsd stores and loads even though the value is already in an XMM register in the right place. I also checked Clang and it generates optimal code for both functions.

You can reproduce the test-case here:
https://gcc.godbolt.org/#compilers:!((compiler:g6,options:'-O2+-Wall+',source:'%23include+%3Cxmmintrin.h%3E%0A%0Adouble+MyMinV1(double+a,+double+b)+%7B%0A++return+a+%3C+b+%3F+a+:+b%3B%0A%7D%0A%0Adouble+MyMinV2(double+a,+double+b)+%7B%0A++__m128d+x+%3D+_mm_set_sd(a)%3B%0A++__m128d+y+%3D+_mm_set_sd(b)%3B%0A++return+_mm_cvtsd_f64(_mm_min_sd(x,+y))%3B%0A%7D%0A')),filterAsm:(commentOnly:!t,directives:!t,intel:!t,labels:!t),version:3

It looks like all GCC versions are affected.
[Bug target/70708] Suboptimal code generated when using _mm_set_sd (X64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70708

--- Comment #3 from Petr ---
Is there any workaround, guys? I was looking for some built-in that would allow me to just cast `double` to `__m128d` without going through `_mm_set_sd()`, leaving the high part undefined.
[Bug sanitizer/81870] New: -fsanitize=undefined doesn't pay attention to __builtin_assume_aligned()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81870

Bug ID: 81870
Summary: -fsanitize=undefined doesn't pay attention to __builtin_assume_aligned()
Product: gcc
Version: 7.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: sanitizer
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
CC: dodji at gcc dot gnu.org, dvyukov at gcc dot gnu.org, jakub at gcc dot gnu.org, kcc at gcc dot gnu.org, marxin at gcc dot gnu.org
Target Milestone: ---

I'm having a problem with GCC's -fsanitize=undefined and the __builtin_assume_aligned() builtin. The following code `sanitizer-test.cpp`:

#include <stdint.h>

static __attribute__((__noinline__)) uint32_t readu32(const void* p) {
  p = __builtin_assume_aligned(p, 1);
  return static_cast<const uint32_t*>(p)[0];
}

static __attribute__((__noinline__)) void writeu32(void* p, uint32_t x) {
  p = __builtin_assume_aligned(p, 1);
  static_cast<uint32_t*>(p)[0] = x;
}

int main(int argc, char* argv[]) {
  char buf[] = { 0, 1, 2, 3, 4, 5, 6 };

  writeu32(buf + 1, 0x44332211);
  uint32_t ret = readu32(buf + 1);

  return static_cast<int>(ret);
}

Compiled as:

gcc-7 -fsanitize=undefined sanitizer-test.cpp -o sanitizer-test

Outputs the following when executed:

$ ./sanitizer-test
sanitizer-test.cpp:10:32: runtime error: store to misaligned address 0x7ffd643f6ab6 for type 'uint32_t', which requires 4 byte alignment
0x7ffd643f6ab6: note: pointer points here
 3f 64 fd 00 01 02 03 04 05 06 00 00 00 00 60 b8 a8 09 b3 55 00 00 b1 f2 ab be 80 7f 00 00 01 00
             ^
sanitizer-test.cpp:5:43: runtime error: load of misaligned address 0x7ffd643f6ab6 for type 'const uint32_t', which requires 4 byte alignment
0x7ffd643f6ab6: note: pointer points here
 3f 64 fd 00 11 22 33 44 05 06 00 00 00 00 60 b8 a8 09 b3 55 00 00 b1 f2 ab be 80 7f 00 00 01 00

I think that in this case the sanitizer should not report a runtime error, as the pointer was marked as being aligned to 1 byte.
[Bug sanitizer/81870] -fsanitize=undefined doesn't pay attention to __builtin_assume_aligned()
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81870

--- Comment #2 from Petr ---
I see, so if I understand it correctly then:

1. `__builtin_assume_aligned()` should be used to promote a pointer to a higher than natural alignment, for example 16 bytes for easier auto-vectorization.
2. `__attribute__((aligned(N)))` should be used to relax the alignment of native types to lower than natural alignment.

It's interesting that with `__builtin_assume_aligned()` I achieved basically the same effect as with `__attribute__((aligned(N)))`, just the sanitizer is not happy. Interestingly, I thought that `__builtin_assume_aligned()` is basically equivalent to the `__assume_aligned()` provided by the Intel and MS compilers.

Anyway, thanks for your answer, I need to fix my code a bit.
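To make the two cases above concrete, here is a minimal sketch; the function names are mine and purely illustrative, not part of the bug report:

```
#include <stdint.h>

// Case 1: __builtin_assume_aligned() promises a (possibly higher) alignment
// to the optimizer; it does not lower the alignment that UBSan checks
// against, which is what this bug report ran into.
static inline uint32_t readu32_promoted(const uint32_t* p) {
  p = static_cast<const uint32_t*>(__builtin_assume_aligned(p, 16));
  return p[0];
}

// Case 2: a typedef that lowers the alignment below the natural one, so both
// the optimizer and the sanitizer know the access can be on any byte boundary.
typedef uint32_t __attribute__((__aligned__(1))) UnalignedUInt32;

static inline uint32_t readu32_typedef(const void* p) {
  return static_cast<const UnalignedUInt32*>(p)[0];
}
```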
[Bug c++/79830] New: GCC generates counterproductive code surrounding very simple loops (improvement request)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

Bug ID: 79830
Summary: GCC generates counterproductive code surrounding very simple loops (improvement request)
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

It seems that GCC tries very hard to optimize loops, but in my case it's counterproductive. I have illustrated the problem in the following C++ code and disassembly.

Loops constructed this way need only one variable (`i`) as a loop counter and use the sign flag to check whether the loop is done or not. Typically such a loop requires a simple check at the beginning (`sub` and `js`) and at the end. The purpose of such a loop is to save registers and to only require minimal code surrounding the loop. However, it seems that GCC tries to convert such a loop into something else and requires a lot of operations to do that, resulting in bigger and slower code.

When using `-Os` GCC produces the code that I would expect; however, I don't want to optimize for size globally. It's not a compiler bug, but I think that in this case this optimization doesn't make any sense and only adds to the executable/library size. I doubt it leads to any improvement, and it would be nice if GCC could somehow learn not to do this for these kinds of loops.

Also, here is a compiler explorer URL, for people wanting to compare: https://godbolt.org/g/oeDGmy

Consider the following C++ code
---
#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

void transform(double* dst, const double* src, const double* matrix, size_t length) {
  __m256d m_00_11 = _mm256_castpd128_pd256(_mm_set_pd(matrix[3], matrix[0]));
  __m256d m_10_01 = _mm256_castpd128_pd256(_mm_set_pd(matrix[1], matrix[2]));
  __m256d m_20_21 = _mm256_broadcast_pd(reinterpret_cast<const __m128d*>(matrix + 4));

  m_00_11 = _mm256_permute2f128_pd(m_00_11, m_00_11, 0);
  m_10_01 = _mm256_permute2f128_pd(m_10_01, m_10_01, 0);

  intptr_t i = static_cast<intptr_t>(length);
  while ((i -= 8) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src + 0);
    __m256d s1 = _mm256_loadu_pd(src + 4);
    __m256d s2 = _mm256_loadu_pd(src + 8);
    __m256d s3 = _mm256_loadu_pd(src + 12);

    __m256d a0 = _mm256_add_pd(_mm256_mul_pd(s0, m_00_11), m_20_21);
    __m256d a1 = _mm256_add_pd(_mm256_mul_pd(s1, m_00_11), m_20_21);
    __m256d a2 = _mm256_add_pd(_mm256_mul_pd(s2, m_00_11), m_20_21);
    __m256d a3 = _mm256_add_pd(_mm256_mul_pd(s3, m_00_11), m_20_21);

    __m256d b0 = _mm256_mul_pd(_mm256_shuffle_pd(s0, s0, 0x1), m_10_01);
    __m256d b1 = _mm256_mul_pd(_mm256_shuffle_pd(s1, s1, 0x1), m_10_01);
    __m256d b2 = _mm256_mul_pd(_mm256_shuffle_pd(s2, s2, 0x1), m_10_01);
    __m256d b3 = _mm256_mul_pd(_mm256_shuffle_pd(s3, s3, 0x1), m_10_01);

    _mm256_storeu_pd(dst + 0, _mm256_add_pd(a0, b0));
    _mm256_storeu_pd(dst + 4, _mm256_add_pd(a1, b1));
    _mm256_storeu_pd(dst + 8, _mm256_add_pd(a2, b2));
    _mm256_storeu_pd(dst + 12, _mm256_add_pd(a3, b3));

    dst += 16;
    src += 16;
  }
  i += 8;

  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);
    __m256d a0 = _mm256_add_pd(_mm256_mul_pd(s0, m_00_11), m_20_21);
    __m256d b0 = _mm256_mul_pd(_mm256_shuffle_pd(s0, s0, 0x1), m_10_01);
    _mm256_storeu_pd(dst, _mm256_add_pd(a0, b0));

    dst += 4;
    src += 4;
  }

  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src + 0);
    __m128d a0 = _mm_add_pd(_mm_mul_pd(s0, _mm256_castpd256_pd128(m_00_11)), _mm256_castpd256_pd128(m_20_21));
    __m128d b0 = _mm_mul_pd(_mm_shuffle_pd(s0, s0, 0x1), _mm256_castpd256_pd128(m_10_01));
    _mm_storeu_pd(dst + 0, _mm_add_pd(a0, b0));
  }
}
Which is compiled to the following -- (-O2 -mavx -fno-exceptions -fno-tree-vectorize). See the comments describing what I didn't like:

transform(double*, double const*, double const*, unsigned long):
        vmovsd xmm4, QWORD PTR [rdx]
        mov r9, rcx
        vmovsd xmm5, QWORD PTR [rdx+16]
        sub r9, 8
        vmovhpd xmm4, xmm4, QWORD PTR [rdx+24]
        vbroadcastf128 ymm6, XMMWORD PTR [rdx+32]
        mov r8, rcx
        vmovhpd xmm5, xmm5, QWORD PTR [rdx+8]
        vperm2f128 ymm4, ymm4, ymm4, 0
        vperm2f128 ymm5, ymm5, ymm5, 0
        js .L6                  ;; <--- Weird
        mov rax, r9
        sub rcx, 16
        mov r8, r9
        and rax, -8
        mov rdx, rsi
        sub rcx, rax
        mov rax, rdi            ;; <--- Weird
.L5:
        vmovupd xmm3, XMMWORD PTR [rdx]
        sub r8, 8
        sub rax, -128
        sub rdx, -128
        vinsertf128 ymm3, ymm3, XMMWORD PTR [rdx-112], 0x1
[Bug tree-optimization/79830] GCC generates counterproductive code surrounding very simple loops (improvement request)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

--- Comment #2 from Petr ---
I'm not sure I follow regarding the exit test. I mean the code should be correct, as each point has an x|y coordinate, which is two doubles, so length 8 means 16 doubles (I converted my production code into a simpler form that uses only native types).

However, I think the problem is also that if this code were handwritten it would only use 3 registers (dst, src, and i), but GCC uses rax|rcx|rdx|rsi|rdi|r8|r9, which is a lot, and the same code in 32-bit mode contains one short spill of a GP register. Basically, if I needed more GP registers inside the function the problem would be much bigger (but I have no clue whether GCC would use a different approach in that case).
[Bug tree-optimization/79830] GCC generates counterproductive code surrounding very simple loops (improvement request)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

--- Comment #3 from Petr ---
Sorry for the misunderstanding, I initially read it as you having replaced the exit condition in the sample code :)
[Bug tree-optimization/79830] GCC generates counterproductive code surrounding very simple loops (improvement request)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79830

--- Comment #4 from Petr ---
I think the test-case can be simplified to the following code. It still suffers from the same issues as mentioned above.

#include <stdint.h>

#if defined(_MSC_VER)
# include <intrin.h>
#else
# include <x86intrin.h>
#endif

void transform(double* dst, const double* src, const double* matrix, size_t length) {
  intptr_t i = static_cast<intptr_t>(length);

  while ((i -= 2) >= 0) {
    __m256d s0 = _mm256_loadu_pd(src);
    _mm256_storeu_pd(dst, _mm256_add_pd(s0, s0));

    dst += 4;
    src += 4;
  }

  if (i & 1) {
    __m128d s0 = _mm_loadu_pd(src);
    _mm_storeu_pd(dst, _mm_add_pd(s0, s0));
  }
}
[Bug inline-asm/79880] New: Gcc refuses to encode vpgatherdd instruction (x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79880

Bug ID: 79880
Summary: Gcc refuses to encode vpgatherdd instruction (x86-64)
Product: gcc
Version: 7.0.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: inline-asm
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

I'm unable to encode the `vpgatherdd xmm, mem, xmm` instruction in inline asm:

void test() {
  __asm(".intel_syntax\n"
        "vpgatherdd xmm4, [r13 + xmm3], xmm4\n"
        ".att_syntax\n");
}

It seems that ICC refuses this construct as well, while clang is fine with it. I'm not sure if it's a bug or whether this form of the instruction is incorrect, but according to the X86 Architecture Reference Manual it's encodable.
[Bug inline-asm/79880] Gcc refuses to encode vpgatherdd instruction (x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79880

--- Comment #4 from Petr ---
In this case DWORD PTR is redundant; nasm and yasm are fine with the syntax I posted as well. It's a simplified test just to show that it won't pass. Try:

__asm(".intel_syntax\n"
      "vpgatherdd xmm4, dword ptr [r13 + xmm3], xmm4\n"
      ".att_syntax\n");

It will end up reporting the same error.
[Bug inline-asm/79880] Gcc refuses to encode vpgatherdd instruction (x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79880

--- Comment #6 from Petr ---
Ok, that's fair enough. I didn't know GCC needs an additional option to switch to a fully compatible Intel syntax. The code that I posted works fine in clang, so sorry about that. And yes, the instruction will #UD, but that was the point initially :)
[Bug c++/77287] New: Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

Bug ID: 77287
Summary: Much worse code generated compared to clang (stack alignment and spills)
Product: gcc
Version: 6.1.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

A simple function (artificial code):

#include <immintrin.h>

int fn(
  const int* px, const int* py, const int* pz, const int* pw,
  const int* pa, const int* pb, const int* pc, const int* pd) {

  __m256i a0 = _mm256_loadu_si256((__m256i*)px);
  __m256i a1 = _mm256_loadu_si256((__m256i*)py);
  __m256i a2 = _mm256_loadu_si256((__m256i*)pz);
  __m256i a3 = _mm256_loadu_si256((__m256i*)pw);
  __m256i a4 = _mm256_loadu_si256((__m256i*)pa);

  __m256i b0 = _mm256_loadu_si256((__m256i*)pb);
  __m256i b1 = _mm256_loadu_si256((__m256i*)pc);
  __m256i b2 = _mm256_loadu_si256((__m256i*)pd);
  __m256i b3 = _mm256_loadu_si256((__m256i*)pc + 1);
  __m256i b4 = _mm256_loadu_si256((__m256i*)pd + 1);

  __m256i x0 = _mm256_packus_epi16(a0, b0);
  __m256i x1 = _mm256_packus_epi16(a1, b1);
  __m256i x2 = _mm256_packus_epi16(a2, b2);
  __m256i x3 = _mm256_packus_epi16(a3, b3);
  __m256i x4 = _mm256_packus_epi16(a4, b4);

  x0 = _mm256_add_epi16(x0, a0);
  x1 = _mm256_add_epi16(x1, a1);
  x2 = _mm256_add_epi16(x2, a2);
  x3 = _mm256_add_epi16(x3, a3);
  x4 = _mm256_add_epi16(x4, a4);

  x0 = _mm256_sub_epi16(x0, b0);
  x1 = _mm256_sub_epi16(x1, b1);
  x2 = _mm256_sub_epi16(x2, b2);
  x3 = _mm256_sub_epi16(x3, b3);
  x4 = _mm256_sub_epi16(x4, b4);

  x0 = _mm256_packus_epi16(x0, x1);
  x0 = _mm256_packus_epi16(x0, x2);
  x0 = _mm256_packus_epi16(x0, x3);
  x0 = _mm256_packus_epi16(x0, x4);

  return _mm256_extract_epi32(x0, 1);
}

Produces the following asm when compiled by GCC (annotated by me):

; GCC 6.1 -O2 -Wall -mavx2 -m32 -fomit-frame-pointer
        lea ecx, [esp+4]                    ; Return address
        and esp, -32                        ; Align the stack to 32 bytes
        push DWORD PTR [ecx-4]              ; Push return address
        push ebp                            ; Save frame-pointer even if I told GCC not to
        mov ebp, esp
        push edi                            ; Save GP regs
        push esi
        push ebx
        push ecx
        sub esp, 296                        ; Reserve stack for YMM spills
        mov eax, DWORD PTR [ecx+16]         ; LOAD 'pa'
        mov esi, DWORD PTR [ecx+4]          ; LOAD 'py'
        mov edi, DWORD PTR [ecx]            ; LOAD 'px'
        mov ebx, DWORD PTR [ecx+8]          ; LOAD 'pz'
        mov edx, DWORD PTR [ecx+12]         ; LOAD 'pw'
        mov DWORD PTR [ebp-120], eax        ; SPILL 'pa'
        mov eax, DWORD PTR [ecx+20]         ; LOAD 'pb'
        mov DWORD PTR [ebp-152], eax        ; SPILL 'pb'
        mov eax, DWORD PTR [ecx+24]         ; LOAD 'pc'
        vmovdqu ymm4, YMMWORD PTR [esi]
        mov ecx, DWORD PTR [ecx+28]         ; LOAD 'pd'
        vmovdqu ymm7, YMMWORD PTR [edi]
        vmovdqa YMMWORD PTR [ebp-56], ymm4  ; SPILL VEC
        vmovdqu ymm4, YMMWORD PTR [ebx]
        mov ebx, DWORD PTR [ebp-152]        ; LOAD 'pb'
        vmovdqa YMMWORD PTR [ebp-88], ymm4  ; SPILL VEC
        vmovdqu ymm4, YMMWORD PTR [edx]
        mov edx, DWORD PTR [ebp-120]        ; LOAD 'pa'
        vmovdqu ymm6, YMMWORD PTR [edx]
        vmovdqa YMMWORD PTR [ebp-120], ymm6 ; SPILL VEC
        vmovdqu ymm0, YMMWORD PTR [ecx]
        vmovdqu ymm6, YMMWORD PTR [ebx]
        vmovdqa ymm5, ymm0                  ; Why move anything when using AVX?
        vmovdqu ymm0, YMMWORD PTR [eax+32]
        vmovdqu ymm2, YMMWORD PTR [eax]
        vmovdqa ymm1, ymm0                  ; Why move anything when using AVX?
        vmovdqu ymm0, YMMWORD PTR [ecx+32]
        vmovdqa YMMWORD PTR [ebp-152], ymm2
        vmovdqa ymm3, ymm0                  ; Why move anything when using AVX?
        vpackuswb ymm0, ymm7, ymm6
        vmovdqa YMMWORD PTR [ebp-184], ymm5 ; SPILL VEC
        vmovdqa YMMWORD PTR [ebp-248], ymm3 ; SPILL VEC
        vmovdqa YMMWORD PTR [ebp-280], ymm0 ; SPILL VEC
        vmovdqa ymm0, YMMWORD PTR [ebp-56]  ; ALLOC VEC
        vmovdqa YMMWORD PTR [ebp-216], ymm1 ; SPILL VEC
        vpackuswb ymm2, ymm0, YMMWORD PTR [ebp-152] ; Uses SPILL slot
        vmovdqa ymm0, YMMWORD PTR [ebp-88]  ; ALLOC VEC
        vpackuswb ymm1, ymm4, YMMWORD PTR [ebp-216] ; Uses SPILL slot
        vpackuswb ymm5, ymm0, YMMWORD PTR [ebp-184] ; Uses SPILL slot
        vmovdqa ymm0, YMMWORD PTR [ebp-120] ; ALLOC VEC
        vpaddw ymm2, ymm2, YMMWORD PTR [ebp-56]   ; Uses SPILL slot
        vpsubw ymm2, ymm2, YMMWORD PTR [ebp-152]  ; Uses SPILL slot
        vpackuswb ymm3, ymm0, YMMWORD PTR [ebp-248] ; Uses
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #2 from Petr ---
With '-mtune=intel' the push/pop sequence is gone, but the YMM register management remains the same - 24 more memory accesses than clang.
[Bug rtl-optimization/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #4 from Petr ---
Adding -fschedule-insns is definitely a huge improvement in this case. I wonder why this doesn't happen by default at -O2 and -Os, as it really improves things and produces shorter output - or is that just in this particular case?

Here is the assembly produced by gcc with -fschedule-insns:

        push ebp
        mov ebp, esp
        and esp, -32
        lea esp, [esp-32]
        mov ecx, DWORD PTR [ebp+8]
        mov edx, DWORD PTR [ebp+32]
        mov eax, DWORD PTR [ebp+36]
        vmovdqu ymm5, YMMWORD PTR [ecx]
        mov ecx, DWORD PTR [ebp+12]
        vmovdqu ymm3, YMMWORD PTR [edx]
        vmovdqu ymm6, YMMWORD PTR [eax]
        vmovdqu ymm2, YMMWORD PTR [ecx]
        mov ecx, DWORD PTR [ebp+28]
        vpackuswb ymm7, ymm2, ymm3
        vpaddw ymm7, ymm7, ymm2
        vpsubw ymm7, ymm7, ymm3
        vmovdqu ymm4, YMMWORD PTR [ecx]
        mov ecx, DWORD PTR [ebp+16]
        vpackuswb ymm0, ymm5, ymm4
        vpaddw ymm0, ymm0, ymm5
        vpsubw ymm0, ymm0, ymm4
        vmovdqu ymm1, YMMWORD PTR [ecx]
        vpackuswb ymm0, ymm0, ymm7
        mov ecx, DWORD PTR [ebp+20]
        vpackuswb ymm2, ymm1, ymm6
        vmovdqu ymm4, YMMWORD PTR [edx+32]
        vpaddw ymm1, ymm2, ymm1
        mov edx, DWORD PTR [ebp+24]
        vpsubw ymm1, ymm1, ymm6
        vmovdqu ymm5, YMMWORD PTR [ecx]
        vpackuswb ymm0, ymm0, ymm1
        vpackuswb ymm3, ymm5, ymm4
        vmovdqa YMMWORD PTR [esp], ymm3
        vmovdqu ymm2, YMMWORD PTR [eax+32]   ; LOOK HERE
        vpaddw ymm5, ymm5, YMMWORD PTR [esp]
        vmovdqu ymm3, YMMWORD PTR [edx]      ; AND HERE
        vpsubw ymm4, ymm5, ymm4
        vpackuswb ymm7, ymm3, ymm2
        vpackuswb ymm0, ymm0, ymm4
        vpaddw ymm3, ymm7, ymm3
        vpsubw ymm2, ymm3, ymm2
        vpackuswb ymm2, ymm0, ymm2
        vpextrd eax, xmm2, 1
        vzeroupper
        leave
        ret

Which is pretty close to clang already; however, look at this part:

        vmovdqa YMMWORD PTR [esp], ymm3      ; Spill YMM3
        vmovdqu ymm2, YMMWORD PTR [eax+32]
        vpaddw ymm5, ymm5, YMMWORD PTR [esp] ; Mem instead of YMM3?
        vmovdqu ymm3, YMMWORD PTR [edx]      ; Old YMM3 becomes dead here

The spill is completely unnecessary in our case, and it's the only reason why the prolog/epilog requires code to perform dynamic stack alignment. I mean, if this one thing is eliminated then GCC basically generates code comparable to clang. But thanks for the -fschedule-insns hint, I didn't know about it.
[Bug c++/103699] New: Reading or writing unaligned integers is wrongly optimized (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

Bug ID: 103699
Summary: Reading or writing unaligned integers is wrongly optimized (GCC-11 and up)
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

I have found a strange issue. When I use __attribute__((aligned(1))) on a type to essentially annotate its lower alignment for memory load/store purposes, the compiler optimizes such loads/stores out without any warnings. I have never had a problem with this in GCC or Clang; only GCC 11+ generates wrong code for me.

Best illustrated in the code below [Compile with -O2 -std=c++17]:

#include <stdint.h>

typedef uint32_t __attribute__((__aligned__(1))) UnalignedUInt32;
typedef uint64_t __attribute__((__aligned__(1))) UnalignedUInt64;

uint32_t byteswap32(uint32_t x) noexcept {
  return (x << 24) | (x >> 24) | ((x << 8) & 0x00FF0000u) | ((x >> 8) & 0x0000FF00u);
}

uint64_t byteswap64(uint64_t x) noexcept {
  return ((x << 56) & 0xFF00000000000000u) | ((x << 40) & 0x00FF000000000000u) |
         ((x << 24) & 0x0000FF0000000000u) | ((x <<  8) & 0x000000FF00000000u) |
         ((x >>  8) & 0x00000000FF000000u) | ((x >> 24) & 0x0000000000FF0000u) |
         ((x >> 40) & 0x000000000000FF00u) | ((x >> 56) & 0x00000000000000FFu);
}

static inline void writeU64be(void* p, uint64_t val) {
  static_cast<UnalignedUInt64*>(p)[0] = byteswap64(val);
}

static inline uint32_t readU32be(const void* p) noexcept {
  uint32_t x = static_cast<const UnalignedUInt32*>(p)[0];
  return byteswap32(x);
}

// Returns 0xBB
uint32_t test_1() {
  uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return array[7];
}

// Returns 0xCC
uint32_t test_2() {
  uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return array[8];
}

// Returns 0xDD
uint32_t test_3() {
  uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return array[9];
}

// Returns 0xEE
uint32_t test_4() {
  uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return array[10];
}

// Returns 0708090A - the write has no effect when read with readU32be()
uint32_t test_u32() {
  uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return readU32be(array + 7);
}

So I'm wondering, is this correct behavior? It seems like a bug in the optimizer to me, because when the code is dynamic (the data is not constant) it seems to work as expected. I found it through a failing unit test (GCC 11 is the only compiler that fails).

Compiler Explorer: https://godbolt.org/z/9G9cx83oq
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #2 from Petr ---
If you compile this with clang, the function test_u32() will correctly return the expected 0xBBCCDDEE and not 0x0708090A. If you compile with an older GCC, like GCC 10, the test also returns 0xBBCCDDEE. Only GCC-11 and upward return 0x0708090A.

The documentation here:

https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#Common-Variable-Attributes

states that "When used as part of a typedef, the aligned attribute can both increase and decrease alignment, and specifying the packed attribute generates a warning." - it explicitly allows creating a type with lower alignment, so I don't consider this undefined behavior - and in general UBSAN is fine with this construct.
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #3 from Petr ---
BTW this almost seems like an optimizer bug, because if you compile the code with GCC 11 without optimizations (or with -O1), it also returns the expected value - only an optimized compilation with GCC 11 returns the wrong one.
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #4 from Petr ---
Additional test case:

#include <stdio.h>
#include <stdint.h>

typedef uint32_t __attribute__((__aligned__(1))) UnalignedUInt32;
typedef uint64_t __attribute__((__aligned__(1))) UnalignedUInt64;

uint32_t byteswap32(uint32_t x) noexcept {
  return (x << 24) | (x >> 24) | ((x << 8) & 0x00FF0000u) | ((x >> 8) & 0x0000FF00u);
}

uint64_t byteswap64(uint64_t x) noexcept {
  return ((x << 56) & 0xFF00000000000000u) | ((x << 40) & 0x00FF000000000000u) |
         ((x << 24) & 0x0000FF0000000000u) | ((x <<  8) & 0x000000FF00000000u) |
         ((x >>  8) & 0x00000000FF000000u) | ((x >> 24) & 0x0000000000FF0000u) |
         ((x >> 40) & 0x000000000000FF00u) | ((x >> 56) & 0x00000000000000FFu);
}

static inline void writeU64be(void* p, uint64_t val) {
  static_cast<UnalignedUInt64*>(p)[0] = byteswap64(val);
}

static inline uint32_t readU32be(const void* p) noexcept {
  uint32_t x = static_cast<const UnalignedUInt32*>(p)[0];
  return byteswap32(x);
}

// Returns 0708090A
uint32_t test_u32() {
  uint8_t array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return readU32be(array + 7);
}

static uint8_t array_static[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};

int main() {
  printf("%08X\n", test_u32());

  writeU64be(array_static + 6, 0xAABBCCDDEEFF1213);
  printf("%08X\n", readU32be(array_static + 7));

  return 0;
}

It prints:

0708090A
BBCCDDEE

Clang prints:

BBCCDDEE
BBCCDDEE
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #6 from Petr ---
For now I have disabled unaligned load/store optimizations in my projects when dealing with GCC 11 and upwards. I still think that GCC is wrong in this case regardless of strict aliasing. The code in test_u32() is essentially creating a constant, and GCC 11+ is the only compiler that returns it wrong, and inconsistently between optimization levels at that.
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #8 from Petr ---
My only problem is that A returns a different value compared to B, C, and D:

uint32_t test_u32_a() {
  char array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return readU32be(array + 7);
}

uint32_t test_u32_b() {
  static char array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return readU32be(array + 7);
}

uint32_t test_u32_c() {
  thread_local char array[16] {0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15};
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return readU32be(array + 7);
}

uint32_t test_u32_d(char array[16]) {
  writeU64be(array + 6, 0xAABBCCDDEEFF1213);
  return readU32be(array + 7);
}

And when you compile this, you can actually see that ALL the functions evaluate to a constant (because it's known what the output will be), but only in the A case the constant is different (of course, because B, C, and D have side effects):

test_u32_a():
        mov eax, 117967114
        ret
test_u32_b():
        movabs rax, 1374442237689904042
        mov QWORD PTR test_u32_b()::array[rip+6], rax
        mov eax, -1144201746
        ret
test_u32_c():
        movabs rax, 1374442237689904042
        mov QWORD PTR fs:test_u32_c()::array@tpoff+6, rax
        mov eax, -1144201746
        ret
test_u32_d(char*):
        movabs rax, 1374442237689904042
        mov QWORD PTR [rdi+6], rax
        mov eax, -1144201746
        ret

So yeah, we can talk about breaking strict aliasing here, but it's just inconsistent. I would simply expect all the functions to return the same value.
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #10 from Petr ---
Well, the problem is that when you compile it with "-fsanitize=undefined" it won't report any undefined behavior, and the function returns the expected value.

I even tried to make everything constexpr - and constexpr by definition should never involve undefined behavior, right? But GCC 11 compiles the code with constexpr, doesn't complain about undefined behavior, and again returns the wrong value.

What I miss here, from a user perspective, is some kind of diagnostics. If you remove code that provably has an effect on the code following it, why not complain, at least in the constexpr case?
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #12 from Petr ---
Is there a way to diagnose this? To tell GCC to report transformations that basically cause wrong results to be returned?

In my code base I have unaligned memory loads/stores abstracted, so I can implement whatever compiler-specific construct I need to make such accesses optimized or friendly to the compiler - and this method that I have shown here, I didn't invent it. I would even say that I only see this error in unit tests, which can basically be constant evaluated, as shown in my previous examples. But it was kinda surprising that GCC 11+ is the only compiler that started failing my tests, and that basically no analyzer complains about this (not even clang static analysis, for example).
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #15 from Petr ---
Unfortunately GCC doesn't report any issues even with `-Wstrict-aliasing=1`.

BTW now I know I must use the may_alias attribute to get what I want, and this is what I'm going to do (see the sketch below); however, from a user perspective I'm not really happy, as GCC does something surprising very silently. Personally, I would be happier if my code didn't compile at all than have it compiled with bugs.
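For reference, a minimal sketch of what the may_alias-based annotation could look like; the typedef and function names are mine and purely illustrative, not taken from the actual code base:

```
#include <stdint.h>

// Sketch: an unaligned, aliasing-safe 32-bit load. The typedef combines
// aligned(1) with may_alias, so the optimizer can assume neither natural
// alignment nor strict-aliasing-compatible types for accesses through it.
typedef uint32_t __attribute__((__aligned__(1), __may_alias__)) UnalignedMayAliasUInt32;

static inline uint32_t readU32_unaligned(const void* p) noexcept {
  return static_cast<const UnalignedMayAliasUInt32*>(p)[0];
}
```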
[Bug c++/103699] Reading or writing a constant unaligned value is wrongly optimized causing an incorrect result (GCC-11 and up)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103699

--- Comment #17 from Petr ---
Guys, thanks a lot for your feedback. Is the may_alias annotation guaranteed to behave as expected in future versions of GCC too, or is it just so much UB that it's better to do unaligned reads with memcpy?

What I like about my existing solution is that I can specify the alignment, so for example I can say that this 64-bit load is 4-byte aligned, and that hints the compiler; I'm not sure how to hint that with memcpy (see the sketch below). So far annotating the code with may_alias works for me, it passes the tests, but I'm kinda unsure whether this is a good way to express unaligned memory access considering I have faced this issue.
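As an illustration of the memcpy alternative being asked about, here is a minimal sketch; the assumption here (mine, not from the thread) is that the alignment hint would have to be carried by __builtin_assume_aligned(), since memcpy itself cannot express it:

```
#include <stdint.h>
#include <string.h>

// Sketch: a 64-bit load from a pointer that is only guaranteed to be
// 4-byte aligned. memcpy expresses the (possibly unaligned) access and
// __builtin_assume_aligned() carries the 4-byte hint to the optimizer.
static inline uint64_t readU64_align4(const void* p) noexcept {
  uint64_t x;
  memcpy(&x, __builtin_assume_aligned(p, 4), sizeof(x));
  return x;
}
```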
[Bug target/77287] Much worse code generated compared to clang (stack alignment and spills)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77287

--- Comment #6 from Petr ---
Yes, the code is not really doing anything useful, I only wrote it to demonstrate the spills problem. Clang actually outsmarted me by removing half of the code :)

I think this issue can be closed, I cannot repro this with the newest GCC.
[Bug c++/116738] New: Constant folding of _mm_min_ss and _mm_max_ss is wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116738

Bug ID: 116738
Summary: Constant folding of _mm_min_ss and _mm_max_ss is wrong
Product: gcc
Version: 14.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: kobalicek.petr at gmail dot com
Target Milestone: ---

GCC incorrectly optimizes x86 intrinsics that have a defined operation at the ISA level. It seems that the problem happens when a value is known at compile time, hence constant folding uses a different operation compared to what the CPU does when it executes the instruction.

Here is the definition of [V]MINSS:

```
MIN(SRC1, SRC2) {
  IF ((SRC1 = 0.0) and (SRC2 = 0.0)) THEN DEST := SRC2;
    ELSE IF (SRC1 = NaN) THEN DEST := SRC2; FI;
    ELSE IF (SRC2 = NaN) THEN DEST := SRC2; FI;
    ELSE IF (SRC1 < SRC2) THEN DEST := SRC1;
    ELSE DEST := SRC2;
  FI;
}
```

So, it's clear that SRC1 is only selected when the ordered comparison `SRC1 < SRC2` is true. However, GCC doesn't seem to respect this detail. Here is a test case that I was able to craft:

```
// Demonstration of a GCC bug in constant folding of SIMD intrinsics:

#include <immintrin.h>
#include <stdio.h>
#include <limits>

float clamp(float f) {
  __m128 v = _mm_set_ss(f);
  __m128 zero = _mm_setzero_ps();
  __m128 greatest = _mm_set_ss(std::numeric_limits<float>::max());

  v = _mm_min_ss(v, greatest);
  v = _mm_max_ss(v, zero);

  return _mm_cvtss_f32(v);
}

int main() {
  printf("clamp(-0) -> %f\n", clamp(-0.0f));
  printf("clamp(nan) -> %f\n", clamp(std::numeric_limits<float>::quiet_NaN()));
  return 0;
}
```

GCC results (wrong):

clamp(-0) -> -0.00
clamp(nan) -> nan

Clang results (expected):

clamp(-0) -> 0.00
clamp(nan) -> 340282346638528859811704183484516925440.00

Here is a compiler explorer link:

- https://godbolt.org/z/6afjoaj86

I'm aware this is a possible duplicate of an [UNCONFIRMED] bug:

- https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99497

However, fast-math is mentioned in that bug report, and I'm not interested in fast-math at all; I'm not using that option.

This bug makes it impossible to create test cases for the implementation of some optimized functions that I use, as the tests fail with GCC but other compilers produce correct results. A possible workaround is to use the _ps instead of the _ss variant of the intrinsics, but that's also something I would like to avoid, as in some cases I really work with a scalar value only.

Also interestingly, when compiled by GCC in debug mode (without optimizations) GCC behaves correctly, so this bug is related to the optimization pipeline as well. I'm not aware of any UB in this test case.
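For illustration, the _ps workaround mentioned above would look roughly like this; a sketch on my part, not code from the report, and it relies on the extra lanes of _mm_set_ss() being zero:

```
#include <immintrin.h>
#include <limits>

// Sketch of the workaround: packed min/max instead of the scalar _ss forms.
// Lane 0 carries the value; lanes 1-3 start as zero and stay zero after clamping.
float clamp_ps(float f) {
  __m128 v = _mm_set_ss(f);
  v = _mm_min_ps(v, _mm_set1_ps(std::numeric_limits<float>::max()));
  v = _mm_max_ps(v, _mm_setzero_ps());
  return _mm_cvtss_f32(v);
}
```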
[Bug target/116738] Constant folding of _mm_min_ss and _mm_max_ss is wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116738

--- Comment #3 from Petr ---
Maybe marking it as confirmed would be appropriate then? As a workaround, I think it would be better to not constant fold code that GCC cannot compute properly - that would mean calculating the values properly at runtime. I have no idea what else this would impact, though.
[Bug target/116738] Constant folding of _mm_min_ss and _mm_max_ss is wrong
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116738

--- Comment #7 from Petr ---
The simplified test case looks good except for a missing return :)