[Bug middle-end/78115] New: Missed optimization for "int modulo 2^31"
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78115

Bug ID: 78115
Summary: Missed optimization for "int modulo 2^31"
Product: gcc
Version: 6.2.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: tkoeppe at google dot com
Target Milestone: ---

Consider the operation of mapping an int to the unique modular representative in [0, 2^31).

Readable code:

    #include <climits>

    int mod31(int num) {
      if (num < 0) {
        num = num + 1 + INT_MAX;
      }
      return num;
    }

Paranoid bit-shifter's code:

    int mod31shift(int num) {
      return static_cast<unsigned int>(num) % (1U << 31);
    }

Clang generates the same machine code for both, but GCC does not: https://godbolt.org/g/2BjNqA
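The claim that both functions compute the same value can be spot-checked with a small sketch. It assumes a 32-bit int, and the helper mod31_agree is a hypothetical name that does not appear in the report.

```cpp
#include <climits>

// Readable version from the report.
int mod31(int num) {
  if (num < 0) {
    // num + 1 is at most 0 here, so adding INT_MAX cannot overflow.
    num = num + 1 + INT_MAX;
  }
  return num;
}

// Bit-level version from the report (assumes a 32-bit unsigned int).
int mod31shift(int num) {
  return static_cast<unsigned int>(num) % (1U << 31);
}

// Hypothetical helper, not part of the report: true if both versions
// return the same representative for num.
bool mod31_agree(int num) {
  return mod31(num) == mod31shift(num);
}
```

Calling mod31_agree on boundary values such as INT_MIN, -1, 0 and INT_MAX is a cheap sanity check before comparing the generated machine code.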
[Bug c++/62006] New: Bad code generation with -O3 (possibly due to -ftree-partial-pre)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006

Bug ID: 62006
Summary: Bad code generation with -O3 (possibly due to -ftree-partial-pre)
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: tkoeppe at google dot com

Created attachment 33231 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33231&action=edit
Demonstrates bad code generation with -O3

I was writing some code to demonstrate custom allocators with fancy pointers. While testing it with GCC, I noticed memory corruption when compiling with -O3. Jonathan Wakely had a quick look and narrowed it down to "-O2 -ftree-partial-pre"; with just "-O2" the code works.

I don't have any insight into the problem, so I'm just attaching the full code. You can tell that it's broken by passing the program through valgrind, or simply by noting that it prints all container elements as zero, rather than their actual values.

(Incidentally, I believe there's a similar problem in Clang: http://llvm.org/bugs/show_bug.cgi?id=20524)
[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006 --- Comment #4 from Thomas Köppe --- Ah, you're right, this offset pointer computation as it stands is undefined behaviour. The intended use is to use those pointers only within a preallocated arena, so that the pointers would indeed live in a common object (a large array). I shall change the allocator to an arena allocator and rerun the test. The intended use for offset pointers is inter-process communication; a vector with the fancy-pointer allocator can be used from two separate processes.
[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006 --- Comment #6 from Thomas Köppe --- Created attachment 33236 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33236&action=edit Fixed demo that doesn't have UB on account of invalid pointer arithmetic Here's a (very lazily) fixed version of the code that allocates from an arena that is a single, large array. The same problem persists in both GCC and Clang.
[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006 --- Comment #9 from Thomas Köppe --- Argh, yes, I compiled the wrong file... indeed, the arena version works with GCC 4.8.2 for me, too, and in Clang as well. So... not an issue, I suppose? The desired real application will be for containers allocated in shared memory, which is presumably obtained from some opaque system feature.
[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006 --- Comment #11 from Thomas Köppe --- Created attachment 33240 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33240&action=edit Further fixing: Uses uintptr_ts for the difference On Jonathan's suggestion I changed the distance computation to go through a uintptr_t conversion. Jonathan suggested compiling with -fno-elide-constructors, and indeed the attached code breaks when that option is passed. As you said, the UB caused by the distance computations of automatic objects seems to be the stumbling point.
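For readers unfamiliar with the technique, here is a minimal sketch of an offset pointer that computes its distance through uintptr_t. It is not the code from the attachments: the class name offset_ptr and all of its members are hypothetical, and converting an integer back to a pointer is implementation-defined, so treat it as an illustration only.

```cpp
#include <cstdint>

// Hypothetical minimal offset pointer: it stores the distance from its own
// address to the target, so the pair stays consistent if the memory that
// contains both is mapped at a different base address.
template <typename T>
class offset_ptr {
 public:
  offset_ptr() : offset_(0) {}
  explicit offset_ptr(T* p) { set(p); }
  // Copies must recompute the offset relative to the new object's address.
  offset_ptr(const offset_ptr& other) { set(other.get()); }
  offset_ptr& operator=(const offset_ptr& other) {
    set(other.get());
    return *this;
  }

  T* get() const {
    if (offset_ == 0) return nullptr;  // offset 0 encodes a null pointer
    return reinterpret_cast<T*>(self() + offset_);
  }
  T& operator*() const { return *get(); }

 private:
  std::uintptr_t self() const {
    return reinterpret_cast<std::uintptr_t>(this);
  }
  void set(T* p) {
    // Unsigned subtraction wraps around and is always defined, unlike
    // subtracting two pointers that point into different objects.
    offset_ = p ? reinterpret_cast<std::uintptr_t>(p) - self() : 0;
  }

  std::uintptr_t offset_;
};
```

One consequence of encoding null as offset 0 is that such a pointer cannot point at itself; a real implementation might pick a different sentinel.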
[Bug tree-optimization/66713] New: atomic compare_exchange_strong creates spurious store for x86-64 at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66713

Bug ID: 66713
Summary: atomic compare_exchange_strong creates spurious store for x86-64 at -O3
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tkoeppe at google dot com
Target Milestone: ---

The code here: https://goo.gl/CiV4pl compares a hand-written CAS with the C++11 atomic one. On Clang, the code comes out identical. On GCC 4.9.2 and later there is an extra store ("movq %rdi, -8(%rsp)"). Should that not be there?
[Bug middle-end/66713] atomic compare_exchange_strong creates spurious store for x86-64 at -O3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66713 --- Comment #3 from Thomas Köppe --- Note: The code in question, and the hand-written assembly, are taken from the ZMQ library: https://github.com/zeromq/libzmq/blob/master/src/atomic_ptr.hpp I added the C++11 atomic support recently.
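As a rough illustration of the C++11 side of that comparison, a pointer CAS built on compare_exchange_strong might look like the sketch below. The free function cas and its parameter names are assumptions made for this example; this is not the code behind the link or the library's actual implementation.

```cpp
#include <atomic>

// Hypothetical wrapper: atomically sets ptr to val if it currently equals
// cmp, and returns the value that ptr held before the operation.
template <typename T>
T* cas(std::atomic<T*>& ptr, T* cmp, T* val) {
  // compare_exchange_strong takes its expected value by reference and
  // overwrites it with the observed value when the comparison fails.
  // On success cmp is left unchanged, so it holds the old value either way.
  ptr.compare_exchange_strong(cmp, val, std::memory_order_acq_rel);
  return cmp;
}
```

Because the expected value is passed by reference, the local copy cmp has to be addressable at the language level. Whether that is related to the extra store is not established in the report.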
[Bug middle-end/66881] New: Possibly inefficient std::atomic codegen on x86 for simple arithmetic
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66881

Bug ID: 66881
Summary: Possibly inefficient std::atomic codegen on x86 for simple arithmetic
Product: gcc
Version: 4.9.2
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: tkoeppe at google dot com
Target Milestone: ---

Consider these two simple versions of addition:

    #include <atomic>

    std::atomic<int> x;
    int y;

    void f(int a) {
      x.store(x.load(std::memory_order_relaxed) + a, std::memory_order_relaxed);
    }

    void g(int a) {
      y += a;
    }

GCC generates the following assembly:

    f(int):
            mov     eax, DWORD PTR x[rip]
            add     edi, eax
            mov     DWORD PTR x[rip], edi
            ret
    g(int):
            add     DWORD PTR y[rip], edi
            ret

Now, it is clear to me that the correct atomic codegen for store() and load() is "mov", as it appears here, but why aren't the two consecutive operations folded into a single add? Aren't the semantics and the memory ordering the same? x86 says that (most) "reads" and "writes" are strongly ordered; doesn't that apply to the read and write produced by "add", too?

(My original motivation came from a variant of this with floats, where the non-atomic code executed noticeably faster, even though I would have expected the two to produce the same machine code.)
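The float variant mentioned at the end is not shown in the report. A guess at its shape, with hypothetical names xf, yf, f_float and g_float, could be:

```cpp
#include <atomic>

// Hypothetical float counterpart of the int example; the names are
// assumptions and do not come from the report.
std::atomic<float> xf{0.0f};
float yf = 0.0f;

// Relaxed load followed by a relaxed store. These are two separate atomic
// operations, not a single atomic read-modify-write.
void f_float(float a) {
  xf.store(xf.load(std::memory_order_relaxed) + a, std::memory_order_relaxed);
}

// Plain non-atomic accumulation.
void g_float(float a) {
  yf += a;
}
```

Compiling both functions with the same flags and comparing the assembly is one way to reproduce the kind of difference the report describes.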