[Bug middle-end/78115] New: Missed optimization for "int modulo 2^31"

2016-10-26 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78115

Bug ID: 78115
   Summary: Missed optimization for "int modulo 2^31"
   Product: gcc
   Version: 6.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoeppe at google dot com
  Target Milestone: ---

Consider the operation of mapping an int to the unique modular representative
in [0, 2^31).


Readable code:

#include <climits>

int mod31(int num) {
  if (num < 0) { num = num + 1 + INT_MAX; }
  return num;
}


Paranoid bit-shifter's code:

int mod31shift(int num) {
  return static_cast<unsigned int>(num) % (1U << 31);
}

Clang generates the same machine code for both, but GCC does not:
https://godbolt.org/g/2BjNqA
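
For reference, here is a minimal sketch that places the two spellings next to a third one and checks that they agree. It assumes a 32-bit int and assumes the cast target in the shift version is unsigned int; mod31mask and all_agree are hypothetical names added for illustration and are not part of the report.

```cpp
#include <climits>

static_assert(INT_MAX == 2147483647, "this sketch assumes a 32-bit int");

// Branching form from the report: add 2^31 to negative inputs.
// num + 1 is at most 0 here, so adding INT_MAX cannot overflow.
int mod31(int num) {
  if (num < 0) { num = num + 1 + INT_MAX; }
  return num;
}

// Modulo form from the report (cast target assumed to be unsigned int).
int mod31shift(int num) {
  return static_cast<unsigned int>(num) % (1U << 31);
}

// Hypothetical third spelling: clear the sign bit with a mask.
int mod31mask(int num) {
  return static_cast<int>(static_cast<unsigned int>(num) & 0x7fffffffU);
}

// True when all three spellings return the same value for this input.
bool all_agree(int num) {
  return mod31(num) == mod31shift(num) && mod31(num) == mod31mask(num);
}
```

Since every version reduces to clearing the top bit of the two's-complement representation, one would hope for identical machine code from all three.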

[Bug c++/62006] New: Bad code generation with -O3 (possibly due to -ftree-partial-pre)

2014-08-04 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006

Bug ID: 62006
   Summary: Bad code generation with -O3 (possibly due to
-ftree-partial-pre)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoeppe at google dot com

Created attachment 33231
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33231&action=edit
Demonstrates bad code generation with -O3

I was writing some code to demonstrate custom allocators with fancy pointers.
While testing it with GCC, I noticed memory corruption when compiling with -O3.
Jonathan Wakely had a quick look and narrowed it down to "-O2
-ftree-partial-pre"; with just "-O2" the code works.

I don't have any insight into the problem, so I'm just attaching the full code.
You can tell that it's broken by passing the program through valgrind, or
simply by noting that it prints all container elements as zero, rather than
their actual values.

(Incidentally, I believe there's a similar problem in Clang:
http://llvm.org/bugs/show_bug.cgi?id=20524)


[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)

2014-08-04 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006

--- Comment #4 from Thomas Köppe  ---
Ah, you're right, this offset pointer computation as it stands is undefined
behaviour. The intended use is to use those pointers only within a preallocated
arena, so that the pointers would indeed live in a common object (a large
array).

I shall change the allocator to an arena allocator and rerun the test.

The intended use for offset pointers is inter-process communication; a
vector with the fancy-pointer allocator can be used from two separate
processes.

[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)

2014-08-04 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006

--- Comment #6 from Thomas Köppe  ---
Created attachment 33236
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33236&action=edit
Fixed demo that doesn't have UB on account of invalid pointer arithmetic

Here's a (very lazily) fixed version of the code that allocates from an arena
that is a single, large array.

The same problem persists in both GCC and Clang.
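
To illustrate what "allocates from an arena that is a single, large array" can look like, here is a minimal sketch of a bump allocator. It is not the attached code: it hands out plain pointers rather than fancy pointers, and arena_base, arena_used, arena_allocator, kArenaSize and sum_from_arena are all hypothetical names.

```cpp
#include <cstddef>
#include <new>
#include <vector>

constexpr std::size_t kArenaSize = 1 << 16;

// The single large array that every allocation comes from.
inline unsigned char* arena_base() {
  alignas(std::max_align_t) static unsigned char buffer[kArenaSize];
  return buffer;
}

// Number of bytes handed out so far.
inline std::size_t& arena_used() {
  static std::size_t used = 0;
  return used;
}

template <typename T>
struct arena_allocator {
  using value_type = T;

  arena_allocator() = default;
  template <typename U>
  arena_allocator(const arena_allocator<U>&) noexcept {}

  T* allocate(std::size_t n) {
    // Round the current position up to T's alignment, then bump it.
    const std::size_t align = alignof(T);
    const std::size_t start = (arena_used() + align - 1) / align * align;
    const std::size_t bytes = n * sizeof(T);
    if (start + bytes > kArenaSize) throw std::bad_alloc();
    arena_used() = start + bytes;
    return reinterpret_cast<T*>(arena_base() + start);
  }

  // A bump allocator never reclaims memory.
  void deallocate(T*, std::size_t) noexcept {}
};

template <typename T, typename U>
bool operator==(const arena_allocator<T>&, const arena_allocator<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const arena_allocator<T>&, const arena_allocator<U>&) { return false; }

// Usage: a vector whose storage lives entirely inside the arena.
int sum_from_arena() {
  std::vector<int, arena_allocator<int>> v;
  for (int i = 1; i <= 3; ++i) v.push_back(i);
  int sum = 0;
  for (int value : v) sum += value;
  return sum;
}
```

With every element inside one array, pointer differences between elements are taken within a single object.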

[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)

2014-08-04 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006

--- Comment #9 from Thomas Köppe  ---
Argh, yes, I compiled the wrong file... indeed, the arena version works with
GCC 4.8.2 for me, too, and in Clang as well.

So... not an issue, I suppose?

The desired real application will be for containers allocated in shared memory,
which is presumably obtained from some opaque system feature.

[Bug tree-optimization/62006] Bad code generation with -O3 (possibly due to -ftree-partial-pre)

2014-08-04 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62006

--- Comment #11 from Thomas Köppe  ---
Created attachment 33240
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=33240&action=edit
Further fixing: Uses uintptr_ts for the difference

On Jonathan's suggestion I changed the distance computation to go through a
uintptr_t conversion.

Jonathan suggested compiling with -fno-elide-constructors, and indeed the
attached code breaks when that option is passed. As you said, the UB caused by
the distance computations of automatic objects seems to be the stumbling point.
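
For readers unfamiliar with the idea, here is a minimal sketch of an offset pointer whose distance computation goes through uintptr_t. It is not the attached code; offset_ptr, Block and read_through_copy are hypothetical names, offset 0 is used as a simplified null representation, and converting the integer back to a pointer is implementation-defined.

```cpp
#include <cstdint>

// Stores the distance from its own address to the target, so the pair
// (pointer, target) stays valid if both are mapped at another base address.
template <typename T>
class offset_ptr {
 public:
  offset_ptr() : offset_(0) {}
  explicit offset_ptr(T* target) : offset_(distance_to(target)) {}

  // Copies must recompute the offset, because the copy has a different address.
  offset_ptr(const offset_ptr& other) : offset_(distance_to(other.get())) {}
  offset_ptr& operator=(const offset_ptr& other) {
    offset_ = distance_to(other.get());
    return *this;
  }

  T* get() const {
    if (offset_ == 0) return nullptr;  // simplification: 0 means null
    return reinterpret_cast<T*>(self() + offset_);
  }
  T& operator*() const { return *get(); }

 private:
  std::uintptr_t self() const {
    return reinterpret_cast<std::uintptr_t>(this);
  }
  // Unsigned subtraction wraps, so targets below this object work too.
  std::uintptr_t distance_to(T* target) const {
    return target ? reinterpret_cast<std::uintptr_t>(target) - self() : 0;
  }

  std::uintptr_t offset_;
};

// Usage: pointer slots and data live in one common object.
struct Block {
  int values[4];
  offset_ptr<int> first;
  offset_ptr<int> second;
};

int read_through_copy() {
  Block block;
  block.values[2] = 30;
  block.first = offset_ptr<int>(&block.values[2]);
  block.second = block.first;  // different address, recomputed offset
  return block.second.get() == &block.values[2] ? *block.second : -1;
}
```

The integer arithmetic avoids subtracting pointers into unrelated objects, but the sketch makes no claim about how any particular optimizer treats the round trip.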

[Bug tree-optimization/66713] New: atomic compare_exchange_strong creates spurious store for x86-64 at -O3

2015-06-30 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66713

Bug ID: 66713
   Summary: atomic compare_exchange_strong creates spurious store
for x86-64 at -O3
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoeppe at google dot com
  Target Milestone: ---

The code here: https://goo.gl/CiV4pl

compares a hand-written CAS with the C++11 atomic one. On Clang, the code comes
out identical. On GCC 4.9.2 and later there is an extra store ("movq %rdi,
-8(%rsp)"). Should that not be there?
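
As a rough illustration of the C++11 side of such a comparison, here is a minimal sketch of a pointer CAS wrapper. The linked code is not reproduced here; cas and cas_demo are hypothetical names, and the memory order is an assumption. Note that compare_exchange_strong takes the expected value by reference and writes the observed value back on failure, which may be relevant to the extra store.

```cpp
#include <atomic>

// If ptr holds `expected`, replace it with `desired`.
// Returns the value that ptr held before the call in either case.
template <typename T>
T* cas(std::atomic<T*>& ptr, T* expected, T* desired) {
  // On failure, `expected` is overwritten with the value actually observed.
  ptr.compare_exchange_strong(expected, desired, std::memory_order_acq_rel);
  return expected;
}

// Usage: one successful and one failing exchange.
bool cas_demo() {
  int a = 1, b = 2;
  std::atomic<int*> p(&a);
  int* seen_first = cas(p, &a, &b);   // succeeds: p was &a, becomes &b
  int* seen_second = cas(p, &a, &a);  // fails: p is &b, stays &b
  return seen_first == &a && seen_second == &b && p.load() == &b;
}
```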


[Bug middle-end/66713] atomic compare_exchange_strong creates spurious store for x86-64 at -O3

2015-07-01 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66713

--- Comment #3 from Thomas Köppe  ---
Note: The code in question, and the hand-written assembly, are taken from the
ZMQ library:

https://github.com/zeromq/libzmq/blob/master/src/atomic_ptr.hpp

I added the C++11 atomic support recently.

[Bug middle-end/66881] New: Possibly inefficient std::atomic codegen on x86 for simple arithmetic

2015-07-15 Thread tkoeppe at google dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66881

Bug ID: 66881
   Summary: Possibly inefficient std::atomic codegen on x86
for simple arithmetic
   Product: gcc
   Version: 4.9.2
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: tkoeppe at google dot com
  Target Milestone: ---

Consider these two simple versions of addition:

  #include <atomic>

  std::atomic<int> x;
  int y;

  void f(int a) {
    x.store(x.load(std::memory_order_relaxed) + a, std::memory_order_relaxed);
  }

  void g(int a) {
    y += a;
  }

GCC generates the following assembly:

  f(int):
    mov eax, DWORD PTR x[rip]
    add edi, eax
    mov DWORD PTR x[rip], edi
    ret

  g(int):
    add DWORD PTR y[rip], edi
    ret

Now, it is clear to me that the correct atomic codegen for store() and load()
is "mov", as it appears here, but why aren't the two consecutive operations
folded into a single add? Aren't the semantics and the memory ordering the
same? x86 says that (most) "reads" and "writes" are strongly ordered; doesn't
that apply to the read and write produced by "add", too?

(My original motivation came from a variant of this with floats, where the
non-atomic code executed noticeably faster, even though I would have expected
the two to produce the same machine code.)
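
To make the distinction concrete, here is a minimal sketch contrasting the separate load and store from f() with a single read-modify-write. The names counter, add_load_store, add_fetch and single_threaded_total are hypothetical. The first form is two independent atomic operations, so a concurrent write landing between them can be lost; the second is one indivisible update, which on x86 typically requires a lock-prefixed instruction.

```cpp
#include <atomic>

std::atomic<int> counter{0};

// Two separate atomic operations, as in f(): a relaxed load, then a relaxed store.
void add_load_store(int a) {
  counter.store(counter.load(std::memory_order_relaxed) + a,
                std::memory_order_relaxed);
}

// One atomic read-modify-write.
void add_fetch(int a) {
  counter.fetch_add(a, std::memory_order_relaxed);
}

// Usage: in a single thread both forms produce the same result.
int single_threaded_total() {
  counter.store(0, std::memory_order_relaxed);
  add_load_store(5);
  add_fetch(7);
  return counter.load(std::memory_order_relaxed);
}
```

Because the load/store form makes no atomicity promise for the pair, folding it into a plain "add" to memory would not appear to weaken it; whether the compiler is willing to do so is the question raised here.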