[Bug target/88524] New: PLT32 relocation is off by 4

2018-12-16 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

Bug ID: 88524
   Summary: PLT32 relocation is off by 4
   Product: gcc
   Version: 7.3.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

Consider the following example of code compiled with -fpic -mcmodel=small.
There is an external function func(), and we store a relative reference to its
@plt stub in a 32-bit variable.

The following seems to generate correct offsets (@plt is already relative, so
we can probably specify it directly):

void func(void);

asm("a: .long func@plt");

extern int a;

int geta()
{
    return a;
}

gcc -Wall -O2 -c -fpic test.c

yields

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE                     VALUE
0000000000000000 R_X86_64_PLT32           func
0000000000000013 R_X86_64_REX_GOTPCRELX   a-0x0000000000000004

However, if we change asm("a: .long func@plt") to asm("a: .long func@plt - ."),
the generated relocation is very weird and off by 4:

RELOCATION RECORDS FOR [.text]:
OFFSET           TYPE                     VALUE
0000000000000000 R_X86_64_PLT32           func-0x0000000000000004
0000000000000013 R_X86_64_REX_GOTPCRELX   a-0x0000000000000004

Specifically, if we generate a shared library, the generated offset to func@plt
is off by 4 in the second case.

gcc -Wall -O2 -shared -fpic test.c

the first case is correct:
04c0 <func@plt>:
...
05c0 <a>:
 5c0:   00 ff ff ff

[5c0 + ffffff00] = 4C0


whereas the second case is off by 4:
04c0 <func@plt>:
...
05c0 <a>:
 5c0:   fc fe ff ff

[5c0 + fffffefc] = 4BC

It is quite possible that I am missing something here (and the generated code
is correct), but I just want to check whether there is a potential bug in the
compiler.
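
For context, a minimal sketch of how such a stored self-relative offset would
be consumed (my illustration of the intent behind the "func@plt - ." form, not
part of the original report; get_func is a hypothetical name):

void func(void);
asm("a: .long func@plt - .");
extern int a;

void (*get_func(void))(void)
{
    /* After relocation, the 32-bit slot holds (func@plt - &a), so adding
       the slot's own address recovers the address of the PLT stub. */
    return (void (*)(void))((char *)&a + a);
}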

[Bug target/88524] PLT32 relocation is off by 4

2018-12-16 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

--- Comment #1 from Ruslan Nikolaev  ---
BTW, I have just compared the output with clang/llvm 7.0; the generated code
seems to be correct in both cases there.

[Bug target/88524] PLT32 relocation is off by 4

2018-12-16 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

Ruslan Nikolaev  changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|MOVED   |---

--- Comment #3 from Ruslan Nikolaev  ---
(In reply to Andreas Schwab from comment #2)
> If anything, this is a problem in the linker, please report to the binutils
> project.

Why is it binutils? The problem already appears when running with the '-c'
flag (incorrect PLT32 relocation), i.e., before linking a shared library. I
gave that example only to further demonstrate the problem.

[Bug target/88524] PLT32 relocation is off by 4

2018-12-16 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

--- Comment #5 from Ruslan Nikolaev  ---
(In reply to Andrew Pinski from comment #4)
> Still is a binutils (assembler rather than the linker issue):
> .file   "t.c"
> .text
> #APP
> a: .long func@plt - .
> #NO_APP
> .globl  geta
> .type   geta, @function
> geta:
> .LFB0:
> .cfi_startproc
> pushq   %rbp
> .cfi_def_cfa_offset 16
> .cfi_offset 6, -16
> movq    %rsp, %rbp
> .cfi_def_cfa_register 6
> movq    a@GOTPCREL(%rip), %rax
> movl    (%rax), %eax
> popq    %rbp
> .cfi_def_cfa 7, 8
> ret
> .cfi_endproc
> .LFE0:
> .size   geta, .-geta
> .ident  "GCC: (Octeon TX GCC 7 - (Build 116)) 7.3.0"
> .section        .note.GNU-stack,"",@progbits
> 
> 
> What GCC outputs does not have the -4 in there.

Ok, thanks!

[Bug c/85791] New: multiply overflow (128 bit)

2018-05-15 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85791

Bug ID: 85791
   Summary: multiply overflow (128 bit)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

I just noticed that different code is generated when using
__builtin_mul_overflow versus an explicit check:

1. unsigned long long func(unsigned long long a, unsigned long long b)
{
    unsigned long long c;
    if (__builtin_mul_overflow(a, b, &c))
        return 0;
    return c;
}

yields:
func:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        mulq    %rsi
        jo      .L7
        rep ret
.L7:
        xorl    %eax, %eax
        ret

2. unsigned long long func(unsigned long long a, unsigned long long b)
{
    __uint128_t c = (__uint128_t) a * b;
    if (c > (unsigned long long) -1LL) {
        return 0;
    }
    return (unsigned long long) c;
}

yields slightly less efficient code:

func:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        mulq    %rsi
        cmpq    $0, %rdx
        jbe     .L2
        xorl    %eax, %eax
.L2:
        rep ret

3. clang/llvm can generate better (and identical) code in both cases:

func:                           # @func
        .cfi_startproc
# %bb.0:
        xorl    %ecx, %ecx
        movq    %rdi, %rax
        mulq    %rsi
        cmovoq  %rcx, %rax
        retq

[Bug c/85791] multiply overflow (128 bit)

2018-05-15 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85791

--- Comment #1 from Ruslan Nikolaev  ---
The optimization flag is -O2 in all cases.

[Bug target/85791] multiply overflow (128 bit)

2018-05-15 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85791

--- Comment #3 from Ruslan Nikolaev  ---
That is OK; I was talking about the extra 'cmp' instruction in the case where
the check is explicit.

[Bug target/90606] Replace mfence with faster xchg for std::memory_order_seq_cst.

2019-08-20 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90606

Ruslan Nikolaev  changed:

   What|Removed |Added

 CC||nruslan_devel at yahoo dot com

--- Comment #1 from Ruslan Nikolaev  ---
Yes, mfence is twice as slow on my machine as well. BTW, clang generates xchg
for the same code.

[Bug c/91502] New: suboptimal atomic_fetch_sub and atomic_fetch_add

2019-08-20 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91502

Bug ID: 91502
   Summary: suboptimal atomic_fetch_sub and atomic_fetch_add
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

I have not specified the gcc version; the issue seems to apply to any version.

I have noticed that if I write code:

#include <stdatomic.h>

int func(_Atomic(int) *a)
{
    return (atomic_fetch_sub(a, 1) - 1 == 0);
}

gcc generates optimized code (gcc -O2):
func:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        lock subl       $1, (%rdi)
        sete    %al
        ret

But when I change the condition to <= 0, this optimization no longer happens.
Correct me if I am wrong, but I think it should still be able to use sub:

#include <stdatomic.h>

int func(_Atomic(int) *a)
{
    return (atomic_fetch_sub(a, 1) - 1 <= 0);
}

func:
.LFB0:
        .cfi_startproc
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi)
        cmpl    $1, %eax
        setle   %al
        movzbl  %al, %eax
        ret

The same problem seems to exist for atomic_fetch_add as well.
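
For reference, a plausible sub-based sequence (my sketch of the suggested
optimization, not GCC output; 'lock sub' sets the flags according to the new
value, so setle directly yields the signed condition (old - 1) <= 0):

func:
        xorl    %eax, %eax
        lock subl       $1, (%rdi)
        setle   %al
        ret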

[Bug target/91502] suboptimal atomic_fetch_sub and atomic_fetch_add

2019-08-20 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91502

--- Comment #1 from Ruslan Nikolaev  ---
BTW, the same problem occurs for

#include <stdatomic.h>

int func(_Atomic(long) *a)
{
    return (atomic_fetch_sub(a, 1) <= 0);
}

In the previous case clang/llvm was just like gcc, i.e., unable to optimize; in
this case clang/llvm was able to produce better code, but gcc still cannot.

[Bug target/86693] New: inefficient atomic_fetch_xor

2018-07-26 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693

Bug ID: 86693
   Summary: inefficient atomic_fetch_xor
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

(Compiled with -O2 on x86-64)

Consider the following example:

void func1();

void func(unsigned long *counter)
{
    if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) == 1) {
        func1();
    }
}

It is clear that the code can be optimized to a simple 'lock xorq' rather than
a cmpxchg loop, since the xor operation is easily inverted (1^1 = 0), i.e., the
condition can be tested from the flags directly (just as in the similar
fetch_sub and fetch_add cases, which gcc optimizes well); a sketch of such a
sequence follows below.
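
A minimal sketch of the suggested lowering (my illustration, not GCC output;
the .Lcall label is made up):

func:
        lock xorq       $1, (%rdi)
        je      .Lcall          # ZF set iff (old ^ 1) == 0, i.e. old == 1
        rep ret
.Lcall:
        jmp     func1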

However, gcc currently generates a cmpxchg loop:
func:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
.L2:
        movq    %rax, %rcx
        movq    %rax, %rdx
        xorq    $1, %rcx
        lock cmpxchgq   %rcx, (%rdi)
        jne     .L2
        cmpq    $1, %rdx
        je      .L7
        rep ret

Compare this with fetch_sub instead of fetch_xor:
func:
.LFB0:
        .cfi_startproc
        lock subq       $1, (%rdi)
        je      .L4
        rep ret

[Bug target/86693] inefficient atomic_fetch_xor

2018-07-28 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693

--- Comment #2 from Ruslan Nikolaev  ---
The following cases may also be (partially) related:

1.

#include <stdatomic.h>

void func2();

void func(_Atomic(unsigned long) * obj, void * obj2)
{
    if (atomic_fetch_sub(obj, 1) == 1 && obj2)
        func2();
}

generates 'xadd' when 'sub' suffices:
func:
.LFB0:
        .cfi_startproc
        movq    $-1, %rax
        lock xaddq      %rax, (%rdi)
        testq   %rsi, %rsi
        je      .L1
        cmpq    $1, %rax
        je      .L10
.L1:
        rep ret
        .p2align 4,,10
        .p2align 3
.L10:
        xorl    %eax, %eax
        jmp     func2@PLT


2.

#include <stdatomic.h>

int func(_Atomic(unsigned long) * obj, unsigned long a)
{
    return atomic_fetch_add(obj, a) == -a;
}

generates 'xadd' when 'add' suffices:
func:
.LFB0:
        .cfi_startproc
        movq    %rsi, %rax
        lock xaddq      %rax, (%rdi)
        addq    %rsi, %rax
        sete    %al
        movzbl  %al, %eax
        ret
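
For the second case, a plausible add-based sequence (my sketch, not GCC output;
ZF after 'lock add' is set iff old + a == 0, i.e., iff old == -a):

func:
        xorl    %eax, %eax
        lock addq       %rsi, (%rdi)
        sete    %al
        ret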

[Bug target/86693] inefficient atomic_fetch_xor

2018-07-28 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693

--- Comment #3 from Ruslan Nikolaev  ---
(In reply to Jakub Jelinek from comment #1)
> The reason why this works for sub/add is that x86 has xadd instruction, so
> we expand it as xadd and later on during combine find out we are actually
> comparing the result of lock; xadd with something we can optimize better and
> do the optimization.
> For __atomic_fetch_xor (ptr, x, y) == x (or != x), or __atomic_xor_fetch
> (ptr, x, y) == 0 (or != 0), or __atomic_or_fetch (ptr, x, y) == 0 (or != 0),
> we'd need to handle this specially already at expansion time, so with extra
> special optabs, because there is no instruction that keeps the old or new
> value of xor or ior in a register, and once we emit a compare and exchange
> loop, it is very hard to optimize that to something different.

BTW, I do not know exactly how gcc handles this... Is it possible to emit an
artificial 'xxor' instruction which acts like xadd? Then, during optimization,
xxor could be replaced with xor or with a cmpxchg loop depending on the
circumstances.

[Bug c/84431] New: Suboptimal code for masked shifts (x86/x86-64)

2018-02-17 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

Bug ID: 84431
   Summary: Suboptimal code for masked shifts (x86/x86-64)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

On x86 and x86-64, shift instructions use only the low bits of the CL register
and ignore (i.e., mask) the upper bits; it is not possible to shift by more
than (WORD_BITS - 1) positions. Normally, the compiler has to check whether the
specified shift count exceeds the word size before generating the corresponding
shld/shl instructions (shrd/shr, etc.).

Now, if the shift count is given by a variable, it is normally unknown at
compile time whether it exceeds (WORD_BITS - 1), so the compiler has to
generate the corresponding checks. On the other hand, it is very easy to give
the compiler a hint (if it is known that shift < WORD_BITS) by masking the
shift value like this (the example below is for i386; for x86-64 the type
would be __uint128_t and the mask 63):

unsigned long long func(unsigned long long a, unsigned shift)
{
   return a << (shift & 31);
}

In the ideal scenario, the compiler just has to load the value into CL without
even masking it, because the masking is already implied by the shift operation.

Note that clang/LLVM recognizes this pattern (at least for i386) by generating
the following assembly code:
func:                           # @func
        pushl   %esi
        movl    8(%esp), %esi
        movb    16(%esp), %cl
        movl    12(%esp), %edx
        movl    %esi, %eax
        shldl   %cl, %esi, %edx
        shll    %cl, %eax
        popl    %esi
        retl


GCC generates suboptimal code in this case:
func:
        pushl   %esi
        pushl   %ebx
        movl    20(%esp), %ecx
        movl    16(%esp), %esi
        movl    12(%esp), %ebx
        andl    $31, %ecx
        movl    %esi, %edx
        shldl   %ebx, %edx
        movl    %ebx, %eax
        xorl    %ebx, %ebx
        sall    %cl, %eax
        andl    $32, %ecx
        cmovne  %eax, %edx
        cmovne  %ebx, %eax
        popl    %ebx
        popl    %esi
        ret

[Bug c/84431] Suboptimal code for masked shifts (x86/x86-64)

2018-02-19 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

--- Comment #2 from Ruslan Nikolaev  ---
Sebastian, it is

gcc -m32 -Wall -O2 -S test.c

(the same command is used for clang)

[Bug target/84431] Suboptimal code for masked shifts (x86/x86-64)

2018-02-20 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

--- Comment #4 from Ruslan Nikolaev  ---
(In reply to Uroš Bizjak from comment #3)
> Created attachment 43471 [details]
> Prototype patch
> 
> Prototype patch, compiles the testcase to:
> 
> movl    4(%esp), %eax
> movl    12(%esp), %ecx
> movl    8(%esp), %edx
> shldl   %eax, %edx
> sall    %cl, %eax
> ret
> 
> The patch also handles right shifts and cases where mask is less than 31
> bits.

Thanks! I was wondering if the patch also fixes the same thing for x86-64
(i.e., -m64), in which case we would have something like this:

__uint128_t func(__uint128_t a, unsigned shift)
{
   return a << (shift & 63);
}

[Bug c/84522] New: GCC does not generate cmpxchg16b when mcx16 is used

2018-02-22 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

Bug ID: 84522
   Summary: GCC does not generate cmpxchg16b when mcx16 is used
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

I looked up similar bugs, but I could not quite understand why GCC redirects
128-bit cmpxchg on x86-64 to libatomic even when the '-mcx16' flag is
specified, especially because the similar cmpxchg8b on x86 (32-bit) is still
used without redirecting to libatomic.

Bug 80878 mentioned something about read-only memory, but that should only
apply to atomic_load, not atomic_compare_and_exchange. Right?

It is especially annoying because libatomic does not guarantee lock-freedom;
therefore, these functions become useless in many cases. This compiler behavior
is also inconsistent with clang.

For instance, for the following code:

#include <stdatomic.h>

__uint128_t cmpxhg_weak(_Atomic(__uint128_t) * obj, __uint128_t * expected,
                        __uint128_t desired)
{
    return atomic_compare_exchange_weak(obj, expected, desired);
}

GCC generates:

(gcc -std=c11 -mcx16 -Wall -O2 -S test.c)

cmpxhg_weak:
        subq    $8, %rsp
        movl    $5, %r9d
        movl    $5, %r8d
        call    __atomic_compare_exchange_16@PLT
        xorl    %edx, %edx
        movzbl  %al, %eax
        addq    $8, %rsp
        ret

While clang/llvm generates code that is obviously lock-free:
cmpxhg_weak:                    # @cmpxhg_weak
        pushq   %rbx
        movq    %rdx, %r8
        movq    (%rsi), %rax
        movq    8(%rsi), %rdx
        xorl    %r9d, %r9d
        movq    %r8, %rbx
        lock cmpxchg16b (%rdi)
        sete    %cl
        je      .LBB0_2
        movq    %rax, (%rsi)
        movq    %rdx, 8(%rsi)
.LBB0_2:
        movb    %cl, %r9b
        xorl    %edx, %edx
        movq    %r9, %rax
        popq    %rbx
        retq

However, for 32-bit, GCC still generates cmpxchg8b:

#include <stdatomic.h>
#include <stdint.h>

uint64_t cmpxhg_weak(_Atomic(uint64_t) * obj, uint64_t * expected,
                     uint64_t desired)
{
    return atomic_compare_exchange_weak(obj, expected, desired);
}

gcc -std=c11 -m32 -Wall -O2 -S test.c


cmpxhg_weak:
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        movl    20(%esp), %esi
        movl    24(%esp), %ebx
        movl    28(%esp), %ecx
        movl    16(%esp), %edi
        movl    (%esi), %eax
        movl    4(%esi), %edx
        lock cmpxchg8b  (%edi)
        movl    %edx, %ecx
        movl    %eax, %edx
        sete    %al
        je      .L2
        movl    %edx, (%esi)
        movl    %ecx, 4(%esi)
.L2:
        popl    %ebx
        movzbl  %al, %eax
        xorl    %edx, %edx
        popl    %esi
        popl    %edi
        ret

[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used

2018-02-22 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #2 from Ruslan Nikolaev  ---
Yes, but not having atomic_load is far less an issue. Oftentimes, algorithms
that use 128-bit can simply use compare_and_exchange only (at least for
x86-64).

[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used

2018-02-22 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #3 from Ruslan Nikolaev  ---
(In reply to Ruslan Nikolaev from comment #2)
> Yes, but not having atomic_load is far less of an issue. Oftentimes,
> algorithms that use 128-bit atomics can simply use compare_and_exchange only
> (at least on x86-64).

In other words, can atomic_load be redirected to libatomic while
compare_exchange is still generated directly (if -mcx16 is specified)?

[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used

2018-02-22 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #4 from Ruslan Nikolaev  ---
I guess in this case you would have to fall back to a lock-based implementation
for everything. But does C11 even require that atomic_load work on read-only
memory?

[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used

2018-02-22 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #5 from Ruslan Nikolaev  ---
(In reply to Andrew Pinski from comment #1)
> IIRC this was done because there is no atomic load/stores or a way to do
> backwards compatible.

After thinking more about it... Shouldn't this be controlled by some flag
(similar to -mcx16, which enables cmpxchg16b)? The flag would basically say
that 128-bit atomic_load will not work on read-only memory. I think that is
better than unconditionally disabling the lock-free implementation for 128-bit
types in C11 (which is useful in a number of cases) just to accommodate the
rare cases when memory accesses must be read-only. It would also be more
portable and compatible with other compilers such as clang.

[Bug target/84431] Suboptimal code for masked shifts (x86/x86-64)

2018-02-24 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

--- Comment #6 from Ruslan Nikolaev  ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Ruslan Nikolaev from comment #4)
> > Thanks! I was wondering if the patch also fixes the same thing for x86-64
> > (i.e., -m64); in which case we would have something like this:
> > 
> > __uint128_t func(__uint128_t a, unsigned shift)
> > {
> >return a << (shift & 63);
> > }
> 
> Yes, the patch also handles __int128.

Great! Also, another interesting case (with the same idea for -m64 and
__uint128_t) would be this:


gcc -m32 -Wall -O2 -S test.c

unsigned func(unsigned long long a, unsigned shift)
{
    return (unsigned) (a >> (shift & 31));
}

In this case, clang generates just a single 'shrd' instruction.
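
For illustration, such a single-shrd sequence might look like this (my sketch
under i386 cdecl assumptions, not verified clang output):

func:
        movb    12(%esp), %cl
        movl    4(%esp), %eax
        movl    8(%esp), %edx
        shrdl   %cl, %edx, %eax
        ret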

[Bug target/84547] New: Suboptimal code for masked shifts (ARM64)

2018-02-24 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84547

Bug ID: 84547
   Summary: Suboptimal code for masked shifts (ARM64)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

Partially related to Bug 84431 (see the description of the problem there), but
observed on ARM64 instead of x86/x86-64. (Not sure about ARM32.)

Test example:

__uint128_t func(__uint128_t a, unsigned shift)
{
   return a << (shift & 63);
}

aarch64-linux-gnu-gcc-7 -Wall -O2 -S test.c

GCC generates:
func:
        and     w2, w2, 63
        mov     w4, 63
        sub     w5, w4, w2
        lsr     x4, x0, 1
        sub     w3, w2, #64
        lsl     x1, x1, x2
        cmp     w3, 0
        lsr     x4, x4, x5
        orr     x1, x4, x1
        lsl     x4, x0, x3
        lsl     x0, x0, x2
        csel    x1, x4, x1, ge
        csel    x0, x0, xzr, lt
        ret


While clang/llvm generates better code:

func:   // @func
// BB#0:
and w8, w2, #0x3f
lsr x9, x0, #1
eor x11, x8, #0x3f
lsl x10, x1, x8
lsr x9, x9, x11
orr x1, x10, x9
lsl x0, x0, x8
ret


Another interesting case arises when __builtin_unreachable() is used:

__uint128_t func(__uint128_t a, unsigned shift)
{
    if (shift > 63)
        __builtin_unreachable();
    return a << shift;
}

But in this case, neither clang/llvm nor gcc seems to be able to optimize the
code well.

[Bug c/84563] New: GCC interpretation of C11 atomics (DR 459)

2018-02-25 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

Bug ID: 84563
   Summary: GCC interpretation of C11 atomics (DR 459)
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

I know there are related issues already, but I decided to open a new one
because it primarily relates to GCC's interpretation of DR 459 and the response
from C11/WG14.

(I also posted the same on the mailing list.)

I have read multiple bug reports (Bug 84522, Bug 80878, Bug 70490) and the
past decision to change GCC to redirect double-width (128-bit) atomics for
x86-64 and arm64 to libatomic. Below I mention the major concerns as well as
the response from C11 (WG14) regarding DR 459, which most likely triggered this
change in recent GCC releases in the first place.

If I understand correctly, the redirection to libatomic was made for 2 reasons:

1. cmpxchg16b is not available on early amd64 processors. (However, the -mcx16
flag already specifies that you target CPUs that have this instruction, so it
should not be a concern when the flag is specified.)

2. atomic_load on read-only memory. DR 459 now requires 'const' qualifiers for
atomic_load, which probably resulted in the interpretation that read-only
memory must be supported. However, per the response from C11/WG14 (see below),
that does not seem to be the case at all. Therefore, the previously filed Bug
70490 does not seem to be valid.

There are several concerns with current GCC behavior:

1. It is not consistent with clang/llvm, which fully supports double-width
atomics for arm32, arm64, x86 and x86-64, making it possible to write portable
code (without specific extensions or assembly code) across all these
architectures (which is finally possible with C11!).
The behavior of clang: if -mcx16 is specified, cmpxchg16b is generated for
x86-64 (without any calls to libatomic); otherwise it redirects to libatomic.
For arm64, ldaxp/staxp are always generated. In my opinion, this is very
logical and non-confusing.

2. Oftentimes you want strict guarantees (by specifying the -mcx16 flag for
x86-64) that the generated code is lock-free; otherwise it is useless.
Double-width atomics are often used in lock-free algorithms that use tags
(stamps) for pointers to resolve the ABA problem, so it is very useful to have
the corresponding support in the compiler.

3. The behavior is inconsistent even within GCC. The older (and more limited,
less portable, etc.) __sync builtins still use cmpxchg16b directly; the newer
__atomic and C11 builtins do not. Moreover, __sync builtins are probably less
suitable for arm/arm64.

4. atomic_load can be implemented using read-modify-write, as that is the only
option for x86-64 and arm64 (see below; a sketch follows this list).
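
A minimal sketch of point 4 (my illustration, assuming C11 <stdatomic.h>;
load128 is a hypothetical name, and this is not GCC's actual implementation):

__uint128_t load128(_Atomic(__uint128_t) *obj)
{
    __uint128_t expected = 0;
    /* The CAS both reads and writes *obj: on failure it stores the current
       value into 'expected'; on success it rewrites the same value (0).
       This write is exactly why read-only memory cannot be supported. */
    atomic_compare_exchange_strong(obj, &expected, expected);
    return expected;
}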

For these reasons, it may be a good idea for the GCC folks to reconsider the
past decision. And just to clarify: if -mcx16 (x86-64) is not specified during
compilation, it is totally OK to redirect to libatomic and make the final
decision there about whether the target CPU supports the instruction. But if it
is specified, it makes sense, for performance reasons and lock-freedom
guarantees, to always generate the instruction directly.

-- Ruslan

Response from the WG14 (C11) Convener regarding DR 459: (I asked for
permission to publish this response here.)
-
Ruslan,

 Thank you for your comments.  There is no normative requirement that const
objects be suitable for read-only memory.  An example and a footnote refer to
read-only memory as a way to illustrate a point, but examples and footnotes are
not normative.  The actual nature of read-only memory and how it can be used
are outside the scope of the standard, so there is nothing to prevent
atomic_load from being implemented as a read-modify-write operation.

David
--


My original email:

--

Dear David Keaton,

After reviewing the proposed change DR 459 for C11
(http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_459),
I identified that adding the const qualifier to atomic_load (C11 implements it
without one) may actually be harmful in some cases.

Particularly, for the double-width (128-bit) atomics found on x86-64
(cmpxchg16b instruction) and arm64 (ldaxp/staxp instructions), it is currently
only possible to implement a 128-bit atomic_load using the corresponding
read-modify-write instructions (i.e., potentially rewriting memory with the
same value, but, in essence, not changing it). These implementations will not
work on read-only memory. Similar concerns apply, to some extent, to x86 and
arm32 for double-width (64-bit) atomics. Otherwise, there is no obstacle to
implementing all C11 atomics for the corresponding types on these
architectures. Moreover, the well-known clang/llvm compiler already impl

[Bug target/70490] __atomic_load_n(const __int128 *, ...) generates CMPXCHG16B with no warning

2018-02-25 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490

Ruslan Nikolaev  changed:

   What|Removed |Added

 CC||nruslan_devel at yahoo dot com

--- Comment #6 from Ruslan Nikolaev  ---
See also Bug 84563

[Bug c/84563] GCC interpretation of C11 atomics (DR 459)

2018-02-26 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

--- Comment #1 from Ruslan Nikolaev  ---
See also discussion in the gcc mailing list

[Bug c/84563] GCC interpretation of C11 atomics (DR 459)

2018-02-26 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

--- Comment #2 from Ruslan Nikolaev  ---
Summary (from the mailing list):

Pros of the proposed approach:
1. The ability to use guaranteed lock-free double-width atomics (when mcx16 is
specified for x86-64, and always for arm64) in a more or less portable manner
across the supported architectures (without resorting to non-standard
extensions or writing separate assembly code for each architecture). Hopefully,
the behavior may also become more or less consistent across different compilers
over time; it is already the case for clang/llvm. As mentioned, double-width
lock-free atomics have real practical use (ABA tags for pointers).

2. More likely to find a bug immediately if a programmer tries to do something
that is not guaranteed by the standard (i.e., getting a segfault on read-only
memory when using a double-width atomic_load). This is true even if mcx16 is
not used, as most CPUs have cmpxchg16b, and libatomic will use it. On the other
hand, atomic_load implemented through locks may have hard-to-find-and-debug
issues in signal handlers, interrupt contexts, etc., when a programmer
erroneously assumes that atomic_load is non-blocking.

3. For arm64, the corresponding instructions are always available, so there is
no need for the mcx16 flag or redirection to libatomic at all (libatomic may
still keep the old implementation for backward compatibility).

4. Faster and easier-to-analyze code when mcx16 is specified.

5. The ability to tell for sure whether the implementation is lock-free by
checking the corresponding C11 flag when mcx16 is specified. When it is not
specified, the flag will be false to accommodate the worst-case scenario.

6. Consistent behavior everywhere on all platforms regardless of IFUNC, the
mcx16 flag, etc. If cmpxchg16b is available, it is always used (platforms that
do not support IFUNC will use function pointers for redirection). The only
thing the mcx16 flag changes is removing the indirection to libatomic and
giving a guaranteed lock_free flag for the corresponding types. (BTW, in
practice, if you use the flag, you should already know what you are doing.)

7. The ability to finally deprecate the old __sync builtins and use the new
and more advanced __atomic builtins everywhere.


Cons of the proposed approach:

1. The compiler may place const atomic objects in .rodata. (Avoided by making
sure _Atomic objects with size > 8 are not placed in .rodata, plus clarifying
that casting arbitrary .rodata objects to double-width atomics is undefined and
not allowed.)

2. Backward-compatibility concerns if used outside glibc/IFUNC. Most likely
not an issue even in this case, since all such calls are already redirected to
libatomic anyway, and statically linked binaries will not interact with new
binaries directly.

3. Read-only memory for atomic_load will not be supported for double-width
types. But that is actually better than sweeping the problem under the carpet
(the current behavior is actually even worse because it is inconsistent across
platforms, i.e., different for x86-64 on Linux and for arm64). Anyway, it is
better to use a lock-based approach explicitly if it is preferable for whatever
reason (read-only memory, performance(?), etc.).

[Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0

2018-04-05 Thread nruslan_devel at yahoo dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #18 from Ruslan Nikolaev  ---
(In reply to andysem from comment #17)
> I'll clarify why I think load() should be allowed to issue writes on the
> memory. According to [atomics.types.operations]/18 in N4713,
> compare_exchange_*() is a load operation if the comparison fails, yet we
> know cmpxchg (even the ones more narrow than cmpxchg16b) always writes, so
> we must assume a load operation may write. I do not find a definition of a
> "load operation" in the standard and [atomics.types.operations]/12 and 13
> avoid this term, saying that load() "Atomically returns the value pointed to
> by this." Again, it doesn't say anything about writes to the memory.
> 
> So, if compare_exchange_*() is allowed to write on failure, why load()
> shouldn't be? Either compare_exchange_*() issuing writes is a bug (in which
> case a lock-free CAS can't be implemented on x86 at all) or writes in load()
> should be allowed and the change wrt. cmpxchg16b should be reverted.

I think there is way too much overthinking about the read-only case for
128-bit atomics. The current solution is very confusing and not very well
documented, at the very least. Correct me if I am wrong, but does the current
solution guarantee address-freedom? If not, what is the motivation to support
128-bit read-only atomics? The only practical use case seems to be IPC where
one process has read-only access. If that is not guaranteed for 128-bit, why
even bother to support the read-only case, which is (a) not guaranteed to be
lock-free and (b) works only within a single process, where it is easy to
control read-only behavior?

I really prefer the way it was implemented in clang: it only redirects to
libatomic if -mcx16 is not specified. BTW, clang also provides a very nice
implementation for AArch64, which GCC lacks.

[Bug c++/107958] New: Ambiguity with uniform initialization in overloaded operator and explicit constructor

2022-12-03 Thread nruslan_devel at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107958

Bug ID: 107958
   Summary: Ambiguity with uniform initialization in overloaded
operator and explicit constructor
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

Suppose we have the following example with a class that has to keep two
pointers. (I actually encountered this error with a more complex example, but
this one is just for illustration.) The problem arises when I attempt to use
the assignment operator and curly braces.

If I understand correctly, two possibilities exist when passing curly braces:

1. Use the overloaded operator= (implicitly converting the curly braces to a
pair). In this particular example, we could probably have used make_pair, but I
deliberately used curly braces to show how this error is triggered.

2. Use the constructor to create a new PairPtr instance and then copy it to the
old object through operator=.

Both clang and gcc complain unless I mark the corresponding constructor as
'explicit'. To avoid the ambiguity with the second case, I mark the constructor
as 'explicit' and expect the overloaded operator= to be used. That works with
clang/llvm but not with gcc (see the error below).

#include <utility>

struct PairPtr {

    PairPtr() {}

    PairPtr(const PairPtr &s) {
        a = s.a;
        b = s.b;
    }

    explicit PairPtr(int *_a, int *_b) {
        a = _a;
        b = _b;
    }

    PairPtr& operator=(const PairPtr &s) {
        a = s.a;
        b = s.b;
        return *this;
    }

    PairPtr& operator=(const std::pair<int*, int*>& pair) {
        a = pair.first;
        b = pair.second;
        return *this;
    }

    int *a;
    int *b;
};

void func(int *a, int *b)
{
    PairPtr p;

    p = { a, b };
}


Error (note that clang/llvm compiles the above code successfully; marking the
constructor 'explicit' fixes the problem for clang/llvm but not for gcc):

2.cpp: In function ‘void func(int*, int*)’:
2.cpp:38:20: error: ambiguous overload for ‘operator=’ (operand types are
‘PairPtr’ and ‘<brace-enclosed initializer list>’)
   38 |     p = { a, b };
      |                ^
2.cpp:18:18: note: candidate: ‘PairPtr& PairPtr::operator=(const PairPtr&)’
   18 |     PairPtr& operator=(const PairPtr &s) {
      |              ^~~~~~~~
2.cpp:24:18: note: candidate: ‘PairPtr& PairPtr::operator=(const
std::pair<int*, int*>&)’
   24 |     PairPtr& operator=(const std::pair<int*, int*>& pair) {
      |              ^~~~~~~~

[Bug c++/107958] Ambiguity with uniform initialization in overloaded operator and explicit constructor

2022-12-03 Thread nruslan_devel at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107958

--- Comment #9 from Ruslan Nikolaev  ---
Interestingly, if I change the code a little bit and have a pair in the
constructor rather than two arguments, gcc seems to compile the code:

#include <utility>

struct PairPtr {

    PairPtr() {}

    PairPtr(const PairPtr &s) {
        a = s.a;
        b = s.b;
    }

    explicit PairPtr(const std::pair<int*, int*>& pair) {
        a = pair.first;
        b = pair.second;
    }

    PairPtr& operator=(const PairPtr &s) {
        a = s.a;
        b = s.b;
        return *this;
    }

    PairPtr& operator=(const std::pair<int*, int*>& pair) {
        a = pair.first;
        b = pair.second;
        return *this;
    }

private:
    int *a;
    int *b;
};

void func(int *a, int *b)
{
    PairPtr p({a, b}); // works

    p = { a, b };      // also works
}

[Bug c++/107958] Ambiguity with uniform initialization in overloaded operator and explicit constructor

2022-12-03 Thread nruslan_devel at yahoo dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107958

--- Comment #10 from Ruslan Nikolaev  ---
The latter example seems to work well for both gcc and clang. The behavior is
also consistent for both explicit and implicit constructors.

Thank you for clarifying that it was not a bug!