[Bug target/88524] New: PLT32 relocation is off by 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

            Bug ID: 88524
           Summary: PLT32 relocation is off by 4
           Product: gcc
           Version: 7.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

Consider the following example for code compiled with -fpic -mcmodel=small.
There is an external function func() for which we store a relative reference
to the corresponding @plt stub in a 32-bit variable.

The following seems to generate correct offsets (@plt is already relative, so
we can probably specify it directly):

void func(void);
asm("a: .long func@plt");
extern int a;
int geta() { return a; }

gcc -Wall -O2 -c -fpic test.c

yields:

RELOCATION RECORDS FOR [.text]:
OFFSET   TYPE                    VALUE
0000     R_X86_64_PLT32          func
0013     R_X86_64_REX_GOTPCRELX  a-0x0004

However, if we change asm("a: .long func@plt") to asm("a: .long func@plt - .")
the generated code is very weird and is off by 4:

RELOCATION RECORDS FOR [.text]:
OFFSET   TYPE                    VALUE
0000     R_X86_64_PLT32          func-0x0004
0013     R_X86_64_REX_GOTPCRELX  a-0x0004

Specifically, if we generate a shared library, the generated offset to
func@plt is off by 4 in the second case.

gcc -Wall -O2 -shared -fpic test.c

The first case is correct:

04c0 <func@plt>:
...
05c0 <a>:
 5c0:   00 ff
 5c2:   ff
 5c3:   ff

[5c0 + ff00] = 4C0

whereas the second case is off by 4:

04c0 <func@plt>:
...
05c0 <a>:
 5c0:   fc
 5c1:   fe
 5c2:   ff
 5c3:   ff

[5c0 + fefc] = 4BC

It is quite possible that I am missing something here (and the generated code
is correct), but I just want to check whether there is any potential bug in
the compiler.
[Bug target/88524] PLT32 relocation is off by 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

--- Comment #1 from Ruslan Nikolaev ---
Btw, I have just compared the output with clang/llvm 7.0; the generated code
seems to be correct in both cases there.
[Bug target/88524] PLT32 relocation is off by 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

Ruslan Nikolaev changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|RESOLVED                    |UNCONFIRMED
         Resolution|MOVED                       |---

--- Comment #3 from Ruslan Nikolaev ---
(In reply to Andreas Schwab from comment #2)
> If anything, this is a problem in the linker, please report to the binutils
> project.

Why is it binutils? The problem already appears when running with the '-c'
flag (incorrect PLT32 relocation), i.e. without yet linking a shared library.
I just gave that example to further demonstrate the problem.
[Bug target/88524] PLT32 relocation is off by 4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88524

--- Comment #5 from Ruslan Nikolaev ---
(In reply to Andrew Pinski from comment #4)
> Still is a binutils (assembler rather than the linker issue):
>         .file   "t.c"
>         .text
> #APP
> a: .long func@plt - 4
> #NO_APP
>         .globl  geta
>         .type   geta, @function
> geta:
> .LFB0:
>         .cfi_startproc
>         pushq   %rbp
>         .cfi_def_cfa_offset 16
>         .cfi_offset 6, -16
>         movq    %rsp, %rbp
>         .cfi_def_cfa_register 6
>         movq    a@GOTPCREL(%rip), %rax
>         movl    (%rax), %eax
>         popq    %rbp
>         .cfi_def_cfa 7, 8
>         ret
>         .cfi_endproc
> .LFE0:
>         .size   geta, .-geta
>         .ident  "GCC: (Octeon TX GCC 7 - (Build 116)) 7.3.0"
>         .section        .note.GNU-stack,"",@progbits
>
>
> What GCC outputs does not have the -4 in there.

Ok, thanks!
[Bug c/85791] New: multiply overflow (128 bit)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85791

            Bug ID: 85791
           Summary: multiply overflow (128 bit)
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

Just noticed that code is generated differently when using
__builtin_mul_overflow and an explicit check:

1.

unsigned long long func(unsigned long long a, unsigned long long b)
{
    unsigned long long c;
    if (__builtin_mul_overflow(a, b, &c))
        return 0;
    return c;
}

yields:

func:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        mulq    %rsi
        jo      .L7
        rep ret
.L7:
        xorl    %eax, %eax
        ret

2.

unsigned long long func(unsigned long long a, unsigned long long b)
{
    __uint128_t c = (__uint128_t) a * b;
    if (c > (unsigned long long) -1LL) {
        return 0;
    }
    return (unsigned long long) c;
}

yields slightly less efficient code:

func:
.LFB0:
        .cfi_startproc
        movq    %rdi, %rax
        mulq    %rsi
        cmpq    $0, %rdx
        jbe     .L2
        xorl    %eax, %eax
.L2:
        rep ret

3. clang/llvm can generate better code (identical) in both cases:

func:                                   # @func
        .cfi_startproc
# %bb.0:
        xorl    %ecx, %ecx
        movq    %rdi, %rax
        mulq    %rsi
        cmovoq  %rcx, %rax
        retq
[Bug c/85791] multiply overflow (128 bit)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85791

--- Comment #1 from Ruslan Nikolaev ---
The optimization flag is -O2 in all cases.
[Bug target/85791] multiply overflow (128 bit)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85791

--- Comment #3 from Ruslan Nikolaev ---
That is OK; I was talking about the extra 'cmp' instruction in the case where
the check is explicit.
[Bug target/90606] Replace mfence with faster xchg for std::memory_order_seq_cst.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90606

Ruslan Nikolaev changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nruslan_devel at yahoo dot com

--- Comment #1 from Ruslan Nikolaev ---
Yes, mfence is twice as slow on my machine as well. Btw, clang generates xchg
for the same code.
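For reference, a minimal sketch of the kind of code in question (assuming a
plain seq_cst store; this is not the exact test case from this report):

#include <stdatomic.h>

/* A plain sequentially consistent store: GCC emits a mov followed by mfence
   here, while clang emits xchg, which is the faster sequence measured above. */
void set_flag(_Atomic int *p)
{
    atomic_store_explicit(p, 1, memory_order_seq_cst);
}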
[Bug c/91502] New: suboptimal atomic_fetch_sub and atomic_fetch_add
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91502

            Bug ID: 91502
           Summary: suboptimal atomic_fetch_sub and atomic_fetch_add
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

I have not specified the gcc version; it seems to apply to any version.

I have noticed that if I write the following code:

#include <stdatomic.h>

int func(_Atomic(int) *a)
{
    return (atomic_fetch_sub(a, 1) - 1 == 0);
}

gcc generates optimized code (gcc -O2):

func:
.LFB0:
        .cfi_startproc
        xorl    %eax, %eax
        lock subl       $1, (%rdi)
        sete    %al
        ret

But when I change the condition to <= 0, it does not work. Correct me if I am
wrong, but I think it should still be able to use 'sub':

#include <stdatomic.h>

int func(_Atomic(int) *a)
{
    return (atomic_fetch_sub(a, 1) - 1 <= 0);
}

func:
.LFB0:
        .cfi_startproc
        movl    $-1, %eax
        lock xaddl      %eax, (%rdi)
        cmpl    $1, %eax
        setle   %al
        movzbl  %al, %eax
        ret

It seems like the same problem exists for atomic_fetch_add as well.
[Bug target/91502] suboptimal atomic_fetch_sub and atomic_fetch_add
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91502

--- Comment #1 from Ruslan Nikolaev ---
Btw, the same problem occurs for:

#include <stdatomic.h>

int func(_Atomic(long) *a)
{
    return (atomic_fetch_sub(a, 1) <= 0);
}

In the previous case clang/llvm was just like gcc, i.e., unable to optimize;
in this case clang/llvm was able to produce better code, but gcc still cannot.
[Bug target/86693] New: inefficient atomic_fetch_xor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693

            Bug ID: 86693
           Summary: inefficient atomic_fetch_xor
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

(Compiled with -O2 on x86-64)

Consider the following example:

void func1();

void func(unsigned long *counter)
{
    if (__atomic_fetch_xor(counter, 1, __ATOMIC_ACQ_REL) == 1) {
        func1();
    }
}

It is clear that the code can be optimized to simply do 'lock xorq' rather
than a cmpxchg loop, since the xor operation can easily be inverted (1^1 = 0),
i.e. the result can be tested from flags directly (just like the similar
cases with fetch_sub and fetch_add, which gcc optimizes well). However, gcc
currently generates a cmpxchg loop:

func:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %rax
.L2:
        movq    %rax, %rcx
        movq    %rax, %rdx
        xorq    $1, %rcx
        lock cmpxchgq   %rcx, (%rdi)
        jne     .L2
        cmpq    $1, %rdx
        je      .L7
        rep ret

Compare this with fetch_sub instead of fetch_xor:

func:
.LFB0:
        .cfi_startproc
        lock subq       $1, (%rdi)
        je      .L4
        rep ret
[Bug target/86693] inefficient atomic_fetch_xor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693

--- Comment #2 from Ruslan Nikolaev ---
Also possibly (partially) related are the following cases:

1.

#include <stdatomic.h>

void func2();

void func(_Atomic(unsigned long) * obj, void * obj2)
{
    if (atomic_fetch_sub(obj, 1) == 1 && obj2)
        func2();
}

generates 'xadd' when 'sub' suffices:

func:
.LFB0:
        .cfi_startproc
        movq    $-1, %rax
        lock xaddq      %rax, (%rdi)
        testq   %rsi, %rsi
        je      .L1
        cmpq    $1, %rax
        je      .L10
.L1:
        rep ret
        .p2align 4,,10
        .p2align 3
.L10:
        xorl    %eax, %eax
        jmp     func2@PLT

2.

#include <stdatomic.h>

int func(_Atomic(unsigned long) * obj, unsigned long a)
{
    return atomic_fetch_add(obj, a) == -a;
}

generates 'xadd' when 'add' suffices:

func:
.LFB0:
        .cfi_startproc
        movq    %rsi, %rax
        lock xaddq      %rax, (%rdi)
        addq    %rsi, %rax
        sete    %al
        movzbl  %al, %eax
        ret
[Bug target/86693] inefficient atomic_fetch_xor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86693

--- Comment #3 from Ruslan Nikolaev ---
(In reply to Jakub Jelinek from comment #1)
> The reason why this works for sub/add is that x86 has xadd instruction, so
> we expand it as xadd and later on during combine find out we are actually
> comparing the result of lock; xadd with something we can optimize better and
> do the optimization.
> For __atomic_fetch_xor (ptr, x, y) == x (or != x), or __atomic_xor_fetch
> (ptr, x, y) == 0 (or != 0), or __atomic_or_fetch (ptr, x, y) == 0 (or != 0),
> we'd need to handle this specially already at expansion time, so with extra
> special optabs, because there is no instruction that keeps the old or new
> value of xor or ior in a register, and once we emit a compare and exchange
> loop, it is very hard to optimize that to something different.

Btw, I do not know exactly how gcc handles it... Is it possible to emit an
artificial 'xxor' instruction which acts like xadd? Then, during optimization,
xxor could be replaced with xor or with a cmpxchg loop depending on the
circumstances.
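For reference, a minimal sketch of the equivalent __atomic_xor_fetch form
mentioned in the quoted comment (my own example, not code from this report);
the == 0 test on the new value is what a flags-based 'lock xor' expansion
could key on:

void func1(void);

/* Since 1 ^ 1 == 0, __atomic_fetch_xor(p, 1, m) == 1 is equivalent to
   __atomic_xor_fetch(p, 1, m) == 0, i.e. to testing ZF after a lock xor. */
void func_alt(unsigned long *counter)
{
    if (__atomic_xor_fetch(counter, 1, __ATOMIC_ACQ_REL) == 0)
        func1();
}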
[Bug c/84431] New: Suboptimal code for masked shifts (x86/x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

            Bug ID: 84431
           Summary: Suboptimal code for masked shifts (x86/x86-64)
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

In x86 and x86-64, the assumption is that the upper bits of the CL register
are unused (i.e., masked) when doing a shift operation. It is not possible to
shift by more than (WORD_BITS - 1) positions.

Normally, the compiler has to check whether the specified shift value exceeds
the word size before generating the corresponding shld/shl instructions
(shrd/shr, etc.). Now, if the shift value is given by some variable, it is
normally unknown at compile time whether it exceeds (WORD_BITS - 1), so the
compiler has to generate corresponding checks. On the other hand, it is very
easy to give a hint to the compiler (if it is known that shift < WORD_BITS)
by masking the shift value like this (the example below is for i386; for
x86-64 the type would be __uint128_t and the mask 63):

unsigned long long func(unsigned long long a, unsigned shift)
{
    return a << (shift & 31);
}

In the ideal scenario, the compiler just has to load the value into CL
without even masking it, because the masking is already implied by the shift
operation. Note that clang/LLVM recognizes this pattern (at least for i386)
and generates the following assembly code:

func:                                   # @func
        pushl   %esi
        movl    8(%esp), %esi
        movb    16(%esp), %cl
        movl    12(%esp), %edx
        movl    %esi, %eax
        shldl   %cl, %esi, %edx
        shll    %cl, %eax
        popl    %esi
        retl

GCC generates suboptimal code in this case:

func:
        pushl   %esi
        pushl   %ebx
        movl    20(%esp), %ecx
        movl    16(%esp), %esi
        movl    12(%esp), %ebx
        andl    $31, %ecx
        movl    %esi, %edx
        shldl   %ebx, %edx
        movl    %ebx, %eax
        xorl    %ebx, %ebx
        sall    %cl, %eax
        andl    $32, %ecx
        cmovne  %eax, %edx
        cmovne  %ebx, %eax
        popl    %ebx
        popl    %esi
        ret
[Bug c/84431] Suboptimal code for masked shifts (x86/x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

--- Comment #2 from Ruslan Nikolaev ---
Sebastian, it is: gcc -m32 -Wall -O2 -S test.c (the same for clang).
[Bug target/84431] Suboptimal code for masked shifts (x86/x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

--- Comment #4 from Ruslan Nikolaev ---
(In reply to Uroš Bizjak from comment #3)
> Created attachment 43471 [details]
> Prototype patch
>
> Prototype patch, compiles the testcase to:
>
>         movl    4(%esp), %eax
>         movl    12(%esp), %ecx
>         movl    8(%esp), %edx
>         shldl   %eax, %edx
>         sall    %cl, %eax
>         ret
>
> The patch also handles right shifts and cases where mask is less than 31
> bits.

Thanks! I was wondering if the patch also fixes the same thing for x86-64
(i.e., -m64), in which case we would have something like this:

__uint128_t func(__uint128_t a, unsigned shift)
{
    return a << (shift & 63);
}
[Bug c/84522] New: GCC does not generate cmpxchg16b when mcx16 is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

            Bug ID: 84522
           Summary: GCC does not generate cmpxchg16b when mcx16 is used
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

I looked up similar bugs, but I could not quite understand why GCC redirects
to libatomic for 128-bit cmpxchg on x86-64 even when the '-mcx16' flag is
specified, especially because the similar cmpxchg8b for x86 (32-bit) is still
used without redirecting to libatomic. Bug 80878 mentioned something about
read-only memory, but that should only apply to atomic_load, not
atomic_compare_and_exchange. Right?

It is especially annoying because libatomic does not guarantee lock-freedom;
therefore, these functions become useless in many cases. This compiler
behavior is also inconsistent with clang. For instance, for the following
code:

#include <stdatomic.h>

__uint128_t cmpxhg_weak(_Atomic(__uint128_t) * obj, __uint128_t * expected,
                        __uint128_t desired)
{
    return atomic_compare_exchange_weak(obj, expected, desired);
}

GCC generates (gcc -std=c11 -mcx16 -Wall -O2 -S test.c):

cmpxhg_weak:
        subq    $8, %rsp
        movl    $5, %r9d
        movl    $5, %r8d
        call    __atomic_compare_exchange_16@PLT
        xorl    %edx, %edx
        movzbl  %al, %eax
        addq    $8, %rsp
        ret

while clang/llvm generates code which is obviously lock-free:

cmpxhg_weak:                            # @cmpxhg_weak
        pushq   %rbx
        movq    %rdx, %r8
        movq    (%rsi), %rax
        movq    8(%rsi), %rdx
        xorl    %r9d, %r9d
        movq    %r8, %rbx
        lock cmpxchg16b (%rdi)
        sete    %cl
        je      .LBB0_2
        movq    %rax, (%rsi)
        movq    %rdx, 8(%rsi)
.LBB0_2:
        movb    %cl, %r9b
        xorl    %edx, %edx
        movq    %r9, %rax
        popq    %rbx
        retq

However, for 32-bit, GCC still generates cmpxchg8b:

#include <stdatomic.h>
#include <stdint.h>

uint64_t cmpxhg_weak(_Atomic(uint64_t) * obj, uint64_t * expected,
                     uint64_t desired)
{
    return atomic_compare_exchange_weak(obj, expected, desired);
}

gcc -std=c11 -m32 -Wall -O2 -S test.c

cmpxhg_weak:
        pushl   %edi
        pushl   %esi
        pushl   %ebx
        movl    20(%esp), %esi
        movl    24(%esp), %ebx
        movl    28(%esp), %ecx
        movl    16(%esp), %edi
        movl    (%esi), %eax
        movl    4(%esi), %edx
        lock cmpxchg8b  (%edi)
        movl    %edx, %ecx
        movl    %eax, %edx
        sete    %al
        je      .L2
        movl    %edx, (%esi)
        movl    %ecx, 4(%esi)
.L2:
        popl    %ebx
        movzbl  %al, %eax
        xorl    %edx, %edx
        popl    %esi
        popl    %edi
        ret
[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #2 from Ruslan Nikolaev ---
Yes, but not having atomic_load is far less of an issue. Oftentimes,
algorithms that use 128-bit atomics can simply use compare_and_exchange only
(at least for x86-64).
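As an illustration of the compare_and_exchange-only approach (a sketch with a
hypothetical helper name, not code from this report), a 128-bit fetch-and-add
that needs no separate atomic load, since the first failed CAS fills in the
current value:

#include <stdatomic.h>

/* Hypothetical 128-bit fetch-and-add built purely from compare-and-exchange.
   It is lock-free only if the compare-exchange is inlined as cmpxchg16b. */
static __uint128_t fetch_add_128(_Atomic(__uint128_t) *obj, __uint128_t val)
{
    __uint128_t old = 0;
    while (!atomic_compare_exchange_weak(obj, &old, old + val))
        ;  /* on failure, 'old' is updated with the observed value */
    return old;
}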
[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #3 from Ruslan Nikolaev ---
(In reply to Ruslan Nikolaev from comment #2)
> Yes, but not having atomic_load is far less of an issue. Oftentimes,
> algorithms that use 128-bit atomics can simply use compare_and_exchange only
> (at least for x86-64).

In other words, can atomic_load be redirected to libatomic while
compare_exchange is still generated directly (if -mcx16 is specified)?
[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #4 from Ruslan Nikolaev ---
I guess in this case you would have to fall back to a lock-based
implementation for everything. But does C11 even require that atomic_load
work on read-only memory?
[Bug target/84522] GCC does not generate cmpxchg16b when mcx16 is used
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84522

--- Comment #5 from Ruslan Nikolaev ---
(In reply to Andrew Pinski from comment #1)
> IIRC this was done because there is no atomic load/stores or a way to do
> backwards compatible.

After thinking more about it... Should it not be controlled by some flag
(similar to -mcx16, which enables cmpxchg16b)? This flag could basically say
that atomic_load on 128-bit types will not work on read-only memory. I think
this is better than unconditionally disabling the lock-free implementation
for 128-bit types in C11 (which is useful in a number of cases) just to
accommodate some rare cases where memory accesses must be read-only. That
would also be more portable and compatible with other compilers such as
clang.
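For illustration, a sketch of the read-modify-write style of 128-bit load
being discussed (my own example, not GCC's or libatomic's actual
implementation); it shows why such a load cannot work on read-only memory:

/* Hypothetical 128-bit "load" expressed through compare-and-exchange: on
   success the location is rewritten with the value it already held, and on
   failure 'expected' is updated with the current contents.  With cmpxchg16b
   the location is written either way, so this faults on read-only memory. */
__uint128_t load_128(__uint128_t *p)
{
    __uint128_t expected = 0;
    __atomic_compare_exchange_n(p, &expected, expected, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
}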
[Bug target/84431] Suboptimal code for masked shifts (x86/x86-64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84431

--- Comment #6 from Ruslan Nikolaev ---
(In reply to Uroš Bizjak from comment #5)
> (In reply to Ruslan Nikolaev from comment #4)
> > Thanks! I was wondering if the patch also fixes the same thing for x86-64
> > (i.e., -m64); in which case we would have something like this:
> >
> > __uint128_t func(__uint128_t a, unsigned shift)
> > {
> >    return a << (shift & 63);
> > }
>
> Yes, the patch also handles __int128.

Great! Also, another interesting case (with the same idea for -m64 and
__uint128_t) would be this:

gcc -m32 -Wall -O2 -S test.c

unsigned func(unsigned long long a, unsigned shift)
{
    return (unsigned) (a >> (shift & 31));
}

In this case, clang generates just a single 'shrd' instruction.
[Bug target/84547] New: Suboptimal code for masked shifts (ARM64)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84547

            Bug ID: 84547
           Summary: Suboptimal code for masked shifts (ARM64)
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

Partially related to Bug 84431 (see the description of the problem there),
but observed on ARM64 instead of x86/x86-64. (Not sure about ARM32.)

Test example:

__uint128_t func(__uint128_t a, unsigned shift)
{
    return a << (shift & 63);
}

aarch64-linux-gnu-gcc-7 -Wall -O2 -S test.c

GCC generates:

func:
        and     w2, w2, 63
        mov     w4, 63
        sub     w5, w4, w2
        lsr     x4, x0, 1
        sub     w3, w2, #64
        lsl     x1, x1, x2
        cmp     w3, 0
        lsr     x4, x4, x5
        orr     x1, x4, x1
        lsl     x4, x0, x3
        lsl     x0, x0, x2
        csel    x1, x4, x1, ge
        csel    x0, x0, xzr, lt
        ret

while clang/llvm generates better code:

func:                                   // @func
// BB#0:
        and     w8, w2, #0x3f
        lsr     x9, x0, #1
        eor     x11, x8, #0x3f
        lsl     x10, x1, x8
        lsr     x9, x9, x11
        orr     x1, x10, x9
        lsl     x0, x0, x8
        ret

Another interesting case is when __builtin_unreachable() is used:

__uint128_t func(__uint128_t a, unsigned shift)
{
    if (shift > 63)
        __builtin_unreachable();
    return a << shift;
}

But in this case, neither clang/llvm nor gcc seems to be able to optimize the
code well.
[Bug c/84563] New: GCC interpretation of C11 atomics (DR 459)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

            Bug ID: 84563
           Summary: GCC interpretation of C11 atomics (DR 459)
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

I know there are related issues already, but I decided to open a new one
because it primarily relates to the interpretation of DR 459 by GCC and the
response from C11/WG14. (I also posted the same on the mailing list.)

I have read multiple bug reports (Bug 84522, Bug 80878, Bug 70490) and a past
decision regarding the GCC change to redirect double-width (128-bit) atomics
for x86-64 and arm64 to libatomic. Below I mention the major concerns as well
as the response from C11 (WG14) regarding DR 459 which, most likely,
triggered this change in more recent GCC releases in the first place.

If I understand correctly, the redirection to libatomic was made for 2
reasons:

1. cmpxchg16b is not available on early amd64 processors. (However, the
-mcx16 flag already specifies that you target CPUs that have this
instruction, so it should not be a concern when the flag is specified.)

2. atomic_load on read-only memory. DR 459 now requires 'const' qualifiers
for atomic_load, which probably resulted in the interpretation that read-only
memory must be supported. However, per the response from C11/WG14 (see
below), that does not seem to be the case at all. Therefore, the previously
filed Bug 70490 does not seem to be valid.

There are several concerns with the current GCC behavior:

1. It is not consistent with clang/llvm, which completely supports
double-width atomics for arm32, arm64, x86 and x86-64, making it possible to
write portable code (w/o specific extensions or assembly code) across all
these architectures (which is finally possible with C11!). The behavior of
clang: if -mcx16 is specified, cmpxchg16b is generated for x86-64 (without
any calls to libatomic), otherwise -- redirection to libatomic. For arm64,
ldaxp/staxp are always generated. In my opinion, this is very logical and
non-confusing.

2. Oftentimes you want to have strict guarantees (by specifying the -mcx16
flag for x86-64) that the generated code is lock-free, otherwise it is
useless. Double-width atomics are often used in lock-free algorithms that use
tags (stamps) for pointers to resolve the ABA problem. So, it is very useful
to have corresponding support in the compiler.

3. The behavior is inconsistent even within GCC. Older (and more limited,
less portable, etc.) __sync builtins still use cmpxchg16b directly; newer
__atomic and C11 do not. Moreover, __sync builtins are probably less suitable
for arm/arm64.

4. atomic_load can be implemented using read-modify-write, as it is the only
option for x86-64 and arm64 (see below).

For these reasons, it may be a good idea if GCC folks reconsider the past
decision. And just to clarify: if -mcx16 (x86-64) is not specified during
compilation, it is totally OK to redirect to libatomic and there make the
final decision whether the target CPU supports a given instruction or not.
But if it is specified, it makes sense, for performance reasons and
lock-freedom guarantees, to always generate it directly.

-- Ruslan

Response from the WG14 (C11) Convener regarding DR 459:
(I asked for permission to publish this response here.)
------------------------------------------------------------------
Ruslan,

Thank you for your comments. There is no normative requirement that const
objects be suitable for read-only memory. An example and a footnote refer to
read-only memory as a way to illustrate a point, but examples and footnotes
are not normative. The actual nature of read-only memory and how it can be
used are outside the scope of the standard, so there is nothing to prevent
atomic_load from being implemented as a read-modify-write operation.

David
------------------------------------------------------------------

My original email:
------------------------------------------------------------------
Dear David Keaton,

After reviewing the proposed change DR 459 for C11:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/summary.htm#dr_459 , I
identified that adding the const qualifier to atomic_load (C11 implements it
without it) may actually be harmful in some cases. Particularly, for
double-width (128-bit) atomics found in x86-64 (cmpxchg16b instruction) and
arm64 (ldaxp/staxp instructions), it is currently only possible to implement
atomic_load for 128 bits using the corresponding read-modify-write
instructions (i.e., potentially rewriting memory with the same value, but, in
essence, not changing it). But these implementations will not work on
read-only memory. Similar concerns apply to some extent to x86 and arm32 for
double-width (64-bit) atomics. Otherwise, there is no obstacle to
implementing all C11 atomics for corresponding types in these architectures.
Moreover, a well-known clang/llvm compiler already impl
[Bug target/70490] __atomic_load_n(const __int128 *, ...) generates CMPXCHG16B with no warning
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490

Ruslan Nikolaev changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nruslan_devel at yahoo dot com

--- Comment #6 from Ruslan Nikolaev ---
See also Bug 84563.
[Bug c/84563] GCC interpretation of C11 atomics (DR 459)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

--- Comment #1 from Ruslan Nikolaev ---
See also the discussion in the gcc mailing list.
[Bug c/84563] GCC interpretation of C11 atomics (DR 459)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84563

--- Comment #2 from Ruslan Nikolaev ---
Summary (from the mailing list):

Pros of the proposed approach:

1. Ability to use guaranteed lock-free double-width atomics (when -mcx16 is
specified for x86-64, and always for arm64) in a more or less portable manner
across different supported architectures (without resorting to non-standard
extensions or writing separate assembly code for each architecture).
Hopefully, the behavior may also be made more or less consistent across
different compilers over time; it is already the case for clang/llvm. As
mentioned, double-width lock-free atomics have real practical use (ABA tags
for pointers).

2. More likely to find a bug immediately if a programmer tries to do
something that is not guaranteed by the standard (i.e., getting a segfault on
read-only memory when using a double-width atomic_load). This is true even if
-mcx16 is not used, as most CPUs have cmpxchg16b, and libatomic will use it.
On the other hand, atomic_load implemented through locks may have
hard-to-find and hard-to-debug issues in signal handlers, interrupt contexts,
etc., when a programmer erroneously assumes that atomic_load is non-blocking.

3. For arm64 the corresponding instructions are always available, so there is
no need for the -mcx16 flag or redirection to libatomic at all (libatomic may
still keep the old implementation for backward compatibility).

4. Faster and easier-to-analyze code when -mcx16 is specified.

5. Ability to tell for sure whether the implementation is lock-free by
checking the corresponding C11 flag when -mcx16 is specified. When
unspecified, the flag will be false to accommodate the worst-case scenario.

6. Consistent behavior everywhere on all platforms regardless of IFFUNC, the
-mcx16 flag, etc. If cmpxchg16b is available, it is always used (platforms
that do not support IFFUNC will use function pointers for redirection). The
only thing the -mcx16 flag changes is removing the indirection to libatomic
and giving a guaranteed lock_free flag for the corresponding types. (BTW, in
practice, if you use the flag, you should know what you are doing already.)

7. Ability to finally deprecate the old __sync builtins and use the new and
more advanced __atomic builtins everywhere.

Cons of the proposed approach:

1. The compiler may place const atomic objects in .rodata. (Avoided by making
sure _Atomic objects with size > 8 are not placed in .rodata, plus clarifying
that casting random .rodata objects to double-width atomics is undefined and
is not allowed.)

2. Backward compatibility concerns if used outside glibc/IFFUNC. Most likely
not an issue even in this case, since all calls there are already redirected
to libatomic anyway, and statically linked binaries will not interact with
new binaries directly.

3. Read-only memory for atomic_load will not be supported for double-width
types. But that is actually better than hiding the problem under the carpet
(the current behavior is actually even worse because it is inconsistent
across different platforms, i.e. different for x86-64 on Linux and arm64).
Anyway, it is better to use a lock-based approach explicitly if for whatever
reason it is preferable (read-only memory, performance (?), etc.).
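As an illustration of the ABA-tag use case from item 1 of the pros (a sketch
with hypothetical type and function names, not code from this report):

#include <stdatomic.h>
#include <stdint.h>

struct node;

/* A pointer paired with a version tag; both are updated together with one
   double-width CAS so that a recycled pointer cannot be mistaken for the
   old one (the ABA problem). */
typedef struct {
    struct node *ptr;
    uint64_t     tag;
} tagged_ptr_t;

static _Bool replace_head(_Atomic(tagged_ptr_t) *head, tagged_ptr_t observed,
                          struct node *new_node)
{
    tagged_ptr_t desired = { new_node, observed.tag + 1 };
    /* Lock-free only if the compiler inlines cmpxchg16b (e.g. with -mcx16)
       rather than calling libatomic. */
    return atomic_compare_exchange_weak(head, &observed, desired);
}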
[Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #18 from Ruslan Nikolaev ---
(In reply to andysem from comment #17)
> I'll clarify why I think load() should be allowed to issue writes on the
> memory. According to [atomics.types.operations]/18 in N4713,
> compare_exchange_*() is a load operation if the comparison fails, yet we
> know cmpxchg (even the ones more narrow than cmpxchg16b) always writes, so
> we must assume a load operation may write. I do not find a definition of a
> "load operation" in the standard and [atomics.types.operations]/12 and 13
> avoid this term, saying that load() "Atomically returns the value pointed to
> by this." Again, it doesn't say anything about writes to the memory.
>
> So, if compare_exchange_*() is allowed to write on failure, why load()
> shouldn't be? Either compare_exchange_*() issuing writes is a bug (in which
> case a lock-free CAS can't be implemented on x86 at all) or writes in load()
> should be allowed and the change wrt. cmpxchg16b should be reverted.

I think there is way too much over-thinking about the read-only case for
128-bit atomics. The current solution is very confusing and not very well
documented, at the very least. Correct me if I am wrong, but does the current
solution guarantee address-freedom? If not, what is the motivation to support
128-bit read-only atomics? The only practical use case seems to be IPC where
one process has read-only access. If that is not guaranteed for 128-bit, why
even bother to support the read-only case, which is (a) not guaranteed to be
lock-free and (b) works only within a single process, where it is easy to
control the read-only behavior?

I really prefer the way it was implemented in clang: it only redirects if
-mcx16 is not specified. BTW, clang also provides a very nice implementation
for AArch64, which GCC is lacking.
[Bug c++/107958] New: Ambiguity with uniform initialization in overloaded operator and explicit constructor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107958

            Bug ID: 107958
           Summary: Ambiguity with uniform initialization in overloaded
                    operator and explicit constructor
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: nruslan_devel at yahoo dot com
  Target Milestone: ---

Suppose we have the following example with a class that has to keep two
pointers. (I actually encountered this error with a more complex example, but
this one is just for illustration.)

The problem arises when I attempt to use the assignment operator and curly
braces. If I understand correctly, two possibilities exist when passing curly
braces:

1. Use the overloaded operator= (implicitly convert the curly braces to a
pair). In this particular example, we could probably have used make_pair, but
I deliberately put curly braces to show how this error is triggered.

2. Use the constructor to create a new PairPtr instance and then copy it to
the old object through operator=.

Both clang and gcc complain unless I mark the corresponding constructor as
'explicit'. To avoid the ambiguity with the second case, I mark the
corresponding constructor as 'explicit' and expect the overloaded operator=
to be used. That seems to work with clang/llvm but not with gcc (see the
error below).

#include <utility>

struct PairPtr {
    PairPtr() {}
    PairPtr(const PairPtr &s) {
        a = s.a;
        b = s.b;
    }
    explicit PairPtr(int *_a, int *_b) {
        a = _a;
        b = _b;
    }
    PairPtr& operator=(const PairPtr &s) {
        a = s.a;
        b = s.b;
        return *this;
    }
    PairPtr& operator=(const std::pair<int*, int*>& pair) {
        a = pair.first;
        b = pair.second;
        return *this;
    }
    int *a;
    int *b;
};

void func(int *a, int *b)
{
    PairPtr p;
    p = { a, b };
}

Error (note that clang/llvm can compile the above code successfully!). Also
note that 'explicit' on the constructor fixes the problem for clang/llvm but
does not fix it for gcc:

2.cpp: In function ‘void func(int*, int*)’:
2.cpp:38:20: error: ambiguous overload for ‘operator=’ (operand types are
‘PairPtr’ and ‘<brace-enclosed initializer list>’)
   38 |     p = { a, b };
      |                ^
2.cpp:18:18: note: candidate: ‘PairPtr& PairPtr::operator=(const PairPtr&)’
   18 |     PairPtr& operator=(const PairPtr &s) {
      |              ^~~~~~~~
2.cpp:24:18: note: candidate: ‘PairPtr& PairPtr::operator=(const
std::pair<int*, int*>&)’
   24 |     PairPtr& operator=(const std::pair<int*, int*>& pair) {
      |
[Bug c++/107958] Ambiguity with uniform initialization in overloaded operator and explicit constructor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107958

--- Comment #9 from Ruslan Nikolaev ---
Interestingly, if I change the code a little bit and have a pair in the
constructor rather than two arguments, gcc seems to compile the code:

#include <utility>

struct PairPtr {
    PairPtr() {}
    PairPtr(const PairPtr &s) {
        a = s.a;
        b = s.b;
    }
    explicit PairPtr(const std::pair<int*, int*>& pair) {
        a = pair.first;
        b = pair.second;
    }
    PairPtr& operator=(const PairPtr &s) {
        a = s.a;
        b = s.b;
        return *this;
    }
    PairPtr& operator=(const std::pair<int*, int*>& pair) {
        a = pair.first;
        b = pair.second;
        return *this;
    }
private:
    int *a;
    int *b;
};

void func(int *a, int *b)
{
    PairPtr p({a, b});  // works
    p = { a, b };       // also works
}
[Bug c++/107958] Ambiguity with uniform initialization in overloaded operator and explicit constructor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107958

--- Comment #10 from Ruslan Nikolaev ---
The latter example seems to work well for both gcc and clang. The behavior is
also consistent for both explicit and implicit constructors. Thank you for
clarifying that it was not a bug!