Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics
On 7/5/23 03:05, Boqun Feng wrote:

On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
[...]

NOTE: On x86-64, we found at least one corner case [7] with Clang where a RELEASE exchange is optimized to a RELEASE store when the returned value of the exchange is unused, breaking the above expectations. Although this type of optimization respects the standard "as-if" rule, we question its pertinence, since a user should simply do a RELEASE store instead of an exchange in that case. With the introduction of these new primitives, this type of optimization should be revisited.

FWIW, this is actually an LLVM bug: https://github.com/llvm/llvm-project/issues/60418

So it was more than a dubious optimization; it is actually broken as well. I am worried about adding to the compiler's ability to optimize those atomics because of the subtle corner cases/bugs that can creep up.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
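A minimal sketch of the pattern in question (names are illustrative only, not liburcu code): a RELEASE exchange whose return value is discarded, next to the RELEASE store a user should write when no exchange is actually needed.

static int flag;

void publish_exchange(void)
{
	/* Return value discarded: Clang may lower this to a plain RELEASE
	 * store, losing the full-barrier property that the lock-prefixed
	 * instruction would otherwise provide on x86-64. */
	(void) __atomic_exchange_n(&flag, 1, __ATOMIC_RELEASE);
}

void publish_store(void)
{
	/* What a user should write when only a RELEASE store is intended. */
	__atomic_store_n(&flag, 1, __ATOMIC_RELEASE);
}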
Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics
On 7/4/23 06:23, Jonathan Wakely wrote:

On Tue, 4 Jul 2023 at 10:47, Peter Zijlstra wrote:

On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:

  int x = 0;
  int y = 0;
  int r0, r1;

  int dummy;

  void t0(void)
  {
          __atomic_store_n(&x, 1, __ATOMIC_RELAXED);
          __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
          __atomic_thread_fence(__ATOMIC_SEQ_CST);
          r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
  }

  void t1(void)
  {
          __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
          __atomic_thread_fence(__ATOMIC_SEQ_CST);
          r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
  }

  // BUG_ON(r0 == 0 && r1 == 0)

On x86-64 (gcc 13.1 -O2) we get:

  t0():
          movl    $1, x(%rip)
          movl    $1, %eax
          xchgl   dummy(%rip), %eax
          lock orq $0, (%rsp)   ;; Redundant with previous exchange.
          movl    y(%rip), %eax
          movl    %eax, r0(%rip)
          ret

  t1():
          movl    $1, y(%rip)
          lock orq $0, (%rsp)
          movl    x(%rip), %eax
          movl    %eax, r1(%rip)
          ret

So I would expect the compilers to do better here. They should know those __atomic_thread_fence() thingies are superfluous and simply not emit them. This could even be done as a peephole pass later, where it sees consecutive atomic ops and the second is a no-op.

Right, I don't see why we need a whole set of new built-ins that say "this fence isn't needed if the adjacent atomic op already implies a fence". If the adjacent atomic op already implies a fence for a given ISA, then the compiler should already be able to elide the explicit fence. So just write your code with the explicit fence, and rely on the compiler to optimize it properly. Admittedly, today's compilers don't do that optimization well, but they also don't support your proposed built-ins, so you're going to have to wait for compilers to make improvements either way.

Emitting the redundant fences is the plan we have for liburcu. The current situation unfortunately requires users to choose between generating inefficient code with C11 and implementing their own inline assembler until the compilers catch up. https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html discusses how compilers could (and should) better optimize around atomics.

Our understanding of the C11/C++11 memory model is that it aims at defining the weakest possible guarantees for each ordering, so as to be as efficient as possible on weakly ordered architectures. However, when writing portable code in practice, the C11/C++11 memory model forces the programmer to insert memory fences which are redundant on strongly ordered architectures.

We want something that can apply across procedures from different modules: e.g. a mutex lock operation (glibc) has acquire semantics implemented with an RMW operation, which the caller could promote to a full fence. The peephole optimizations cannot do this because they focus on a single basic block. PRE can apply across procedures, but would rely on LTO and possibly function annotations across modules. I am not aware of any progress in that research field in the past 6 years [1-2].

The new atomic builtins we propose allow users to better express their intent to the compiler, allowing for better code generation and thereby reducing the number of redundant fences emitted, without having to rely on optimizations.

It should be noted that the builtin extensions we propose are not entirely free. Here are our perceived downsides of introducing those APIs:

- They add complexity to the atomic builtins API.

- They add constraints which need to be taken into account for future architecture-specific backend optimizations, for example the (broken) xchg RELEASE | RELAXED -> store on x86 (Clang) [3].
If an atomic op class (e.g. RMW) can be optimized to a weaker instruction by the architecture backend, then the emission of a before/after fence associated with that class of atomic op must be pessimistic and assume the weakest instruction pattern which can be generated.

There are optimizations of atomics and redundant fences in Clang. The redundant-fence optimizations appear to be limited to a peephole, which does not appear to leverage the fact that lock-prefixed atomic operations act as implicit fences on x86. Perhaps this could be low-hanging fruit for optimization. We have not observed any similar optimizations in gcc as of today, which appears to be a concern for many users [4-7].

Thanks,

Mathieu

[1] https://dl.acm.org/doi/10.1145/3033019.3033021
[2] https://reviews.llvm.org/D5758
[3] https://github.com/llvm/llvm-project/issues/60418
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86056
[5] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68622
[6] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86072
[7] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
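A sketch (not liburcu's actual implementation) of the "emit the explicit fences and let the compiler elide them" approach discussed above: an LKMM-style fully ordered exchange expressed with the existing builtins.

static inline int xchg_full_fence(int *p, int v)
{
	int old;

	__atomic_thread_fence(__ATOMIC_SEQ_CST);        /* smp_mb() before */
	old = __atomic_exchange_n(p, v, __ATOMIC_SEQ_CST);
	__atomic_thread_fence(__ATOMIC_SEQ_CST);        /* smp_mb() after */

	/* On x86-64 the SEQ_CST exchange is already a full barrier, so both
	 * explicit fences are redundant; on weakly ordered architectures they
	 * provide the before/after ordering LKMM guarantees around xchg(). */
	return old;
}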
Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics
On 7/10/23 10:32, Jonas Oberhauser wrote:

Am 7/7/2023 um 7:25 PM schrieb Olivier Dion:

On Fri, 07 Jul 2023, Jonas Oberhauser wrote:
[...]

This is a request for comments on extending the atomic builtins API to help avoid redundant memory barriers. Indeed, there are discrepancies between the Linux kernel memory consistency model (LKMM) and the C11/C++11 memory consistency model [0]. For example, fully ordered atomic operations such as xchg and successful cmpxchg in LKMM have implicit memory barriers before/after the operation [1-2], while atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11 do not provide the ordering guarantees of an __ATOMIC_SEQ_CST atomic thread fence with respect to other non-SEQ_CST operations [3].

The issues run quite a bit deeper than this. The two models take two completely different perspectives that are largely incompatible.

Agreed. Our intent is not to close the gap completely, but to reduce the gap between the two models, by supporting the "full barrier before/after" semantic of LKMM in the C11/C++11 memory model.

Hi Jonas,

I think what you're trying to achieve has nothing to do with the gap at all. (But do check out the IMM paper https://plv.mpi-sws.org/imm/ for what is involved in bridging the gap between LKMM-like and C11-like models.) What you're trying to achieve is to certify some urcu algorithms without making the code unnecessarily slower.

You are correct. Ideally we also want to move away from maintaining our own implementation of atomic operations in liburcu and just rely on compiler atomic builtins to reduce overall complexity.

Your problem is that the algorithm is implemented using the LKMM API, and you want to check it with a tool (TSAN) meant to (dynamically) analyze C11 code that relies on a subset of C11's memory model.

There are two aspects to this:

1) The first aspect is with respect to the implementation of liburcu per se. We have recently implemented changes to the liburcu master branch which add support for C11 atomics and modify the liburcu algorithms to add all hb relationships needed to make TSAN understand liburcu's RCU and lock-free data structure implementations.

2) The second aspect is maintaining compatibility of a public API for external users of liburcu. This is the main part that hurts when users build liburcu with the new "--enable-compiler-atomic-builtins" configure option, because external users relying on LKMM-style atomic operations will observe a performance degradation due to those duplicated barriers. This is the main target of the proposal discussed here.

What I still don't understand is whether using TSAN as-is is a formal requirement of the certification you are trying to achieve, or whether you could either slightly modify the TSAN toolchain to give answers consistent with the behavior on LKMM, or use a completely different tool.

Our requirement comes from ISC for their use of liburcu as a dependency of the BIND9 DNS server. They use TSAN to validate BIND9, and asked us to work on making liburcu's implementation understood by TSAN. In our evaluation, the most direct route towards this goal was to implement liburcu atomics in terms of C11 atomics, and adapt liburcu algorithms to express hb relationships (e.g. release/acquire) in terms of C11 where they were unclear.
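A generic sketch (not actual liburcu code) of what expressing such an hb relationship with C11 release/acquire looks like, so that TSAN can track the synchronization instead of reporting a race on the payload:

#include <stdatomic.h>

struct node {
	int payload;
};

static struct node *_Atomic head;

void producer(struct node *n)
{
	n->payload = 42;                                        /* plain store */
	atomic_store_explicit(&head, n, memory_order_release);  /* publish */
}

int consumer(void)
{
	struct node *n = atomic_load_explicit(&head, memory_order_acquire);

	/* The acquire load synchronizes with the release store above, so the
	 * payload write happens before this read, and TSAN sees no race. */
	return n ? n->payload : -1;
}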
This effort has various benefits, including:

- Add the ability to validate liburcu's implementation and use with TSAN,
- Self-document liburcu's code by clearly expressing hb relationships,
- Reduce overall complexity by using C11 atomics rather than maintaining custom assembler for each supported architecture.

For example, you could eliminate the worry about the unnecessary barriers by including the extra barriers only in the TSAN' (modified TSAN) analysis. In that case TSAN' adds additional, redundant barriers in some cases during the analysis process, but those barriers would be gone the moment you stop using TSAN'. You would need to argue that this additional instrumentation doesn't hide any data races, but I suspect that wouldn't be too hard.

We have made the changes to liburcu that were needed to make TSAN happy; in order to let it keep track of hb relationships within the liburcu implementation, this involved introducing release/acquire semantics in the code. I prefer this fine-grained approach to the big hammer of adding extra TSAN barriers, because the latter approach prevents TSAN from having a deep understanding of the hb relationships (AFAIU it can introduce false negatives).

Another possibility is to use a tool like Dat3M that supports LKMM to try and verify your code, but I'm not sure if these tools are feature-complete enough to verify the specific algorithms you have in mind (e.g., mixed-size accesses are an issue, and Paul told me there are a few of those in (U)RCU). But maybe the cost of changing the code to full-sized accesses might be cheaper
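For reference, a minimal sketch (macro names hypothetical) of the "extra barriers only during analysis" idea described above: the additional fence is compiled in only when ThreadSanitizer instrumentation is active, and disappears from regular builds.

#if defined(__SANITIZE_THREAD__)                 /* gcc -fsanitize=thread */
# define TSAN_ANALYSIS_BUILD 1
#elif defined(__has_feature)
# if __has_feature(thread_sanitizer)             /* clang -fsanitize=thread */
#  define TSAN_ANALYSIS_BUILD 1
# endif
#endif

#ifdef TSAN_ANALYSIS_BUILD
# define extra_analysis_mb()    __atomic_thread_fence(__ATOMIC_SEQ_CST)
#else
# define extra_analysis_mb()    do { } while (0)
#endif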
Re: [lttng-dev] New TLS usage in libgcc_s.so.1, compatibility impact
On 2024-01-13 07:49, Florian Weimer via lttng-dev wrote:

This commit

commit 8abddb187b33480d8827f44ec655f45734a1749d
Author: Andrew Burgess
Date:   Sat Aug 5 14:31:06 2023 +0200

    libgcc: support heap-based trampolines

    Add support for heap-based trampolines on x86_64-linux, aarch64-linux, and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and __builtin_nested_func_ptr_deleted functions for these targets.

    Co-Authored-By: Maxim Blinov
    Co-Authored-By: Iain Sandoe
    Co-Authored-By: Francois-Xavier Coudert

added TLS usage to libgcc_s.so.1. The way that libgcc_s is currently built, it ends up using a dynamic TLS variant on the Linux targets. This means that there is no up-front TLS allocation with glibc (but there would be one with musl).

Trying to wrap my head around this: if I get this right, the previous behavior was that glibc allocated global-dynamic variables from libraries which are preloaded or loaded at program startup as if they were initial-exec, but now that libgcc_s.so.1 has a dynamic TLS variable, all those libraries loaded at program startup that have global-dynamic TLS no longer get the initial-allocation special treatment. Is that more or less correct? (Note: it's entirely possible that my understanding is entirely wrong, please correct me if that's the case.)

There is still a compatibility impact because glibc assigns a TLS module ID upfront. This seems to be what causes the ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail. We end up with an infinite regress during process termination because libgcc_s.so.1 has been loaded, resulting in a DTV update. When this happens, the bottom of the stack looks like this:

#4447 0x77f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
#4448 0x77fdb142 in free (ptr=) at ../include/rtld-malloc.h:50
#4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822
#4450 0x77fdb214 in update_get_addr (ti=0x77f2bfc0, gen=) at ../elf/dl-tls.c:916
#4451 0x77fddccc in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#4452 0x77f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
#4453 0x77fdb142 in free (ptr=) at ../include/rtld-malloc.h:50
#4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#4455 0x77fdb214 in update_get_addr (ti=0x77f39fa0, gen=) at ../elf/dl-tls.c:916
#4456 0x77fddccc in __tls_get_addr () at ../sysdeps/x86_64/tls_get_addr.S:55
#4457 0x77f36113 in lttng_ust_cancelstate_disable_push () from /lib64/liblttng-ust-common.so.1
#4458 0x77f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1
#4459 0x77f5175a in lttng_ust_cleanup () from /lib64/liblttng-ust.so.1
#4460 0x77fca0f2 in _dl_call_fini (closure_map=closure_map@entry=0x77fbe000) at dl-call_fini.c:43
#4461 0x77fce06e in _dl_fini () at dl-fini.c:114
#4462 0x77d82fe6 in __run_exit_handlers () from /lib64/libc.so.6

Cc:ing for awareness.

I've prepared a change for lttng-ust to move the lttng-ust libc wrapper "malloc nesting" guard variable from global-dynamic to initial-exec:

https://review.lttng.org/c/lttng-ust/+/11677
Fix: libc wrapper: use initial-exec for malloc_nesting TLS

This should help with the infinite recursion issue. But if my understanding is correct about the impact of this libgcc change, namely that it effectively changes the behavior used for global-dynamic variables in preloaded and startup-loaded libraries, then I suspect we have other new issues here, such as problems with the async-signal safety of other global-dynamic variables within LTTng-UST.
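A simplified sketch (not the actual lttng-ust wrapper) of the nesting-guard pattern, with the guard in initial-exec TLS so that reading or updating it never triggers dynamic TLS (DTV) allocation and therefore cannot recurse into malloc:

#include <stddef.h>

/* Guard in initial-exec TLS: accessing it does not go through
 * __tls_get_addr()/DTV allocation. */
static __thread int malloc_nesting
	__attribute__((tls_model("initial-exec")));

/* Pointer to the real allocator, e.g. resolved with dlsym(RTLD_NEXT,
 * "malloc") during initialization (omitted here). */
static void *(*real_malloc)(size_t size);

void *malloc(size_t size)
{
	void *ptr;

	malloc_nesting++;
	ptr = real_malloc(size);
	if (malloc_nesting == 1) {
		/* Outermost call: safe to run instrumentation which may
		 * itself allocate or touch global-dynamic TLS; nested calls
		 * skip it and only call the real allocator. */
	}
	malloc_nesting--;
	return ptr;
}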
But moving all TLS variables used by lttng-ust from global-dynamic to initial-exec is tricky, because a prior attempt to do so introduced regressions in use-cases where lttng-ust was dlopen'd by Java or Python: AFAIU, situations where the runtimes were already using most of the extra memory pool reserved for initial-exec variables of dlopen'd libraries, causing the dlopen of lttng-ust to fail.

Thanks Florian for letting us know about this,

Mathieu

The issue also requires a recent glibc with changes to DTV management: commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls access after dlopen [BZ #19924]"). If I understand things correctly, before this glibc change, we didn't deallocate the old DTV, so there was no call to the free function.

On the glibc side, we should recommend that intercepting mallocs and their dependencies use initial-exec TLS, because that kind of TLS does not use malloc. If intercepting mallocs using dynamic TLS works at all, that's totally by accident, and was in the past helped by glibc bug 19924. (I don't think there is anything special about libgcc_s.so.1 that triggers the test failure above; it is just an object with dynamic TLS that is implicitly loaded via dlopen at the right stage of the test.) In this
Re: [lttng-dev] New TLS usage in libgcc_s.so.1, compatibility impact
On 2024-01-15 14:42, Florian Weimer wrote:

* Mathieu Desnoyers:
[...]

General use of lttng should be fine, I think; only the malloc wrapper has this problem.

The purpose of the nesting counter TLS variable in the malloc wrapper is to catch situations like this, where a global-dynamic TLS access (or any unexpected memory access done as a side effect of calling libc) from within LTTng-UST instrumentation would internally attempt to call recursively into the malloc wrapper. In that nested case, we skip the instrumentation and call the libc function directly. I agree with your conclusion that only this nesting counter gating variable actually needs to be initial-exec.

But moving all TLS variables used by lttng-ust from global-dynamic to initial-exec is tricky, because a prior attempt to do so introduced regressions in use-cases where lttng-ust was dlopen'd by Java or Python: AFAIU, situations where the runtimes were already using most of the extra memory pool reserved for initial-exec variables of dlopen'd libraries, causing the dlopen of lttng-ust to fail.

Oh, right, that makes it quite difficult. Could you link a private copy of the libraries into the wrapper that uses initial-exec TLS?

Unfortunately not easily, because by design LTTng-UST is meant to be a per-process singleton. Changing this would have far-reaching impacts on interactions with the LTTng-UST tracepoint instrumentation, as well as impacts on synchronization between the LTTng-UST agent thread and applications calling fork/clone. Also, AFAIR, the LTTng session daemon (at least until recently) does not expect multiple concurrent registrations from a given process.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com