Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

2023-07-05 Thread Mathieu Desnoyers via Gcc

On 7/5/23 03:05, Boqun Feng wrote:

On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:
[...]

NOTE: On x86-64, we found at least one corner case [7] with Clang where
a RELEASE exchange is optimized to a RELEASE store when the returned
value of the exchange is unused, breaking the above expectations.
Although this type of optimization respects the standard's "as-if"
rule, we question its pertinence, since a user should simply do a
RELEASE store instead of an exchange in that case.  With the
introduction of these new primitives, this type of optimization should
be revisited.
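
For illustration, a minimal reproducer of the pattern in question could
look like this (variable and function names are only for the example):

   static int ready;

   void publish_flag(void)
   {
           /*
            * Return value intentionally discarded.  When the result is
            * unused, Clang may lower this to a plain release store (a
            * mov on x86-64) rather than a lock-prefixed xchg, losing
            * the implicit full barrier the caller may expect.
            */
           (void) __atomic_exchange_n(&ready, 1, __ATOMIC_RELEASE);
   }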



FWIW, this is actually a LLVM bug:

https://github.com/llvm/llvm-project/issues/60418


So it is more than a dubious optimization; it is actually broken as well.

I am worried about extending the compiler's ability to optimize those 
atomics because of the subtle corner cases and bugs that can creep in.


Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com



Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

2023-07-07 Thread Mathieu Desnoyers via Gcc

On 7/4/23 06:23, Jonathan Wakely wrote:

On Tue, 4 Jul 2023 at 10:47, Peter Zijlstra wrote:


On Mon, Jul 03, 2023 at 03:20:31PM -0400, Olivier Dion wrote:


   int x = 0;
   int y = 0;
   int r0, r1;

   int dummy;

   void t0(void)
   {
           __atomic_store_n(&x, 1, __ATOMIC_RELAXED);

           __atomic_exchange_n(&dummy, 1, __ATOMIC_SEQ_CST);
           __atomic_thread_fence(__ATOMIC_SEQ_CST);

           r0 = __atomic_load_n(&y, __ATOMIC_RELAXED);
   }

   void t1(void)
   {
           __atomic_store_n(&y, 1, __ATOMIC_RELAXED);
           __atomic_thread_fence(__ATOMIC_SEQ_CST);
           r1 = __atomic_load_n(&x, __ATOMIC_RELAXED);
   }

   // BUG_ON(r0 == 0 && r1 == 0)

On x86-64 (gcc 13.1 -O2) we get:

   t0():
           movl    $1, x(%rip)
           movl    $1, %eax
           xchgl   dummy(%rip), %eax
           lock orq $0, (%rsp)   ;; Redundant with previous exchange.
           movl    y(%rip), %eax
           movl    %eax, r0(%rip)
           ret
   t1():
           movl    $1, y(%rip)
           lock orq $0, (%rsp)
           movl    x(%rip), %eax
           movl    %eax, r1(%rip)
           ret


So I would expect the compilers to do better here. They should know those
__atomic_thread_fence() thingies are superfluous and simply not emit
them. This could even be done as a peephole pass later, where the
compiler sees consecutive atomic ops and the second one is a no-op.


Right, I don't see why we need a whole set of new built-ins that say
"this fence isn't needed if the adjacent atomic op already implies a
fence". If the adjacent atomic op already implies a fence for a given
ISA, then the compiler should already be able to elide the explicit
fence.

So just write your code with the explicit fence, and rely on the
compiler to optimize it properly. Admittedly, today's compilers don't
do that optimization well, but they also don't support your proposed
built-ins, so you're going to have to wait for compilers to make
improvements either way.


Emitting the redundant fences is the plan we have for liburcu (see the
sketch below).  The current situation unfortunately requires users to
choose between generating inefficient code with C11 or implementing
their own inline assembler until the compilers catch up.
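
Concretely, the portable C11 fallback ends up conservatively bracketing
the operation with explicit fences, along the lines of this sketch
(illustrative helper name, not the actual liburcu code):

   /*
    * Simplified sketch: an LKMM-style fully-ordered exchange mapped
    * onto C11 builtins.  The explicit fences are needed for
    * portability, but are redundant with the implicit full barrier of
    * the lock-prefixed xchg instruction on x86.
    */
   static inline int lkmm_style_xchg(int *addr, int val)
   {
           int old;

           __atomic_thread_fence(__ATOMIC_SEQ_CST);
           old = __atomic_exchange_n(addr, val, __ATOMIC_SEQ_CST);
           __atomic_thread_fence(__ATOMIC_SEQ_CST);
           return old;
   }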



https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4455.html
discusses that compilers could (and should) optimize around atomics
better.


Our understanding of the C11/C++11 memory model is that it aims at
defining the weakest possible guarantees for each ordering, so as to be
as efficient as possible on weakly ordered architectures.  However, when
writing portable code in practice, the C11/C++11 memory model forces the
programmer to insert memory fences which are redundant on strongly
ordered architectures.

We want something that can apply across procedures from different
modules: e.g. a mutex lock operation (glibc) provides acquire semantics
through an RMW operation which the caller could promote to a full fence
(see the sketch below).  Peephole optimizations cannot do this because
they focus on a single basic block.  PRE can apply across procedures,
but would rely on LTO and possibly on function annotations across
modules.  I am not aware of any progress in that research field in the
past 6 years. [1-2]
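
As an illustration of the cross-module case (function and variable
names are only for the example), a caller that wants full-fence
semantics around a mutex acquisition has to add an explicit fence
today, and no peephole confined to a single translation unit can merge
it with the RMW hidden inside glibc:

   #include <pthread.h>

   static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

   void enter_critical_section(void)
   {
           /* Acquire semantics come from an RMW inside glibc. */
           pthread_mutex_lock(&lock);
           /*
            * Caller-side promotion to a full fence; this fence is
            * always emitted today, even on ISAs where the RMW inside
            * the lock already provides full ordering.
            */
           __atomic_thread_fence(__ATOMIC_SEQ_CST);
   }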

The new atomic builtins we propose allow users to better express their
intent to the compiler, allowing for better code generation and thereby
reducing the number of redundant fences emitted, without having to rely
on optimizations.

It should be noted that the builtin extensions we propose are not
entirely free.  Here are the downsides we perceive in introducing these
APIs:

- They add complexity to the atomic builtins API.

- They add constraints which need to be taken into account for future
  architecture-specific backend optimizations, for example the (broken)
  xchg RELEASE|RELAXED -> store transformation on x86 (Clang) [3].

  If an atomic op class (e.g. RMW) can be optimized to a weaker
  instruction by the architecture backend, then the emission of a
  before/after fence associated with this class of atomic op must be
  pessimistic and assume the weakest instruction pattern which can
  be generated.

There are optimizations of atomics and of redundant fences in Clang.  The
redundant-fence optimizations appear to be limited to a peephole, which
does not leverage the fact that lock-prefixed atomic operations act as
implicit fences on x86.  Perhaps this could be a low-hanging fruit for
optimization.

We have not observed any similar optimizations in gcc as of today, which
appears to be a concern for many users. [4-7]

Thanks,

Mathieu

[1] https://dl.acm.org/doi/10.1145/3033019.3033021
[2] https://reviews.llvm.org/D5758
[3] https://github.com/llvm/llvm-project/issues/60418
[4] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86056
[5] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68622
[6] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86072
[7] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63273

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

Re: [RFC] Bridging the gap between the Linux Kernel Memory Consistency Model (LKMM) and C11/C++11 atomics

2023-08-16 Thread Mathieu Desnoyers via Gcc

On 7/10/23 10:32, Jonas Oberhauser wrote:


On 7/7/2023 7:25 PM, Olivier Dion wrote:
On Fri, 07 Jul 2023, Jonas Oberhauser wrote:

[...]

This is a request for comments on extending the atomic builtins API to
help avoid redundant memory barriers.  Indeed, there are
discrepancies between the Linux kernel memory consistency model (LKMM)
and the C11/C++11 memory consistency model [0].  For example,
fully-ordered atomic operations like xchg and successful cmpxchg in LKMM
have implicit memory barriers before/after the operations [1-2], while
atomic operations using the __ATOMIC_SEQ_CST memory order in C11/C++11
do not provide the ordering guarantees of an atomic thread fence with
__ATOMIC_SEQ_CST with respect to other non-SEQ_CST operations [3].


The issues run quite a bit deeper than this. The two models take two
completely different perspectives that are largely incompatible.

Agreed.  Our intent is not to close the gap completely, but to reduce
the gap between the two models by supporting the "full barrier
before/after" semantics of LKMM in the C11/C++11 memory model (see the
illustration below).
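
For context, here is the kind of LKMM-side construct we have in mind (a
Linux kernel-style fragment, not userspace code; the operand is just an
example):

   /*
    * The smp_mb__before_atomic()/smp_mb__after_atomic() pair upgrades
    * the non-value-returning RMW below to be fully ordered.  On x86
    * both helpers emit no instruction, because the lock-prefixed RMW
    * is already a full barrier.
    */
   smp_mb__before_atomic();
   clear_bit(0, flags);
   smp_mb__after_atomic();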




Hi Jonas,



I think what you're trying to achieve has nothing to do with the gap at 
all. (But do check out the IMM paper https://plv.mpi-sws.org/imm/ for 
what is involved in bridging the gap between LKMM-like and C11-like 
models).


What you're trying to achieve is to certify some urcu algorithms, 
without making the code unnecessarily slower.


You are correct. Ideally we also want to move away from maintaining our
own implementation of atomic operations in liburcu and just rely on
compiler atomic builtins to reduce overall complexity.

Your problem is that the algorithm is implemented using the LKMM API, 
and you want to check it with a tool (TSAN) meant to (dynamically) 
analyze C11 code that relies on a subset of C11's memory model.


There are two aspects to this:

1) The first aspect is with respect to the implementation of liburcu per
se. We have recently implemented changes to the liburcu master branch
which add support for C11 atomics and modify the liburcu algorithms to
add all hb relationships needed to make TSAN understand liburcu's RCU
and lock-free data structures implementation.

2) The second aspect is maintaining the compatibility of liburcu's public
API for external users. This is the part that hurts when users build
liburcu with the new "--enable-compiler-atomic-builtins" configure
option, because external users relying on LKMM-style atomic operations
will observe a performance degradation due to those duplicated barriers.
This is the main target of the proposal discussed here.



What I still don't understand is whether using TSAN as-is is a formal 
requirement from the certification you are trying to achieve, or whether 
you could either slightly modify the TSAN toolchain to give answers 
consistent with the behavior on LKMM, or use a completely different tool.


Our requirement comes from ISC for their use of liburcu as a dependency
of the BIND9 DNS server. They use TSAN to validate BIND9, and asked us
to work on making liburcu's implementation understood by TSAN.

In our evaluation, the most direct route towards this goal was to
implement liburcu atomics in terms of C11 atomics, and adapt liburcu
algorithms to express hb relationships (e.g. release/acquire) in terms
of C11 where they were unclear.

This effort has various benefits, including:

- Add ability to validate liburcu's implementation and use with TSAN,
- Self-document liburcu's code by clearly expressing hb relationships,
- Reduce overall complexity by using C11 atomics rather than maintaining
  custom assembler for each supported architecture.
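
For instance, a publication pattern expressed with C11 release/acquire
lets TSAN see the happens-before edge directly (an illustrative snippet,
not the actual liburcu code; liburcu's rcu_assign_pointer()/
rcu_dereference() pair follows a similar pattern):

   struct node {
           int payload;
   };

   static struct node *head;

   /* Publisher thread. */
   void publish(struct node *n)
   {
           n->payload = 42;
           /* Release store: starts the hb edge seen by TSAN. */
           __atomic_store_n(&head, n, __ATOMIC_RELEASE);
   }

   /* Reader thread. */
   int consume(void)
   {
           /* Acquire load: completes the hb edge. */
           struct node *p = __atomic_load_n(&head, __ATOMIC_ACQUIRE);

           return p ? p->payload : -1;
   }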



For example, you could eliminate the worry about the unnecessary 
barriers by including the extra barriers only in the TSAN' (modified 
TSAN) analysis.
In that case TSAN' adds additional, redundant barriers in some cases 
during the analysis process, but those barriers would be gone the moment 
you stop using TSAN'.


You would need to argue that this additional instrumentation doesn't 
hide any data races, but I suspect that wouldn't be too hard.


We have made the changes to liburcu that were needed to make TSAN happy.
In order to let it keep track of hb relationships within the liburcu
implementation, this involved introducing release/acquire semantics in
the code. I prefer this fine-grained approach to the big hammer of
adding extra TSAN barriers, because the latter approach prevents TSAN
from having a deep understanding of the hb relationships (AFAIU it can
introduce false negatives).



Another possibility is to use a tool like Dat3M that supports LKMM to 
try and verify your code, but I'm not sure if these tools are 
feature-complete enough to verify the specific algorithms you have in 
mind (e.g., mixed-size accesses are an issue, and Paul told me there's a 
few of those in (U)RCU. But maybe the cost of changing the code to 
full-sized accesses might be cheaper

Re: [lttng-dev] New TLS usage in libgcc_s.so.1, compatibility impact

2024-01-15 Thread Mathieu Desnoyers via Gcc

On 2024-01-13 07:49, Florian Weimer via lttng-dev wrote:

This commit

commit 8abddb187b33480d8827f44ec655f45734a1749d
Author: Andrew Burgess 
Date:   Sat Aug 5 14:31:06 2023 +0200

 libgcc: support heap-based trampolines
 
 Add support for heap-based trampolines on x86_64-linux, aarch64-linux,

 and x86_64-darwin. Implement the __builtin_nested_func_ptr_created and
 __builtin_nested_func_ptr_deleted functions for these targets.
 
 Co-Authored-By: Maxim Blinov 

 Co-Authored-By: Iain Sandoe 
 Co-Authored-By: Francois-Xavier Coudert 

added TLS usage to libgcc_s.so.1.  The way that libgcc_s is currently
built, it ends up using a dynamic TLS variant on the Linux targets.
This means that there is no up-front TLS allocation with glibc (but
there would be one with musl).


Trying to wrap my head around this:

If I get this right, the previous behavior was that glibc allocated
global-dynamic TLS variables from libraries which are preloaded or loaded
at program startup as if they were initial-exec, but now that libgcc_s.so.1
has a dynamic TLS variable, all those libraries loaded at program startup
that have global-dynamic TLS do not get the up-front allocation special
treatment anymore.  Is that more or less correct?

(note: it's entirely possible that my understanding is entirely wrong,
please correct me if it's the case)



There is still a compatibility impact because glibc assigns a TLS module
ID upfront.  This seems to be what causes the
ust/libc-wrapper/test_libc-wrapper test in lttng-tools to fail.  We end
up with an infinite regress during process termination because
libgcc_s.so.1 has been loaded, resulting in a DTV update.  When this
happens, the bottom of the stack looks like this:

#4447 0x77f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
#4448 0x77fdb142 in free (ptr=)
 at ../include/rtld-malloc.h:50
#4449 _dl_update_slotinfo (req_modid=3, new_gen=2) at ../elf/dl-tls.c:822
#4450 0x77fdb214 in update_get_addr (ti=0x77f2bfc0,
 gen=) at ../elf/dl-tls.c:916
#4451 0x77fddccc in __tls_get_addr ()
 at ../sysdeps/x86_64/tls_get_addr.S:55
#4452 0x77f288f0 in free () from /lib64/liblttng-ust-libc-wrapper.so.1
#4453 0x77fdb142 in free (ptr=)
 at ../include/rtld-malloc.h:50
#4454 _dl_update_slotinfo (req_modid=2, new_gen=2) at ../elf/dl-tls.c:822
#4455 0x77fdb214 in update_get_addr (ti=0x77f39fa0,
 gen=) at ../elf/dl-tls.c:916
#4456 0x77fddccc in __tls_get_addr ()
 at ../sysdeps/x86_64/tls_get_addr.S:55
#4457 0x77f36113 in lttng_ust_cancelstate_disable_push ()
from /lib64/liblttng-ust-common.so.1
#4458 0x77f4c2e8 in ust_lock_nocheck () from /lib64/liblttng-ust.so.1
#4459 0x77f5175a in lttng_ust_cleanup () from /lib64/liblttng-ust.so.1
#4460 0x77fca0f2 in _dl_call_fini (
 closure_map=closure_map@entry=0x77fbe000) at dl-call_fini.c:43
#4461 0x77fce06e in _dl_fini () at dl-fini.c:114
#4462 0x77d82fe6 in __run_exit_handlers () from /lib64/libc.so.6

Cc:ing  for awareness.


I've prepared a change for lttng-ust to move the lttng-ust libc wrapper
"malloc nesting" guard variable from global-dynamic to initial-exec:

https://review.lttng.org/c/lttng-ust/+/11677 Fix: libc wrapper: use 
initial-exec for malloc_nesting TLS
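
Roughly, the change amounts to annotating the guard variable with the
initial-exec TLS model (a sketch; the exact declaration in the patch
may differ):

   /*
    * Force the nesting guard into the initial-exec TLS model, so that
    * accessing it never goes through __tls_get_addr() and therefore
    * never allocates.  The default for a TLS variable in a shared
    * library would otherwise be global-dynamic.
    */
   static __thread int malloc_nesting
           __attribute__((tls_model("initial-exec")));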

This should help with the infinite recursion issue. However, if my
understanding is correct about the impact of this libgcc change, namely
that it effectively changes the behavior used for global-dynamic variables
in preloaded and on-startup-loaded libraries, then I suspect we have other
new issues here, such as problems with the async-signal safety of other
global-dynamic variables within LTTng-UST.

But moving all TLS variables used by lttng-ust from global-dynamic to
initial-exec is tricky, because a prior attempt to do so introduced
regressions in use-cases where lttng-ust was dlopen'd by Java or Python,
AFAIU situations where the runtimes were already using most of the extra
memory pool reserved for the initial-exec variables of dlopen'd libraries,
causing dlopen of lttng-ust to fail.

Thanks Florian for letting us know about this,

Mathieu



The issue also requires a recent glibc with changes to DTV management:
commit d2123d68275acc0f061e73d5f86ca504e0d5a344 ("elf: Fix slow tls
access after dlopen [BZ #19924]").  If I understand things correctly,
before this glibc change, we didn't deallocate the old DTV, so there was
no call to the free function.

On the glibc side, we should recommend that intercepting mallocs and their
dependencies use initial-exec TLS because that kind of TLS does not use
malloc.  If intercepting mallocs using dynamic TLS work at all, that's
totally by accident, and was in the past helped by glibc bug 19924.  (I
don't think there is anything special about libgcc_s.so.1 that triggers
the test failure above, it is just an object with dynamic TLS that is
implicitly loaded via dlopen at the right stage of the test.)  In this

Re: [lttng-dev] New TLS usage in libgcc_s.so.1, compatibility impact

2024-01-15 Thread Mathieu Desnoyers via Gcc

On 2024-01-15 14:42, Florian Weimer wrote:

* Mathieu Desnoyers:


[...]


General use of lttng should be fine, I think, only the malloc wrapper
has this problem.


The purpose of the nesting counter TLS variable in the malloc wrapper
is to catch situations like this one, where a global-dynamic TLS access
(or any unexpected memory access done as a side effect of calling libc)
from within the LTTng-UST instrumentation would recursively call back
into the malloc wrapper. In that nested case, we skip the
instrumentation and call the libc function directly.

I agree with your conclusion that only this nesting counter gating variable
actually needs to be initial-exec.
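
In outline, the guard works like this (a simplified sketch with
illustrative helper names, not the actual lttng-ust code):

   #include <stddef.h>

   /*
    * Simplified sketch.  The helper names (real_malloc, trace_malloc)
    * are illustrative; the real wrapper resolves the libc function
    * with dlsym(RTLD_NEXT, ...).
    */
   extern void *real_malloc(size_t size);
   extern void trace_malloc(void *ptr, size_t size);

   static __thread int malloc_nesting
           __attribute__((tls_model("initial-exec")));

   void *malloc(size_t size)
   {
           void *ptr;

           malloc_nesting++;
           ptr = real_malloc(size);
           /*
            * Only the outermost call is instrumented; nested calls
            * (e.g. triggered by a TLS access or by libc itself) go
            * straight to libc.
            */
           if (malloc_nesting == 1)
                   trace_malloc(ptr, size);
           malloc_nesting--;
           return ptr;
   }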




But moving all TLS variables used by lttng-ust from global-dynamic to
initial-exec is tricky, because a prior attempt to do so introduced
regressions in use-cases where lttng-ust was dlopen'd by Java or
Python, AFAIU situations where the runtimes were already using most of
the extra memory pool reserved for the initial-exec variables of
dlopen'd libraries, causing dlopen of lttng-ust to fail.


Oh, right, that makes it quite difficult.  Could you link a private copy
of the libraries into the wrapper that uses initial-exec TLS?


Unfortunately not easily, because by design LTTng-UST is meant to be a
per-process singleton. Changing this would have far-reaching impacts on
interactions with the LTTng-UST tracepoint instrumentation, as well as
on synchronization between the LTTng-UST agent thread and applications
calling fork/clone. Also, AFAIR, the LTTng session daemon (at least
until recently) does not expect multiple concurrent registrations from
a given process.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com