[Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0

2022-11-03 Thread admin_public at liblfds dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #40 from admin_public at liblfds dot org ---
On 03/11/2022 12:04, jakub at gcc dot gnu.org wrote:
> --- Comment #38 from Jakub Jelinek  ---
> Please see PR104688.  We got a response from Intel, where they guaranteed
> atomicity of certain 16-byte load instructions for Intel CPUs with AVX
> support.

Now, it's been quite a long time since I've delved into lock-free, and I have
reason to doubt my earlier understanding anyway - so I may be *completely*
wrong - but, as I recall and as I understood it, the "usual" (i.e. non-AVX)
atomic operations are essential in that they force the honouring of any
previously issued read/write barriers, because they force a read from, and a
write to, memory (well, I say memory - I mean at least out to the cache
coherency protocol).

Will AVX do the same?

> The current state is that on the libatomic side, when ifuncs are possible, we
> use those atomic loads etc. on Intel with AVX, and do what we used to do
> before for other CPUs.

Yes.  As I recall, this is the problem for me - if such lock-free support is
not available, mutexes or some such are used instead, and this is absolutely
*not* okay, because their properties are completely different; if I have a
lock-free data structure and I'm using it in the kernel and I'm not allowed to
sleep, I *can't* use a sleep-based locking mechanism.

Lock-free has unique properties, and when those properties are needed but not
available, the only option is to fail to compile/build/run.
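
For what it's worth, this is the behaviour I try to obtain myself - a sketch
only, assuming the GCC/Clang builtins; the check is mine rather than anything
libatomic provides as a user-facing API:

#include <stdint.h>

typedef unsigned __int128 u128;

/* Build-time: refuse to compile unless the compiler can promise that 16-byte
   operations on suitably aligned objects are always lock-free. */
_Static_assert(__atomic_always_lock_free(sizeof(u128), 0),
               "16-byte atomics are not unconditionally lock-free here");

/* Run-time alternative, for the ifunc case where the answer depends on the
   CPU actually present: ask about the specific object and refuse to run if
   the answer is no (i.e. libatomic would fall back to locking). */
static int atomics_usable(u128 *p)
{
    return __atomic_is_lock_free(sizeof *p, p);
}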

[Bug target/80878] -mcx16 (enable 128 bit CAS) on x86_64 seems not to work on 7.1.0

2023-12-10 Thread admin_public at liblfds dot org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

--- Comment #43 from admin_public at liblfds dot org ---
> I tested CMPXCHG16B with inline assembly on an i7-1165G7 (Dell XPS 13 9305) 
> and it turned out to be much slower than CMPXCHG, even slower than a pair of 
> calls to `pthread_mutex_lock()` and unlock.

Mutexes are faster when single-threaded and there's no contention on the
locking object.  Compare-exchange (8- or 16-byte) is much faster (orders of
magnitude faster) as contention rises.
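
The shape of the comparison I mean, as a sketch only (GCC builtins and
pthreads; not a benchmark harness, and the names are mine): the same 16-byte
update done under a mutex and done with a CAS retry loop.  Single-threaded,
the mutex fast path is cheap; with many threads hammering the same line, the
CAS loop is the one that keeps making progress without sleeping.

#include <pthread.h>
#include <stdint.h>

typedef unsigned __int128 u128;

static u128 counter __attribute__((aligned(16)));
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Mutex-protected update: cheap uncontended fast path, but a potential
   sleep (futex wait) once threads start queueing on the lock. */
static void bump_mutex(void)
{
    pthread_mutex_lock(&lock);
    counter += 1;
    pthread_mutex_unlock(&lock);
}

/* Lock-free update: retry until our CMPXCHG16B wins; never sleeps. */
static void bump_cas(void)
{
    u128 old = __atomic_load_n(&counter, __ATOMIC_RELAXED);
    while (!__atomic_compare_exchange_n(&counter, &old, old + 1,
                                        1 /* weak */,
                                        __ATOMIC_ACQ_REL, __ATOMIC_RELAXED))
        ;   /* on failure, old is refreshed with the current value */
}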

Sometimes you need CAS 16, rather than CAS 8, because of the implementation
requirements of lock-free data structures - the classic case being a pointer
paired with an ABA counter, which have to be swapped as a single unit.
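
A sketch of what I mean, assuming GCC's generic __atomic builtins; the struct
and names are illustrative rather than liblfds code:

#include <stdint.h>

/* Lock-free stack head: pointer plus ABA/generation counter - 16 bytes on
   x86-64, and both fields must be swapped as one unit, hence CAS 16. */
struct head {
    void     *top;   /* current top-of-stack node        */
    uintptr_t aba;   /* bumped on every successful swap  */
} __attribute__((aligned(16)));

/* Push a node whose link field is at *node_link. */
static void push(struct head *h, void *node, void **node_link)
{
    struct head old, new;
    __atomic_load(h, &old, __ATOMIC_RELAXED);
    do {
        *node_link = old.top;      /* link node to the current top         */
        new.top = node;
        new.aba = old.aba + 1;     /* defeats ABA against a concurrent pop */
    } while (!__atomic_compare_exchange(h, &old, &new,
                                        1 /* weak */,
                                        __ATOMIC_RELEASE, __ATOMIC_RELAXED));
}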