https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106772
--- Comment #21 from Thomas Rodgers <rodgertq at gcc dot gnu.org> ---
(In reply to Mkkt Bkkt from comment #16)
> > it with sarcasm
>
> I started with sarcasm because you restart this thread with some doubtful
> benchmarks without code for them.
>
> I think it's very disrespectfully.

I wasn't being sarcastic in what I posted. As I noted, this question has come up before in different contexts. Bugzilla is a useful historical archive, so updating this issue with my reasoning and a bit of data was primarily a capture task.

So, let's try this again. I did not post the original source because it required hacking on the libstdc++ headers. I have now posted a version that does not require that; the results are identical. The test is the example Jonathan cited in comment #14: incrementing an atomic int and calling notify.

This isn't about semaphore or any other synchronization primitive. Those types are free to make different choices that may be more appropriate to the constrained usage of the type, just as Lewis' lightweight manual-reset event does (I will also note that Lewis has reviewed this implementation and has written a paper to be discussed at the Fall meeting, p2616).

There are, as Jonathan has pointed out, use cases where notify can and will be called without the notifier having any way to determine whether it will wake a waiter. One example that I, as the person who is going to have to implement C++26 executors, care about is a wait-free work-stealing deque. Algorithmic correctness requires nothing more than spinning for work on an empty queue. But after spinning on an empty queue, making the rounds trying to steal work from other deques, and maybe spinning a bit more just to be sure, a dequeuing thread that has not been able to acquire more work will probably want to enter a wait until it knows it can do productive work. Another thread pushing work into that queue cannot determine whether the dequeuing thread is spinning for new work, work stealing, or has entered a wait; but atomic<int>::notify() does know, and can avoid penalizing the submitting thread with a syscall when there is no thread in a wait on the other end of the deque, which is the expected case for this algorithm.

p1135 was the paper that added atomic wait/notify. One of the co-authors of that paper wrote the libc++ implementation. That implementation, as with libstdc++'s, is not simply a wrapper over the underlying platform's wait primitive. It has two 'backends': an exponentially timed backoff and ulock wait/wake. libstdc++ currently has futex and condvar backends. Both implementations choose a short-term spinning strategy and a long-term waiting strategy (spinning, then futex/ulock or condvar).

I have confirmed with the libc++ implementation's author (who also chairs the C++ committee's Concurrency and Parallelism study group) that it was never the intention of p1135, or of the subsequently standardized language in C++20, to imply that wait/notify were direct, portable analogs of the platform's waiting primitives. There are users, such as yourself, who want exactly that; there are other users (like in my prior industry) for whom busy waiting is the desired strategy; and in between those two choices are people who want it to work as advertised in the standard, and to do so 'efficiently'. Both libc++ and libstdc++ take a balanced approach somewhere between always going to the OS and always spinning.
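To make the spin-then-wait and empty-waiter-check ideas concrete, here is a minimal sketch. The names (waiter_state, wait_until_changed, increment_and_notify) and the spin count are hypothetical, and std::atomic::wait stands in for the futex/condvar backend; this is an illustration of the technique, not the libstdc++ internals.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    // Hypothetical type; this is not the libstdc++ internal structure.
    struct waiter_state {
        std::atomic<int> waiters{0};  // threads currently parked in the OS wait
    };

    void wait_until_changed(waiter_state& w, std::atomic<int>& a, int old) {
        // Phase 1: short-term spin; an arriving notify is often observed
        // here, with no syscall on either side.
        for (int i = 0; i < 64; ++i) {
            if (a.load() != old)
                return;
            std::this_thread::yield();
        }
        // Phase 2: long-term wait. Register as a waiter, then block;
        // std::atomic::wait stands in for the futex/condvar backend.
        w.waiters.fetch_add(1);
        while (a.load() == old)
            a.wait(old);
        w.waiters.fetch_sub(1);
    }

    void increment_and_notify(waiter_state& w, std::atomic<int>& a) {
        a.fetch_add(1);
        // The empty-waiter check: only pay for the wake syscall when some
        // thread has actually parked itself. Both sides use the default
        // seq_cst operations, which is what prevents a lost wakeup here.
        if (w.waiters.load() != 0)
            a.notify_all();
    }

    int main() {
        waiter_state w;
        std::atomic<int> a{0};
        std::thread t([&] {
            wait_until_changed(w, a, 0);
            std::puts("woke");
        });
        increment_and_notify(w, a);
        t.join();
    }

In the expected case for the deque algorithm above (no parked waiter), increment_and_notify is a single atomic RMW plus one atomic load, and never enters the kernel.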
There is an open question here that your original issue raises:

* At what point do collisions on the waiter pool, with the cache-invalidation traffic and spurious wakeups that result, swamp the gain of doing this empty-waiter check on notify?

I also made a comment about the 'size of the waiter pool notwithstanding'. I chose a smaller size than libc++ chose, in part because Jonathan and I did not want to make any sort of ABI commitment until this had been in a few GCC releases; this implementation is header-only at present and still considered experimental, whereas libc++ committed to an ABI early. In the sizing of the libc++ waiter pool there is the comment that 'there is no magic in this value'. Not only is there no magic, there has been no testing of any sort, by me or on libc++, to determine what effect different pool sizes have under different load scenarios. So all of it is a guess at this point. I will likely match libc++ when I do move this into the .so. (A sketch of this kind of pool follows at the end of this comment.)

Finally, in the most recent mailing there is p2643, which proposes additional changes to atomic waiting. One proposal is to add a 'hinted' wait that can allow the caller to steer the choices atomic wait/notify makes. I have conferred with the other authors of the paper; this latter option is not without controversy, and likely has some sharp edges for the user, but I plan to raise the discussion at the Fall WG21 meeting to see what the other members of SG1 think.
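To illustrate the collision question, here is a simplified sketch of an address-hashed waiter pool. The table size, hash function, and names are placeholders I made up for illustration, not the libstdc++ or libc++ ABI values.

    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <cstdio>

    // A cache-line-padded waiter counter for one bucket.
    struct pool_entry {
        alignas(64) std::atomic<int> waiters{0};
    };

    constexpr std::size_t pool_size = 16;  // placeholder; a power of two
    pool_entry waiter_pool[pool_size];

    // Every atomic object maps to a bucket by hashing its address.
    pool_entry& entry_for(const void* addr) {
        auto h = reinterpret_cast<std::uintptr_t>(addr);
        return waiter_pool[(h >> 6) & (pool_size - 1)];
    }

    int main() {
        std::atomic<int> a{0}, b{0};
        // Unrelated atomics can hash to the same bucket. When they do, a
        // notify on 'a' reads (and a wake on 'a' wakes) the waiters of
        // 'b' as well: that is the cache-invalidation traffic and the
        // spurious wakeups, and a smaller table makes collisions more
        // likely under load.
        std::printf("same bucket: %s\n",
                    &entry_for(&a) == &entry_for(&b) ? "yes" : "no");
    }

The trade-off the open question asks about is exactly the table size: a larger table means fewer collisions but a bigger committed footprint once the pool moves into the .so and becomes ABI.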