https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110016
--- Comment #12 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Let me try again to show the exact sequence of events and why I think this is
not a libstdc++/GCC bug here:

time  thread/core 1                   thread/core N
 -1                                   grab the mutex
  0   atomically load waitingThreads  atomically increment waitingThreads
  1   compare waitingThreads          atomically load finished
  2   atomically set finished to 1    atomically load work.empty() (queueLength)
  3   start of notify_all             branch on finished/queueLength
  4   ...(some code before ...)       start on haveWork.wait
  5   notifies all threads finished   ...(some more code before ...)
  6   ...                             waiting now
  7   start of joins                  still inside wait
  8   joins hit thread N              still inside wait
 etc.

You will notice that the ordering of loading finished and the wait (and of
setting finished and notify_all) is exactly what you would expect with
memory_order_seq_cst on each core; that is, there is no reordering going on
within either thread/core. It is even strictly ordered.

The reason libstdc++ perhaps exposes this is that its wait implementation
checks the predicate before it goes into the wait system call, or that the
time between the start of the notify_all call and the notifications going out
is shorter than the time between the atomic load of finished and the wait
system call.

Since on thread 1 the update of finished to 1 and the notify_all are not done
atomically (together), a thread could have read finished before the update and
get into the wait loop after the notifications have gone out. It is very
similar to a TOCTOU issue, because the real use of finished is the wait itself
rather than the comparison. What needs to happen is that the check of finished
plus the wait, and the setting of finished plus the notify, are done
atomically (together); right now there is only an atomic ordering of the two.
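To make that concrete, here is a minimal sketch of the pattern (not the
testcase's actual code: the worker/shutdownRacy/shutdownFixed/queueMutex names
are made up for the illustration, waitingThreads is left out, and only
finished, haveWork and work correspond to the names used above). It shows the
racy shutdown path and one way to make the set of finished atomic (together)
with the check-and-wait, namely by taking the mutex around the store:

#include <atomic>
#include <condition_variable>
#include <deque>
#include <mutex>
#include <thread>

std::mutex queueMutex;
std::condition_variable haveWork;
std::deque<int> work;                 // stand-in for the real work queue
std::atomic<bool> finished{false};

// Worker side (thread/core N in the table): the mutex is held from the
// predicate check through the call to wait(), which releases it atomically.
void worker()
{
    std::unique_lock<std::mutex> lock(queueMutex);
    while (true) {
        if (finished.load() && work.empty())  // times 1-3 in the table
            return;
        if (!work.empty()) {
            work.pop_front();                 // pretend to do the work
            continue;
        }
        haveWork.wait(lock);                  // time 4+: can sleep forever if
                                              // the notify already went out
    }
}

// Racy shutdown (thread/core 1 in the table): finished is stored and
// notify_all() fires without holding queueMutex, so both can land in the
// window after the worker loaded finished == false but before it entered
// wait(). That notification is then lost.
void shutdownRacy()
{
    finished.store(true);    // time 2
    haveWork.notify_all();   // times 3-5
}

// One possible fix: store finished while holding the same mutex the waiter
// holds between its check and its wait(). Now either the waiter sees
// finished == true before waiting, or it is already blocked inside wait()
// (mutex released) when the store happens, so the notify cannot be missed.
void shutdownFixed()
{
    {
        std::lock_guard<std::mutex> lock(queueMutex);
        finished.store(true);
    }
    haveWork.notify_all();   // the notify itself can stay outside the lock
}

int main()
{
    std::thread t(worker);
    shutdownFixed();         // with shutdownRacy() the join below can hang
    t.join();
}

The point is not which overload of wait is used but that the store to finished
happens while holding the same mutex the waiter holds between its check and
its wait(); the notify_all itself can stay outside the lock.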