https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110016

--- Comment #14 from Rachel Mant <rachel at rachelmant dot com> ---
(In reply to Andrew Pinski from comment #12)
> Let me try again to show the exact sequence of events explaining why I think
> this is not a libstdc++/GCC bug.
> 
> 
> time  thread/core 1                     thread/core N
>  -1                                     grab the mutex
>   0   atomically load waitingThreads    atomically increment waitingThreads
>   1   compare waitingThreads            atomically load finished
>   2   atomically set finished to 1      atomically load work.empty() (queueLength)
>   3   start of notify_all               branch on finished/queueLength
>   4   ... (some code before) ...        start of haveWork.wait
>   5   notifies all threads finished     ... (some more code before) ...
>   6   ...                               waiting now
>   7   start of joins                    still inside wait
>   8   joins hit thread N                still inside wait
> etc.
> 
> You will notice that the load of finished and the wait (and the setting of
> finished and the notify_all) are ordered exactly as you would expect with
> memory_order_seq_cst on each core; that is, there is no reordering going on
> within either thread/core. The execution is, in fact, still strictly ordered.
> 
> The reason libstdc++ perhaps exposes this is that the wait implementation
> checks the predicate before it enters the wait system call, or that the time
> between the start of the notify_all call and the notifications going out is
> shorter than the time between the atomic load of finished and the wait
> system call.
> 
> Since, on thread 1, updating finished to 1 and the notify_all are not done
> atomically (together), a thread could have read finished before the update
> and entered the wait after the notifications have gone out.
> 
> It is very similar to a TOCTOU issue, because the "use" of finished is the
> wait itself rather than the comparison, and the setting of finished and the
> notify need to be done atomically (together); right now there is only an
> atomic ordering of the two.

Thank you for the clear run-through of the series of events you see leading to
the deadlock. That's very helpful.
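
To make sure we're reading the race the same way, here is a minimal sketch of
the pattern as we understand it. The names (workMutex, shutDown, and so on)
are hypothetical, loosely mirroring the pool rather than quoting the actual
code, and the condition is reduced to the single finished flag:

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

std::atomic<bool> finished{false};
std::mutex workMutex;
std::condition_variable haveWork;

void worker()
{
	std::unique_lock lock{workMutex};
	// The "check" half of the TOCTOU pair.
	while (!finished.load(std::memory_order_seq_cst))
		// The "use" half. finished is not written under workMutex, so the
		// store and notify_all below can both land between the load above
		// and this wait() enqueuing the thread - the wakeup is then lost
		// and the thread sleeps forever.
		haveWork.wait(lock);
}

void shutDown()
{
	// Two separate events, not one atomic one.
	finished.store(true, std::memory_order_seq_cst);
	haveWork.notify_all();
}

int main()
{
	std::thread thread{worker};
	shutDown(); // blocks forever in join() iff the store+notify win the race
	thread.join();
}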

To properly understand this problem space: why do you think locking the mutex
before setting `finished` is sufficient to fix this? It feels to us like it
shouldn't be sufficient, and would only mask the bug by making it less likely
to trigger.
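
For concreteness, the change we understand to be proposed looks roughly like
this (same hypothetical sketch as above, not the real pool code):

void shutDown()
{
	{
		// Proposed fix: write finished under workMutex. The idea, as we
		// understand it, is that a waiter holds workMutex continuously
		// from its predicate check until wait() atomically releases it,
		// so this store can no longer land inside that check-to-wait
		// window.
		std::lock_guard lock{workMutex};
		finished.store(true, std::memory_order_seq_cst);
	}
	haveWork.notify_all();
}

That is the mechanism we would like to understand better.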
