https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99613
--- Comment #7 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
Swapping __cxa_guard_release with __cxa_atexit would "fix" the case where every thread of the user program accesses all the local statics in the same order, i.e. all threads first call f<0>(), then f<1>(), etc. Right now, even in that case it can happen that, say, thread0 wins the __cxa_guard_acquire for f<0>(), calls the constructor, releases the guard, then sleeps; thread1 sees that f<0>() is already initialized, wins the __cxa_guard_acquire for f<1>(), calls the constructor, releases the guard and calls __cxa_atexit; then thread0 wakes up and calls __cxa_atexit. The destructor of f<0>() is thus registered after that of f<1>() and will run before it, although f<0>() was constructed first.

But, unless I'm misreading the testcase, that is not what the test does: there each thread calls just one of the f<N>() functions, so I don't see how the swapping would help. Only a global lock shared by all the local statics would fix that, and I think there is no way we are willing to do that (Jon, please correct me if I'm wrong).
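
To make the interleavings above concrete, here is a small single-threaded replay of them. It is only a sketch under stated assumptions: AtexitModel, Order and the replay_* functions are hypothetical names invented for illustration, not GCC or libstdc++ code. The model assumes only that handlers registered through __cxa_atexit run in reverse order of registration. In every scenario the constructor of f<0>() completes before that of f<1>(), so the correct destruction order would be f<1>, f<0>.

```cpp
#include <cassert>
#include <string>
#include <vector>

using Order = std::vector<std::string>;

// Hypothetical stand-in for the handler list behind __cxa_atexit:
// handlers run in reverse order of registration.
struct AtexitModel {
    Order registered;
    void add(const std::string& name) { registered.push_back(name); }
    Order destruction_order() const {
        return Order(registered.rbegin(), registered.rend());
    }
};

// Current ordering (guard release, then __cxa_atexit); both threads
// access f<0>() and then f<1>().
Order replay_current_ordering() {
    AtexitModel m;
    // thread0: acquire guard of f<0>, construct, release guard, then sleep
    // thread1: sees f<0> initialized; acquire guard of f<1>, construct, release
    m.add("f<1>");  // thread1: __cxa_atexit
    m.add("f<0>");  // thread0 wakes up: __cxa_atexit
    return m.destruction_order();
}

// Swapped ordering (__cxa_atexit, then guard release); both threads
// access f<0>() and then f<1>().
Order replay_swapped_same_access_order() {
    AtexitModel m;
    // thread0: acquire guard of f<0>, construct
    m.add("f<0>");  // thread0: __cxa_atexit, only then release the guard
    // thread1 cannot get past f<0>() before that release
    m.add("f<1>");  // thread1: construct f<1>, __cxa_atexit, release
    return m.destruction_order();
}

// Swapped ordering, but each thread calls just one f<N>(), as in the
// testcase: the two guards are independent and order nothing.
Order replay_swapped_one_function_per_thread() {
    AtexitModel m;
    // thread0: acquire guard of f<0>, construct, then sleep
    // thread1: acquire guard of f<1>, construct
    m.add("f<1>");  // thread1: __cxa_atexit, release
    m.add("f<0>");  // thread0 wakes up: __cxa_atexit, release
    return m.destruction_order();
}

// Checks the three replays against the expected destruction orders.
void check_model() {
    const Order correct = {"f<1>", "f<0>"};
    assert(replay_current_ordering() != correct);
    assert(replay_swapped_same_access_order() == correct);
    assert(replay_swapped_one_function_per_thread() != correct);
}
```

Calling check_model() passes all three assertions in this model: the swap repairs the case where both threads use the same access order, but not the case where each thread calls a single f<N>().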