> On 3 Sep 2025, at 20:54, Iain Buclaw <ibuc...@gdcproject.org> wrote:
> 
> Excerpts from Iain Buclaw's message of September 3, 2025 9:19 pm:
>> Excerpts from Rainer Orth's message of September 3, 2025 10:20 am:
>>>>> 
>>>>> I regularly (but not always) see timeouts on Solaris, both on sparc and
>>>>> x86:
>>>>> 
>>>>> WARNING: libphobos.gc/forkgc2.d execution test program timed out.
>>>>> FAIL: libphobos.gc/forkgc2.d execution test
>>>>> WARNING: libphobos.gc/startbackgc.d execution test program timed out.
>>>>> FAIL: libphobos.gc/startbackgc.d execution test
>>> 
>>> I haven't tried investigating what's wrong on Solaris with those two,
>>> but they sure are annoying, especially since they are so unreliable:
>>> sometimes both PASS, sometimes only one or the other, sometimes neither.
>>> 
>>> I'd thought about skipping them on Solaris, too, just to avoid the noise
>>> and the timeouts, but haven't gotten around to that.
>>> 
>>> However, fixing this at the root would certainly be best.
>>> 
>> 
>> I currently have a gdb session on cfarm where the forkgc2 process has 
>> hung, and I'm just looking at the backtrace.
>> 
>> * There are 11 threads in total (main + 10 new'd Threads)
>> * All threads are suspended (in sigsuspend) except for two
>> * The first of those threads is the one that's requested all threads to 
>>  suspend using pthread_kill(SIGRTMIN), and is stuck inside a sem_wait 
>>  for one more call to sem_post().
>> * The second is stuck in a SpinLock.lock loop, called from 
>>  _prefork_handler() inside forkx() inside fork() - my guess would be 
>>  the handler being called is _d_gcx_atfork_prepare().
>> * Specific to Solaris, I've clocked this line in the forkx 
>>  implementation:
>> 
>> https://github.com/illumos/illumos-gate/blob/a21856a054bd854f39d1d55a6b0d547cb0d2039f/usr/src/lib/libc/port/threads/scalls.c#L177
>> 
>> I think what's going on is that the thread that wants to do a GC 
>> collection has issued a signal to all threads, but Solaris has called 
>> sigoff() in the last thread (the one calling fork()), so the signal 
>> never reaches it.
>> 
>> This behaviour does not change when COLLECT_FORK is disabled, so Solaris 
>> would still be affected.
>> 
> 
> I forgot to mention, thread #1 that wants to do a GC has control of the 
> SpinLock.  So that's why thread #2 is stuck in its current loop.
> 
> The order of operations that leads to Solaris hanging at runtime is:
> 1. Thread #1 calls GC.lockNR() and has hold of the global GC SpinLock.
> 2. Thread #2 calls fork(). It too calls GC.lockNR() in 
>   _d_gcx_atfork_prepare() and is waiting for the global lock.
> 3. Thread #1 decides to call thread_suspendAll() and will never release 
>   the GC lock until all threads are suspended.
> 4. Thread #2 will never suspend because Solaris has set sigoff() on it 
>   until the pthread_atfork prepare handler has returned (it won't).
> 
> It would appear that there should be some other fine grained lock to 
> prevent this kind of deadlock.

It’s not impossible to imagine something similar happening on Darwin
(i.e. masking signals during thread startup), but I have not poked at the
sources so far.
Iain
