Re: [lldb-dev] The lit test driver gets killed because of SIGHUP

Raphael Isemann via lldb-dev Wed, 05 Dec 2018 05:01:46 -0800

@Jonas: Did you confirm it is SIGHUP? I remember that we were not sure
whether the signal kind was SIGHUP or SIGINT.


- Raphael
On Wed, Dec 5, 2018 at 10:25 AM Pavel Labath via lldb-dev
<lldb-dev@lists.llvm.org> wrote:
>
> On 05/12/2018 03:49, Jonas Devlieghere via lldb-dev wrote:
> > Hi everyone,
> >
> > Since we switched to lit as the test driver we've been seeing it getting 
> > killed as the result of a SIGHUP signal. The problem doesn't reproduce on 
> > every machine and there seems to be a correlation between number of 
> > occurrences and thread count.
> >
> > Davide and Raphael spent some time narrowing down what particular test is 
> > causing this and it seems that TestChangeProcessGroup.py is always 
> > involved. However it never reproduces when running just this test. I was 
> > able to reproduce pretty consistently with the following filter:
> >
> > ./bin/llvm-lit ../llvm/tools/lldb/lit/Suite/ --filter="process"
> >
> > Bisecting the test itself didn't help much; the problem reproduces as soon 
> > as we attach to the inferior.
> >
> > At this point it is still not clear who is sending the SIGHUP and why it's 
> > reaching the lit test driver. Fred suggested that it might have something 
> > to do with process groups (which would be an interesting coincidence given 
> > the previously mentioned test) and he suggested having the test run in 
> > different process groups. Indeed, adding a call to os.setpgrp() in lit's 
> > executeCommand and having a different process group per test prevent us 
> > from seeing this. Regardless of this issue I think it's reasonable to have 
> > tests run in their own process group, so if nobody objects I propose adding 
> > this to lit in llvm.
> >
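A minimal sketch of the per-test process group idea described above. This is not lit's actual code: the helper name run_in_own_process_group is hypothetical, and passing preexec_fn=os.setpgrp to subprocess.Popen is just one way to have the child call os.setpgrp() before it execs the test command.

```python
import os
import subprocess


def run_in_own_process_group(argv):
    """Run argv as a child that leads its own process group.

    Hypothetical helper for illustration only; lit's executeCommand
    is structured differently.
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        # Runs in the child between fork and exec, so the child's
        # process group id becomes its own pid.
        preexec_fn=os.setpgrp,
    )
    out, _ = proc.communicate()
    return proc.pid, proc.returncode, out.decode()
```

Signals sent to the driver's process group no longer reach such a child, and signals sent to the child's group no longer reach the driver. Note that the Python documentation warns that preexec_fn is not safe in the presence of threads; Python 3.11 and later also accept process_group=0 on Popen for the same effect.
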
> > Still, I'd like to understand where the signal is coming from and fix the 
> > root cause in addition to the symptom. Maybe someone here has an idea of 
> > what might be going on?
> >
> > Thanks,
> > Jonas
> >
> > PS
> >
> > 1. There are two places where we send a SIGHUP ourselves. With that code 
> > removed we still receive the signal, which suggests that it might be coming 
> > from Python or the OS.
> > 2. If you're able to reproduce, you'll see that adding an early return 
> > before the attach in TestChangeProcessGroup.py hides/prevents the problem. 
> > Move the return down one line and it pops up again.
> > _______________________________________________
> > lldb-dev mailing list
> > lldb-dev@lists.llvm.org
> > http://lists.llvm.org/cgi-bin/mailman/listinfo/lldb-dev
> >
>
> Hi Jonas,
>
> Sounds like you have found an interesting issue to debug. I've tried
> running the command you mention locally, and I didn't see any failures
> in 100 runs.
>
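For anyone repeating the "run it 100 times" experiment, a small loop like the following could count how often the driver dies from a given signal. It is only a sketch: count_signal_deaths is a hypothetical helper, and the argv you pass (for example the llvm-lit command line from the original report) is up to you.

```python
import signal
import subprocess


def count_signal_deaths(argv, runs, signum=signal.SIGHUP):
    """Run argv `runs` times and count runs killed by `signum`."""
    killed = 0
    for _ in range(runs):
        result = subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a returncode of -N means the child was
        # terminated by signal N.
        if result.returncode == -signum:
            killed += 1
    return killed
```

Passing signum=signal.SIGINT to the same loop would help tell SIGHUP deaths apart from SIGINT deaths.
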
> There doesn't seem to be anything in TestChangeProcessGroup which
> sends a signal, though I can imagine that the act of changing a process
> group mid-debug could be enough to confuse something into sending it. However,
> I am having trouble reconciling this with your PS #2, because if
> attaching is sufficient to trigger this (i.e., no group changing takes
> place), then this test is not much different than any other test where
> we spawn an inferior and then attach to it.
>
> I am aware of one other instance where we send a spurious signal, though
> it's SIGINT in this case
> <https://github.com/llvm-mirror/lldb/blob/master/source/Plugins/Process/gdb-remote/ProcessGDBRemote.cpp#L3645>.
> The issue there is that we don't check whether the debug server has
> already exited before we send SIGINT to it (it normally exits on its own
> at the end of a debug session). So if the debug server does exit and its pid
> gets recycled before we get a chance to send this signal, we can end up
> killing a random process.
>
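The missing "has it already exited?" check can be illustrated in Python. This is an analogy, not the lldb C++ code: interrupt_if_running is a hypothetical helper, and it assumes the target is a direct child of the sending process.

```python
import signal
import subprocess


def interrupt_if_running(proc):
    """Send SIGINT to a direct child only if it has not exited.

    Returns True if the signal was sent, False if the child was
    already gone. Hypothetical helper for illustration.
    """
    # poll() reaps the child if it has exited and records its
    # status, so a non-None result means the pid may be reused.
    if proc.poll() is not None:
        return False
    proc.send_signal(signal.SIGINT)
    return True
```

For a direct child this check is enough, because the kernel keeps the pid reserved until the parent has reaped the child. It does not help if the pid was obtained some other way and the process belongs to someone else.
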
> Now this may seem unrelated to your issue, but SIGHUP is sent
> automatically as a result of a process losing its controlling tty. So,
> if that SIGINT ends up killing the process holding the master end of a
> pty, this could result in some SIGHUPs being sent too. Unfortunately,
> this doesn't fully stack up either, because the process holding the
> master pty is probably a long-lived one, so its pid is unlikely to match
> one of the transient debugserver pids. Nevertheless, it could be worth
> just commenting out that line and seeing what happens.
>
> For debugging, maybe you could try installing a SIGHUP handler into the
> lit process, which would dump the received siginfo_t structure. Decoding
> that may provide some insight into who is sending that signal (si_pid)
> and why (si_code).
>
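A rough sketch of the siginfo idea in Python. A plain signal.signal() handler does not receive siginfo_t, so this version blocks SIGHUP and collects it with signal.sigtimedwait() instead. The function names are hypothetical, sigtimedwait() is not available on every platform (macOS lacks it, as far as I know), and wiring this into lit would need more care than shown here.

```python
import signal


def block_sighup():
    """Block SIGHUP in the calling thread.

    Call this before other threads are started so that they
    inherit the mask and the signal stays pending.
    """
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGHUP})


def wait_for_sighup(timeout):
    """Wait up to `timeout` seconds for a pending SIGHUP.

    Returns a dict with the sender details, or None on timeout.
    """
    info = signal.sigtimedwait({signal.SIGHUP}, timeout)
    if info is None:
        return None
    return {
        "si_signo": info.si_signo,
        "si_code": info.si_code,  # why the signal was generated
        "si_pid": info.si_pid,    # who sent it, when sent by a process
        "si_uid": info.si_uid,
    }
```

If si_pid turns out to be a real process, looking it up while the test suite is still running should show who the sender is.
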
> As for adding process group support into lit, I think that having each
> test run (*not* each executed command) in its own group is reasonable.
> However, be aware that this changes how all signals (in
> particular the SIGINT you get when typing ^C) get delivered. AFAIK, lit
> doesn't have any special code for cleaning up the spawned processes and
> relies on the fact that ^C will send a SIGINT to the entire "foreground
> process group" and terminate stuff. If you start creating a bunch of
> process groups, you may need to add more explicit termination logic too.
>
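To make the "explicit termination logic" point concrete, here is a sketch of what a driver would have to do once tests live in their own groups. Both helper names are hypothetical and none of this is taken from lit.

```python
import os
import signal
import subprocess


def spawn_in_new_group(argv):
    """Start argv as the leader of a new process group."""
    return subprocess.Popen(argv, preexec_fn=os.setpgrp)


def terminate_group(proc, sig=signal.SIGTERM, timeout=5.0):
    """Signal every process in proc's group, then reap proc.

    Relies on proc having been started by spawn_in_new_group, so
    that its process group id equals its pid.
    """
    try:
        os.killpg(proc.pid, sig)
    except ProcessLookupError:
        # The whole group is already gone.
        pass
    return proc.wait(timeout=timeout)
```

A driver would call something like terminate_group() for each running test from its own SIGINT handling, since ^C would now only reach the driver's foreground process group.
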
> cheers,
> pl
