On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> I've been finding egdb and gdb rather easily get stuck in an
> uninterruptible wait, e.g. when running the 'next' command after
> hitting a breakpoint.
>
> So it's not possible to kill the debuggee or gdb and the only way to
> kill the debuggee process and free up its listening sockets seems to be
> to reboot the entire system.
>
> Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> the uninterruptible gdb, so I don't know for sure what syscall is being
> run that is getting stuck.
>
> The debuggee is a local build of the flightgear flight simulator.
>
> Here's the output of ps for the debugger and debuggee:
>
> 12419 p0 D 0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set print thread-events off -ex set print pretty on -ex run --args build-walk/fgfs,clang,debug,opt,co
> 63921 p0 TX+ 0:42.45 /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe --airport=egtk (fgfs,clang,debug)
>
> I've tried using ktrace on egdb, and the kdump output ends like this:
>
> 53950 egdb CALL wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
> 53950 egdb RET wait4 97562/0x17d1a
> 53950 egdb CALL ptrace(PT_GET_PROCESS_STATE,97562,0x7f7ffffe8ef0,12)
> 53950 egdb RET ptrace 0
> 53950 egdb CALL ptrace(PT_GETREGS,161560,0x7f7ffffe8b40,0)
> 53950 egdb RET ptrace 0
> 53950 egdb CALL futex(0x6444e37c490,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
> 53950 egdb RET futex 0
> 53950 egdb CALL futex(0x644bef12740,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
> 53950 egdb RET futex 0
> 53950 egdb CALL ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
> 53950 egdb RET ptrace 0
> 53950 egdb CALL ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
> 53950 egdb RET ptrace 0
> 53950 egdb CALL ptrace(PT_STEP,97562,0x1,0)
> 53950 egdb RET ptrace 0
> 53950 egdb CALL read(6,0x7f7ffffe9187,0x1)
> 53950 egdb RET read -1 errno 35 Resource temporarily unavailable
> 53950 egdb CALL poll(0x6441581e720,3,0)
> 53950 egdb STRU struct pollfd [3] { fd=4, events=0x1<POLLIN>, revents=0<> } { fd=6, events=0x1<POLLIN>, revents=0<> } { fd=10, events=0x1<POLLIN>, revents=0<> }
> 53950 egdb RET poll 0
> 53950 egdb CALL wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
>
> Assuming that this is the actual end of the ktrace output and there
> isn't some missing ktrace output in a buffer somewhere, this looks
> like egdb is simply blocked in wait4(), which should be harmless and
> certainly not uninterruptible?

The single-thread check done by wait4() is non-interruptible.
When the debugger gets stuck, is it blocked in "suspend" state?

However, I think there is a bug in the single-thread switch code.
It looks like ps_singlecount can be decremented too many times. This is
probably a regression from making ps_singlecount unsigned and letting
single_thread_check() run without the kernel lock.
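
A minimal userland sketch of why one decrement too many is harmful for an
unsigned counter. This is not the kernel code: dec_nv() is a hypothetical,
non-atomic stand-in for atomic_dec_int_nv(), and count_wakeups() is an
illustrative helper that only models "wake the waiter when the count
reaches zero".

```c
#include <limits.h>

/*
 * Hypothetical, non-atomic stand-in for the kernel's atomic_dec_int_nv():
 * decrement *p and return the new value.
 */
static unsigned int
dec_nv(unsigned int *p)
{
	return --(*p);
}

/*
 * Start a counter at `expected` and apply `ndecs` decrements.
 * Returns how many times the "reached zero, wake the waiter" condition
 * fired; the final counter value is stored in *final.
 */
static int
count_wakeups(unsigned int expected, unsigned int ndecs, unsigned int *final)
{
	unsigned int count = expected;
	unsigned int i;
	int wakeups = 0;

	for (i = 0; i < ndecs; i++) {
		if (dec_nv(&count) == 0)
			wakeups++;
	}
	*final = count;
	return wakeups;
}
```

With expected = 2 and three decrements, the counter passes through zero and
wraps to UINT_MAX. With expected = 0 and a single early decrement, it never
equals zero at all, so no wakeup is issued. Code that waits for the counter
to drop to zero would then keep waiting; whether this is exactly what
happens in the kernel is an assumption here.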

The bug might go away if single_thread_check() made sure that
P_SUSPSINGLE is set before the thread suspends.

Does the following patch help? Even if it does, it probably needs
some refining.

Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c 15 Jun 2020 13:18:33 -0000 1.258
+++ kern/kern_sig.c 20 Jul 2020 04:27:30 -0000
@@ -1915,16 +1915,23 @@ single_thread_check(struct proc *p, int
return (EINTR);
}
- if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
- wakeup(&pr->ps_singlecount);
+ SCHED_LOCK(s);
+ if (p->p_flag & P_SUSPSINGLE) {
+ if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
+ wakeup(&pr->ps_singlecount);
+ } else if ((p->p_flag & P_WEXIT) == 0) {
+ SCHED_UNLOCK(s);
+ CPU_BUSY_CYCLE();
+ continue;
+ }
if (pr->ps_flags & PS_SINGLEEXIT) {
+ SCHED_UNLOCK(s);
KERNEL_LOCK();
exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
- KERNEL_UNLOCK();
+ /* NOTREACHED */
}
/* not exiting and don't need to unwind, so suspend */
- SCHED_LOCK(s);
p->p_stat = SSTOP;
mi_switch();
SCHED_UNLOCK(s);