On Mon, Jul 20, 2020 at 04:35:12AM +0000, Visa Hankala wrote:
> On Sun, Jul 19, 2020 at 09:47:54PM +0100, Julian Smith wrote:
> > I've been finding egdb and gdb rather easily get stuck in an
> > uninterruptible wait, e.g. when running the 'next' command after
> > hitting a breakpoint.
> >
> > So it's not possible to kill the debuggee or gdb and the only way to
> > kill the debuggee process and free up its listening sockets seems to be
> > to reboot the entire system.
> >
> > Perhaps unsurprisingly one cannot attach a second invocation of gdb to
> > the uninterruptible gdb, so i don't know for sure what syscall is being
> > run that is getting stuck.
> >
> > The debuggee is a local build of the flightgear flight simulator.
> >
> > Here's the output of ps for the debugger and debuggee:
> >
> > 12419 p0 D 0:34.37 egdb -ex handle SIGPIPE noprint nostop -ex set
> > print thread-events off -ex set print pretty on -ex run --args
> > build-walk/fgfs,clang,debug,opt,co
> > 63921 p0 TX+ 0:42.45
> > /home/jules/flightgear/build-walk/fgfs,clang,debug,opt,compositor,osg.exe
> > --airport=egtk (fgfs,clang,debug)
> >
> > I've tried using ktrace on egdb, and the kdump output ends like this:
> >
> > 53950 egdb CALL wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
> > 53950 egdb RET wait4 97562/0x17d1a
> > 53950 egdb CALL ptrace(PT_GET_PROCESS_STATE,97562,0x7f7ffffe8ef0,12)
> > 53950 egdb RET ptrace 0
> > 53950 egdb CALL ptrace(PT_GETREGS,161560,0x7f7ffffe8b40,0)
> > 53950 egdb RET ptrace 0
> > 53950 egdb CALL
> > futex(0x6444e37c490,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
> > 53950 egdb RET futex 0
> > 53950 egdb CALL
> > futex(0x644bef12740,0x82<FUTEX_WAKE|FUTEX_PRIVATE_FLAG>,1,0,0)
> > 53950 egdb RET futex 0
> > 53950 egdb CALL ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
> > 53950 egdb RET ptrace 0
> > 53950 egdb CALL ptrace(PT_IO,97562,0x7f7ffffe8a30,0)
> > 53950 egdb RET ptrace 0
> > 53950 egdb CALL ptrace(PT_STEP,97562,0x1,0)
> > 53950 egdb RET ptrace 0
> > 53950 egdb CALL read(6,0x7f7ffffe9187,0x1)
> > 53950 egdb RET read -1 errno 35 Resource temporarily unavailable
> > 53950 egdb CALL poll(0x6441581e720,3,0)
> > 53950 egdb STRU struct pollfd [3] { fd=4, events=0x1<POLLIN>,
> > revents=0<> } { fd=6, events=0x1<POLLIN>, revents=0<> } { fd=10,
> > events=0x1<POLLIN>, revents=0<> }
> > 53950 egdb RET poll 0
> > 53950 egdb CALL wait4(WAIT_ANY,0x7f7ffffe8efc,0<>,0)
> >
> > Assuming that this is the actual end of the ktrace output and there
> > isn't some missing ktrace output in a buffer somewhere, this looks
> > like egdb is simply blocked in wait4(), which should be harmless and
> > certainly not uninterruptable?
>
> The single-thread check done by wait4() is non-interruptible.
> When the debugger gets stuck, is it blocked in "suspend" state?
>
> However, I think there is a bug in the single-thread switch code.
> It looks that ps_singlecount can be decremented too much. This probably
> is a regression of making ps_singlecount unsigned and letting
> single_thread_check() run without the kernel lock.
>
> The bug might go away if single_thread_check() made sure that
> P_SUSPSINGLE is set before the thread suspends.
Below is an updated patch for testing. It extends the scope of
SCHED_LOCK() so that there are fewer chances of interleaving of
single_thread_set() and single_thread_check().
One problem is that once single_thread_set() sets ps_single, the other
threads can enter the suspension code in single_thread_check(). The
extended lock scope prevents the threads from taking action before
single_thread_set() has finished with the state updates.
Index: kern/kern_sig.c
===================================================================
RCS file: src/sys/kern/kern_sig.c,v
retrieving revision 1.258
diff -u -p -r1.258 kern_sig.c
--- kern/kern_sig.c 15 Jun 2020 13:18:33 -0000 1.258
+++ kern/kern_sig.c 20 Jul 2020 13:29:50 -0000
@@ -1915,16 +1915,17 @@ single_thread_check(struct proc *p, int
return (EINTR);
}
+ SCHED_LOCK(s);
if (atomic_dec_int_nv(&pr->ps_singlecount) == 0)
wakeup(&pr->ps_singlecount);
if (pr->ps_flags & PS_SINGLEEXIT) {
+ SCHED_UNLOCK(s);
KERNEL_LOCK();
exit1(p, 0, 0, EXIT_THREAD_NOCHECK);
- KERNEL_UNLOCK();
+ /* NOTREACHED */
}
/* not exiting and don't need to unwind, so suspend */
- SCHED_LOCK(s);
p->p_stat = SSTOP;
mi_switch();
SCHED_UNLOCK(s);
@@ -1950,7 +1951,7 @@ single_thread_set(struct proc *p, enum s
{
struct process *pr = p->p_p;
struct proc *q;
- int error;
+ int error, s;
KERNEL_ASSERT_LOCKED();
KASSERT(curproc == p);
@@ -1974,26 +1975,22 @@ single_thread_set(struct proc *p, enum s
panic("single_thread_mode = %d", mode);
#endif
}
+ SCHED_LOCK(s);
pr->ps_singlecount = 0;
membar_producer();
pr->ps_single = p;
TAILQ_FOREACH(q, &pr->ps_threads, p_thr_link) {
- int s;
-
if (q == p)
continue;
if (q->p_flag & P_WEXIT) {
if (mode == SINGLE_EXIT) {
- SCHED_LOCK(s);
if (q->p_stat == SSTOP) {
setrunnable(q);
atomic_inc_int(&pr->ps_singlecount);
}
- SCHED_UNLOCK(s);
}
continue;
}
- SCHED_LOCK(s);
atomic_setbits_int(&q->p_flag, P_SUSPSINGLE);
switch (q->p_stat) {
case SIDL:
@@ -2027,8 +2024,8 @@ single_thread_set(struct proc *p, enum s
signotify(q);
break;
}
- SCHED_UNLOCK(s);
}
+ SCHED_UNLOCK(s);
if (mode != SINGLE_PTRACE)
single_thread_wait(pr, 1);