http://www.gossamer-threads.com/lists/linux/kernel/993505



fmayhar at google

Nov 5, 2008, 5:58 PM

Post #1 of 3 (11 views)
Permalink
Re: regression introduced by - timers: fix itimer/many thread hang

On Tue, 2008-10-28 at 14:38 -0400, Doug Chapman wrote:
> On Mon, 2008-10-27 at 11:39 -0700, Frank Mayhar wrote:
> > On Wed, 2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > Unable to handle kernel paging request at virtual address
> > > 94949494949494a4
> >
> > I take it this can be read as an uninitialized (or cleared) pointer?
> >
> > It certainly looks like this is a race in thread (process?) teardown. I
> > don't have hardware on which to reproduce this but _looks_ like another
> > thread has gotten in and torn down the process while we've been busy.
>
> I finally managed to get kdump working and caught this in the act. I
> still need to dig into this more but I think these 2 threads will show
> us the race condition. Note that this is a slightly hacked kernel in
> that I removed "static" from a few functions to better see what was
> going on but no real functional changes when compared to a recent (day
> old or so) git pull from Linus's tree.

After digging through this a bit, I've concluded that it's probably a
race between process reap and the dequeue_entity() call to update_curr()
combined with a side effect of the slab debug stuff. The
account_group_exec_runtime() routine (like the rest of these routines)
checks tsk->signal and tsk->signal->cputime.totals for NULL to make sure
they're still valid. It looks like at this point tsk->signal is valid
(since the tsk->signal->cputime dereference succeeded) but
tsk->signal->cputime.totals is invalid. That can't happen unless the
process is being reaped, and in fact can only happen in a narrow window
in __cleanup_signal() between the call to thread_group_cputime_free()
and the kmem_cache_free() of the signal struct itself. Without the slab
debug stuff it would either skip the update (by noticing that pointers
were NULL) or blithely update freed structures.

I can't see anything that would prevent this from happening in the
general case. I don't see what would stop the parent from coming in on
another CPU and reaping the process after schedule() has picked it off
the rq but before the deactivate_task->dequeue_task->dequeue_entity->
update_curr chain completes. I see that schedule() disables preemption
on that CPU but will that really do the job? I also suspect that with
hyperthreading both of these processes are on the same silicon, meaning
that one can be unexpectedly suspended while the other runs, thereby
making the window wider.

Unfortunately I don't know the code well enough to know what the right
fix is. Maybe account_group_exec_runtime() should check for PF_EXITED?
Maybe update_curr() should do that? I think that it makes more sense
for dequeue_entity() to do the check and then not call update_curr() if
the task is exiting but I defer to others with more knowledge of this
area.
--
Frank Mayhar <fmayhar[at]google.com>
Google, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo[at]vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


peterz at infradead

Nov 6, 2008, 3:03 AM

Post #2 of 3 (1 views)
Permalink
Re: regression introduced by - timers: fix itimer/many thread hang [In reply to]

On Wed, 2008-11-05 at 17:58 -0800, Frank Mayhar wrote:
> On Tue, 2008-10-28 at 14:38 -0400, Doug Chapman wrote:
> > On Mon, 2008-10-27 at 11:39 -0700, Frank Mayhar wrote:
> > > On Wed, 2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > > Unable to handle kernel paging request at virtual address
> > > > 94949494949494a4
> > >
> > > I take it this can be read as an uninitialized (or cleared) pointer?
> > >
> > > It certainly looks like this is a race in thread (process?) teardown. I
> > > don't have hardware on which to reproduce this but _looks_ like another
> > > thread has gotten in and torn down the process while we've been busy.
> >
> > I finally managed to get kdump working and caught this in the act. I
> > still need to dig into this more but I think these 2 threads will show
> > us the race condition. Note that this is a slightly hacked kernel in
> > that I removed "static" from a few functions to better see what was
> > going on but no real functional changes when compared to a recent (day
> > old or so) git pull from Linus's tree.
>
> After digging through this a bit, I've concluded that it's probably a
> race between process reap and the dequeue_entity() call to update_curr()
> combined with a side effect of the slab debug stuff. The
> account_group_exec_runtime() routine (like the rest of these routines)
> checks tsk->signal and tsk->signal->cputime.totals for NULL to make sure
> they're still valid. It looks like at this point tsk->signal is valid
> (since the tsk->signal->cputime dereference succeeded) but
> tsk->signal->cputime.totals is invalid. That can't happen unless the
> process is being reaped, and in fact can only happen in a narrow window
> in __cleanup_signal() between the call to thread_group_cputime_free()
> and the kmem_cache_free() of the signal struct itself. Without the slab
> debug stuff it would either skip the update (by noticing that pointers
> were NULL) or blithely update freed structures.
>
> I can't see anything that would prevent this from happening in the
> general case. I don't see what would stop the parent from coming in on
> another CPU and reaping the process after schedule() has picked it off
> the rq but before the deactivate_task->dequeue_task->dequeue_entity->
> update_curr chain completes. I see that schedule() disables preemption
> on that CPU but will that really do the job? I also suspect that with
> hyperthreading both of these processes are on the same silicon, meaning
> that one can be unexpectedly suspended while the other runs, thereby
> making the window wider.
>
> Unfortunately I don't know the code well enough to know what the right
> fix is. Maybe account_group_exec_runtime() should check for PF_EXITED?

That is just plain ugly. The right fix to me seems to destroy the
signal/thread group stuff _after_ the last task goes away.

> Maybe update_curr() should do that? I think that it makes more sense
> for dequeue_entity() to do the check and then not call update_curr() if
> the task is exiting but I defer to others with more knowledge of this
> area.

Hell no. Its a race in this signal/thread group stuff, fix it there.

But now you made me look at this patch:

commit f06febc96ba8e0af80bcc3eaec0a109e88275fac
Author: Frank Mayhar <fmayhar[at]google.com>
Date: Fri Sep 12 09:54:39 2008 -0700

timers: fix itimer/many thread hang


and I'm thinking you just rendered big iron unbootable.

You replaced a loop-over-threads by a loop-over-cpus, on every tick. Did
you stop to think what would happen on 4096 cpu machines?

Also, you just introduced per-cpu allocations for each thread-group,
while Christoph is reworking the per-cpu allocator, with one unfortunate
side-effect - its going to have a limited size pool. Therefore this will
limit the number of thread-groups we can have.

Hohumm, not at all liking this..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo[at]vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


oleg at redhat

Nov 6, 2008, 4:20 AM

Post #3 of 3 (1 views)
Permalink
Re: regression introduced by - timers: fix itimer/many thread hang [In reply to]

> Begin forwarded message:
>
> On Tue, 2008-10-28 at 14:38 -0400, Doug Chapman wrote:
> > On Mon, 2008-10-27 at 11:39 -0700, Frank Mayhar wrote:
> > > On Wed, 2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > > Unable to handle kernel paging request at virtual address
> > > > 94949494949494a4
> > >
> > > I take it this can be read as an uninitialized (or cleared) pointer?
> > >
> > > It certainly looks like this is a race in thread (process?) teardown. I
> > > don't have hardware on which to reproduce this but _looks_ like another
> > > thread has gotten in and torn down the process while we've been busy.
> >
> > I finally managed to get kdump working and caught this in the act. I
> > still need to dig into this more but I think these 2 threads will show
> > us the race condition. Note that this is a slightly hacked kernel in
> > that I removed "static" from a few functions to better see what was
> > going on but no real functional changes when compared to a recent (day
> > old or so) git pull from Linus's tree.
>
> After digging through this a bit, I've concluded that it's probably a
> race between process reap and the dequeue_entity() call to update_curr()
> combined with a side effect of the slab debug stuff. The
> account_group_exec_runtime() routine (like the rest of these routines)
> checks tsk->signal and tsk->signal->cputime.totals for NULL to make sure
> they're still valid. It looks like at this point tsk->signal is valid
> (since the tsk->signal->cputime dereference succeeded) but
> tsk->signal->cputime.totals is invalid. That can't happen unless the
> process is being reaped,

Frank, currently I don't have the source code which I can look at,
so I am probably wrong... But just in case, perhaps we can do

- account_group_exec_runtime(...);
+ if (lock_task_sighand(...)) {
+ account_group_exec_runtime(...);
+ unlock_task_sighand();
+ }

?

Once we take ->siglock the task can't be reaped, and ->signal becomes
stable and != NULL.

Oleg.

Reply via email to