http://www.gossamer-threads.com/lists/linux/kernel/993505
|
|
 |
 |

fmayhar
at google
Nov 5, 2008, 5:58 PM
Post #1 of 3 (11 views)
Permalink
|
|
Re: regression introduced
by - timers: fix itimer/many thread hang
|
|
On Tue, 2008-10-28 at 14:38 -0400, Doug Chapman wrote:
> On Mon, 2008-10-27 at
11:39 -0700, Frank Mayhar wrote:
> > On Wed,
2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > Unable to
handle kernel paging request at virtual address
> > > 94949494949494a4
> >
> > I take it this can be read as an uninitialized (or cleared)
pointer?
> >
> > It certainly looks like this is a race in thread (process?)
teardown. I
> > don't have hardware on which to reproduce this but _looks_
like another
> > thread has gotten in and torn down the process while we've
been busy.
>
> I finally managed to get kdump working and caught this in the act.
I
> still need to dig into this more but I think these 2 threads will
show
> us the race condition. Note that this is a slightly hacked kernel
in
> that I removed "static" from a few functions to better see what
was
> going on but no real functional changes when compared to a recent
(day
> old or so) git pull from Linus's tree.
After digging through this a bit, I've concluded that it's probably a
race between process reap and the dequeue_entity() call to
update_curr()
combined with a side effect of the slab debug stuff. The
account_group_exec_runtime() routine (like the rest of these routines)
checks tsk->signal and tsk->signal->cputime.totals for NULL to
make sure
they're still valid. It looks like at this point tsk->signal is
valid
(since the tsk->signal->cputime dereference succeeded) but
tsk->signal->cputime.totals is invalid. That can't happen unless
the
process is being reaped, and in fact can only happen in a narrow window
in __cleanup_signal() between the call to thread_group_cputime_free()
and the kmem_cache_free() of the signal struct itself. Without the slab
debug stuff it would either skip the update (by noticing that pointers
were NULL) or blithely update freed structures.
I can't see anything that would prevent this from happening in the
general case. I don't see what would stop the parent from coming in on
another CPU and reaping the process after schedule() has picked it off
the rq but before the
deactivate_task->dequeue_task->dequeue_entity->
update_curr chain completes. I see that schedule() disables preemption
on that CPU but will that really do the job? I also suspect that with
hyperthreading both of these processes are on the same silicon, meaning
that one can be unexpectedly suspended while the other runs, thereby
making the window wider.
Unfortunately I don't know the code well enough to know what the right
fix is. Maybe account_group_exec_runtime() should check for PF_EXITED?
Maybe update_curr() should do that? I think that it makes more sense
for dequeue_entity() to do the check and then not call update_curr() if
the task is exiting but I defer to others with more knowledge of this
area.
--
Frank Mayhar <fmayhar[at]google.com>
Google, Inc.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in
the body of a message to majordomo[at]vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
|
|
|
 |
 |

peterz
at infradead
Nov 6, 2008, 3:03 AM
Post #2 of 3 (1 views)
Permalink
|
|
Re: regression introduced
by - timers: fix itimer/many thread hang [In
reply to]
|
|
On Wed, 2008-11-05 at 17:58 -0800, Frank Mayhar wrote:
> On Tue, 2008-10-28 at
14:38 -0400, Doug Chapman wrote:
> > On Mon,
2008-10-27 at 11:39 -0700, Frank Mayhar wrote:
> > > On Wed,
2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > >
Unable to handle kernel paging request at virtual address
> > > > 94949494949494a4
> > >
> > > I take it this can be read as an uninitialized (or
cleared) pointer?
> > >
> > > It certainly looks like this is a race in thread
(process?) teardown. I
> > > don't have hardware on which to reproduce this but
_looks_ like another
> > > thread has gotten in and torn down the process while
we've been busy.
> >
> > I finally managed to get kdump working and caught this in the
act. I
> > still need to dig into this more but I think these 2 threads
will show
> > us the race condition. Note that this is a slightly hacked
kernel in
> > that I removed "static" from a few functions to better see
what was
> > going on but no real functional changes when compared to a
recent (day
> > old or so) git pull from Linus's tree.
>
> After digging through this a bit, I've concluded that it's
probably a
> race between process reap and the dequeue_entity() call to
update_curr()
> combined with a side effect of the slab debug stuff. The
> account_group_exec_runtime() routine (like the rest of these
routines)
> checks tsk->signal and tsk->signal->cputime.totals for
NULL to make sure
> they're still valid. It looks like at this point tsk->signal is
valid
> (since the tsk->signal->cputime dereference succeeded) but
> tsk->signal->cputime.totals is invalid. That can't happen
unless the
> process is being reaped, and in fact can only happen in a narrow
window
> in __cleanup_signal() between the call to
thread_group_cputime_free()
> and the kmem_cache_free() of the signal struct itself. Without the
slab
> debug stuff it would either skip the update (by noticing that
pointers
> were NULL) or blithely update freed structures.
>
> I can't see anything that would prevent this from happening in the
> general case. I don't see what would stop the parent from coming
in on
> another CPU and reaping the process after schedule() has picked it
off
> the rq but before the
deactivate_task->dequeue_task->dequeue_entity->
> update_curr chain completes. I see that schedule() disables
preemption
> on that CPU but will that really do the job? I also suspect that
with
> hyperthreading both of these processes are on the same silicon,
meaning
> that one can be unexpectedly suspended while the other runs,
thereby
> making the window wider.
>
> Unfortunately I don't know the code well enough to know what the
right
> fix is. Maybe account_group_exec_runtime() should check for
PF_EXITED?
That is just plain ugly. The right fix to me seems to destroy the
signal/thread group stuff _after_ the last task goes away.
> Maybe update_curr() should
do that? I think that it makes more sense
> for dequeue_entity() to do the check and then not call
update_curr() if
> the task is exiting but I defer to others with more knowledge of
this
> area.
Hell no. Its a race in this signal/thread group stuff, fix it there.
But now you made me look at this patch:
commit f06febc96ba8e0af80bcc3eaec0a109e88275fac
Author: Frank Mayhar <fmayhar[at]google.com>
Date: Fri Sep 12 09:54:39 2008 -0700
timers: fix itimer/many thread hang
and I'm thinking you just rendered big iron unbootable.
You replaced a loop-over-threads by a loop-over-cpus, on every tick.
Did
you stop to think what would happen on 4096 cpu machines?
Also, you just introduced per-cpu allocations for each thread-group,
while Christoph is reworking the per-cpu allocator, with one
unfortunate
side-effect - its going to have a limited size pool. Therefore this
will
limit the number of thread-groups we can have.
Hohumm, not at all liking this..
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in
the body of a message to majordomo[at]vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
|
|
|
 |
 |

oleg
at redhat
Nov 6, 2008, 4:20 AM
Post #3 of 3 (1 views)
Permalink
|
|
Re: regression introduced
by - timers: fix itimer/many thread hang [In
reply to]
|
|
> Begin forwarded message:
>
> On Tue, 2008-10-28 at 14:38 -0400, Doug Chapman wrote:
> > On Mon,
2008-10-27 at 11:39 -0700, Frank Mayhar wrote:
> > > On Wed,
2008-10-22 at 13:03 -0400, Doug Chapman wrote:
> > > >
Unable to handle kernel paging request at virtual address
> > > > 94949494949494a4
> > >
> > > I take it this can be read as an uninitialized (or
cleared) pointer?
> > >
> > > It certainly looks like this is a race in thread
(process?) teardown. I
> > > don't have hardware on which to reproduce this but
_looks_ like another
> > > thread has gotten in and torn down the process while
we've been busy.
> >
> > I finally managed to get kdump working and caught this in the
act. I
> > still need to dig into this more but I think these 2 threads
will show
> > us the race condition. Note that this is a slightly hacked
kernel in
> > that I removed "static" from a few functions to better see
what was
> > going on but no real functional changes when compared to a
recent (day
> > old or so) git pull from Linus's tree.
>
> After digging through this a bit, I've concluded that it's
probably a
> race between process reap and the dequeue_entity() call to
update_curr()
> combined with a side effect of the slab debug stuff. The
> account_group_exec_runtime() routine (like the rest of these
routines)
> checks tsk->signal and tsk->signal->cputime.totals for
NULL to make sure
> they're still valid. It looks like at this point tsk->signal is
valid
> (since the tsk->signal->cputime dereference succeeded) but
> tsk->signal->cputime.totals is invalid. That can't happen
unless the
> process is being reaped,
Frank, currently I don't have the source code which I can look at,
so I am probably wrong... But just in case, perhaps we can do
- account_group_exec_runtime(...);
+ if (lock_task_sighand(...)) {
+ account_group_exec_runtime(...);
+ unlock_task_sighand();
+ }
?
Once we take ->siglock the task can't be reaped, and ->signal
becomes
stable and != NULL.
Oleg.
|
|
|
|