On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info,
> but the trace is below...
> I'll get an rc3 kernel built and ask the user to retest, but in case this
> isn't a known problem, I'm forwarding this here.
I can't fix it, but I can explain it.
> Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock:
> Feb 24 17:53:21 cirithungol kernel: (events){--..}, at: [<c0436f9a>]
> flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel:
> Feb 24 17:53:21 cirithungol kernel: but task is already holding lock:
> Feb 24 17:53:21 cirithungol kernel: (rtnl_mutex){--..}, at: [<c05cea31>]
> rtnetlink_rcv+0x12/0x26
> Feb 24 17:53:21 cirithungol kernel:
> Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new
> lock.
What's happening here is that linkwatch_work runs on the generic
schedule_work() workqueue, i.e. the global "events" workqueue shown in
the trace.
> Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}:
The function that is called is linkwatch_event(), which acquires the
RTNL as you can see here:
> Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}:
> Feb 24 17:53:21 cirithungol kernel: [<c04458f7>]
> __lock_acquire+0xa7c/0xbf4
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel: [<c04415dc>]
> tick_program_event+0x31/0x55
> Feb 24 17:53:21 cirithungol kernel: [<c0445ad9>] lock_acquire+0x6a/0x90
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel: [<c0638d21>]
> mutex_lock_nested+0xdb/0x271
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times
> Feb 24 17:53:21 cirithungol kernel: [<c05cf755>]
> linkwatch_event+0x8/0x22
The problem with that is that tulip_down() calls flush_scheduled_work()
while holding the RTNL:
> Feb 24 17:53:21 cirithungol kernel: [<c0436f9a>]
> flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel: [<c043702c>]
> flush_scheduled_work+0xd/0xf
> Feb 24 17:53:21 cirithungol kernel: [<f8f4380a>] tulip_down+0x20/0x1a3
> [tulip]
[...]
> Feb 24 17:53:21 cirithungol kernel: [<c05cea3d>]
> rtnetlink_rcv+0x1e/0x26
(rtnetlink_rcv will acquire the RTNL)
The deadlock that can now happen is this: linkwatch_work is scheduled on
the workqueue but not running yet. During tulip_down(),
flush_scheduled_work() is called, which waits for everything that is
scheduled to complete. Among those items can be linkwatch_event(), which
starts running and tries to acquire the RTNL. Since the RTNL is already
held, it blocks; meanwhile we are sitting in flush_scheduled_work()
waiting for linkwatch_event() to finish, while holding the RTNL. Neither
side can make progress: a classic circular wait.
The fix here would most likely be to use cancel_work_sync() instead of
flush_scheduled_work(): it only cancels (and, if necessary, waits for)
the driver's own work struct, so it never has to wait for unrelated
queued work like linkwatch_event().
This should be a correct change afaict, unless tulip has more work
structs than the media work.
@@ tulip_down
- flush_scheduled_work();
+ cancel_work_sync(&tp->media_work);
johannes
