On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info,
> but the trace is below...
> I'll get an rc3 kernel built and ask the user to retest, but in case this
> isn't a known problem, I'm forwarding this here.

I can't fix it but I can explain it.

> Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock:
> Feb 24 17:53:21 cirithungol kernel:  (events){--..}, at: [<c0436f9a>] 
> flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel: 
> Feb 24 17:53:21 cirithungol kernel: but task is already holding lock:
> Feb 24 17:53:21 cirithungol kernel:  (rtnl_mutex){--..}, at: [<c05cea31>] 
> rtnetlink_rcv+0x12/0x26
> Feb 24 17:53:21 cirithungol kernel: 
> Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new 
> lock.

What's happening here is that linkwatch_work runs on the generic
schedule_work() workqueue, i.e. the shared keventd queue that shows up
as the (events) lock class above.
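
For reference, the way that work gets queued looks roughly like this (a
simplified sketch of net/core/link_watch.c from that era; the names
match the trace, the details are trimmed):

	/* linkwatch uses one delayed work item and puts it on the
	 * shared keventd queue via schedule_delayed_work() */
	static void linkwatch_event(struct work_struct *dummy);
	static DECLARE_DELAYED_WORK(linkwatch_work, linkwatch_event);

	static void linkwatch_schedule_work(unsigned long delay)
	{
		/* schedule_delayed_work() targets the generic "events"
		 * queue, the one flush_scheduled_work() later drains */
		schedule_delayed_work(&linkwatch_work, delay);
	}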

> Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}:

The function that is called is linkwatch_event(), which acquires the
RTNL as you can see here:

> Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}:
> Feb 24 17:53:21 cirithungol kernel:        [<c04458f7>] 
> __lock_acquire+0xa7c/0xbf4
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:        [<c04415dc>] 
> tick_program_event+0x31/0x55
> Feb 24 17:53:21 cirithungol kernel:        [<c0445ad9>] lock_acquire+0x6a/0x90
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:        [<c0638d21>] 
> mutex_lock_nested+0xdb/0x271
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times
> Feb 24 17:53:21 cirithungol kernel:        [<c05cf755>] 
> linkwatch_event+0x8/0x22
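
In source form the handler is essentially (again simplified from
net/core/link_watch.c):

	static void linkwatch_event(struct work_struct *dummy)
	{
		rtnl_lock();	/* the rtnl_mutex acquisition in the trace */
		__linkwatch_run_queue(/* urgent_only */ 0);
		rtnl_unlock();
	}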

The problem with that is that tulip_down() calls flush_scheduled_work()
while holding the RTNL:

> Feb 24 17:53:21 cirithungol kernel:        [<c0436f9a>] 
> flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel:        [<c043702c>] 
> flush_scheduled_work+0xd/0xf
> Feb 24 17:53:21 cirithungol kernel:        [<f8f4380a>] tulip_down+0x20/0x1a3 
> [tulip]
[...]
> Feb 24 17:53:21 cirithungol kernel:        [<c05cea3d>] 
> rtnetlink_rcv+0x1e/0x26

(rtnetlink_rcv acquires the RTNL before dispatching the request, so
tulip_down() here runs with it held)
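
Putting the trace back together as a call chain (a sketch; the
intermediate frames are elided here as they are in the trace):

	rtnetlink_rcv()
	  rtnl_lock()			/* RTNL taken */
	  ...				/* dev_close() etc. */
	    tulip_down()
	      flush_scheduled_work()	/* drains the whole events queue
					   while the RTNL is still held */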


The deadlock that can now happen is this: linkwatch_work is scheduled
on the workqueue but not running yet. During tulip_down(),
flush_scheduled_work() is called, which waits for everything that is
scheduled to complete. Among those things could be linkwatch_event(),
which will start running and try to acquire the RTNL. Since the RTNL
is already held, linkwatch_event() blocks on it, while on the other
side we are holding the RTNL and waiting for linkwatch_event() to
finish. Neither side can make progress.
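
The same cycle is easy to demonstrate in user space (a hypothetical
pthreads analogue, not kernel code; the names only mirror the roles
above). This program deadlocks by construction:

	/* build with: cc -pthread deadlock.c */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;

	static void *worker(void *arg)		/* plays linkwatch_event() */
	{
		(void)arg;
		pthread_mutex_lock(&rtnl);	/* blocks: main holds "rtnl" */
		puts("linkwatch_event ran");
		pthread_mutex_unlock(&rtnl);
		return NULL;
	}

	int main(void)
	{
		pthread_t w;

		pthread_mutex_lock(&rtnl);	/* rtnetlink_rcv() takes the RTNL */
		pthread_create(&w, NULL, worker, NULL);
		pthread_join(w, NULL);		/* flush_scheduled_work(): waits
						   for the worker, so: deadlock */
		pthread_mutex_unlock(&rtnl);	/* never reached */
		return 0;
	}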

The fix here would most likely be not to use flush_scheduled_work()
but rather cancel_work_sync() on the driver's own work struct.

This should be a correct change afaict, unless tulip schedules more
work structs than just the media work.

@@ tulip_down
-       flush_scheduled_work();
+       cancel_work_sync(&tp->media_work);
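
cancel_work_sync() only waits for (and cancels) tulip's own work item
instead of draining the whole shared queue, so a pending
linkwatch_event() never has to run to completion under the RTNL and the
cycle cannot form. For context, the field the patch refers to (assuming
the media_work name from tulip.h):

	struct tulip_private {
		/* ... */
		struct work_struct media_work;	/* the work tulip schedules */
		/* ... */
	};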

johannes
