On Sun, 2008-02-24 at 21:22 -0500, Dave Jones wrote:
> https://bugzilla.redhat.com/show_bug.cgi?id=431038 has some more info,
> but the trace is below...
> I'll get an rc3 kernel built and ask the user to retest, but in case this
> isn't a known problem, I'm forwarding this here.
I can't fix it, but I can explain it.

> Feb 24 17:53:21 cirithungol kernel: ip/10650 is trying to acquire lock:
> Feb 24 17:53:21 cirithungol kernel: (events){--..}, at: [<c0436f9a>] flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel:
> Feb 24 17:53:21 cirithungol kernel: but task is already holding lock:
> Feb 24 17:53:21 cirithungol kernel: (rtnl_mutex){--..}, at: [<c05cea31>] rtnetlink_rcv+0x12/0x26
> Feb 24 17:53:21 cirithungol kernel:
> Feb 24 17:53:21 cirithungol kernel: which lock already depends on the new lock.

What's happening here is that linkwatch_work runs on the generic schedule_work() workqueue:

> Feb 24 17:53:21 cirithungol kernel: -> #1 ((linkwatch_work).work){--..}:

The function that is called is linkwatch_event(), which acquires the RTNL, as you can see here:

> Feb 24 17:53:21 cirithungol kernel: -> #2 (rtnl_mutex){--..}:
> Feb 24 17:53:21 cirithungol kernel: [<c04458f7>] __lock_acquire+0xa7c/0xbf4
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel: [<c04415dc>] tick_program_event+0x31/0x55
> Feb 24 17:53:21 cirithungol kernel: [<c0445ad9>] lock_acquire+0x6a/0x90
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel: [<c0638d21>] mutex_lock_nested+0xdb/0x271
> Feb 24 17:53:21 cirithungol kernel: [<c05cea1d>] rtnl_lock+0xf/0x11
> Feb 24 17:53:21 cirithungol kernel:last message repeated 2 times
> Feb 24 17:53:21 cirithungol kernel: [<c05cf755>] linkwatch_event+0x8/0x22

The problem with that is that tulip_down() calls flush_scheduled_work() while holding the RTNL:

> Feb 24 17:53:21 cirithungol kernel: [<c0436f9a>] flush_workqueue+0x0/0x85
> Feb 24 17:53:21 cirithungol kernel: [<c043702c>] flush_scheduled_work+0xd/0xf
> Feb 24 17:53:21 cirithungol kernel: [<f8f4380a>] tulip_down+0x20/0x1a3 [tulip]
> [...]
> Feb 24 17:53:21 cirithungol kernel: [<c05cea3d>] rtnetlink_rcv+0x1e/0x26

(rtnetlink_rcv will acquire the RTNL.)

The deadlock that can now happen is that linkwatch_work is scheduled on the workqueue but not running yet. During tulip_down(), flush_scheduled_work() is called, which waits for everything that is scheduled to complete. Among those things can be linkwatch_event(), which will start running and try to acquire the RTNL. Because the RTNL is already locked, it will block waiting for it; on the other hand, we're waiting for linkwatch_event() to finish while holding the RTNL.

The fix here would most likely be to not use flush_scheduled_work() but rather cancel_work_sync(), so that tulip only waits for its own work struct instead of everything on the shared queue. This should be a correct change afaict, unless tulip has more work structs than the media work.

@@ tulip_down
-	flush_scheduled_work();
+	cancel_work_sync(&tp->media_work);

johannes