On Fri, Apr 21, 2017 at 10:25 AM, Linus Torvalds
<torva...@linux-foundation.org> wrote:
>
> I'm assuming that the real cause is simply that "dev->reg_state" ends
> up being NETREG_UNREGISTERING or something. Maybe the BUG_ON() could
> be just removed, and replaced by the previous warning about
> NETREG_UNINITIALIZED.
>
> Something like the attached (TOTALLY UNTESTED) patch.

.. might as well test it. That patch doesn't fix the problem, but it
does show that yes, it was NETREG_UNREGISTERING:

    unregister_netdevice: device pim6reg/ffff962dc4606000 was not registered (2)

but then immediately afterwards we get

    general protection fault: 0000 [#1] SMP
    Workqueue: netns cleanup_net
    RIP: 0010:dev_shutdown+0xe/0xc0
    Call Trace:
      rollback_registered_many+0x2a5/0x440
      unregister_netdevice_many+0x1e/0xb0
      default_device_exit_batch+0x145/0x170

which is due to a

    mov    0x388(%rdi),%eax

where %rdi is 0xdead000000000090. That is at the very beginning of
dev_shutdown, it's "dev" itself that has that value, so it comes from
(_another_) invocation of rollback_registered_many(), when it does
that

    list_for_each_entry(dev, head, unreg_list) {

so it seems to be a case of another "list_del() leaves list in bad
state", and it was the added test for "dev->reg_state !=
NETREG_REGISTERED" that did that

    list_del(&dev->unreg_list);

and left random contents in the unreg_list.

So that "handle error case" was almost certainly just buggy too.

And the bug seems to be that we're trying to unregister a netdevice
that has already been unregistered.

Over to Eric and networking people. This oops is user-triggerable, and
leaves the machine in a bad state (the original BUG_ON() and the new
GP fault both happen while holding the RTNL, so networking is not
healthy afterwards).

              Linus