Happy Holidays!

After some more investigation I've changed the patch to what I think is
a more appropriate fix, though it will only fix cases where the
container hang is due to a TCP kernel socket hanging while trying to
close itself (waiting for FIN close sequence, which will never complete
since the container interfaces are shut down).

I also added some debug output for kernel-owned sockets, so that in case
of another 'waiting for lo to become free' hang it will print out all
open kernel sockets along with their family, state, and creating
function. That should help debug further if my TCP patch doesn't cover
all cases.

The same ppa has been updated with the new patches:
https://launchpad.net/~ddstreet/+archive/ubuntu/lp1711407

And if you would like to see the patches they're in my lp git repo:
https://code.launchpad.net/~ddstreet/+git/linux/+ref/lp1711407-tcp-check-xenial

Can those able to reproduce this bug please test with any of the kernels
from that ppa?  There is some debug output to let you track container
net namespace lifetimes; you should see log entries like:

[  229.134701] net_alloc: created netns ffff8800b99cc980
...
[  438.930538] net_free: freed netns ffff8800b99cc980

you can track the alloc and free of each container that way, to make
sure none of them are remaining open (i.e. leaking).  Note that netns
addresses may be re-used, so you may see something like:

[  228.797533] net_free: freed netns ffff8800baab1880
...
[  352.000520] net_alloc: created netns ffff8800baab1880

that's fine and not a problem.
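
For anyone with a long log to go through, the alloc/free matching can be
scripted. The following is only a rough sketch, not part of the patches:
it assumes the dmesg output has been saved as text and that the messages
look exactly like the examples above. The function name and the sample
text (assembled from the lines quoted above) are just for illustration.

```python
import re

# Message formats as shown in the examples above.
ALLOC_RE = re.compile(r"net_alloc: created netns ([0-9a-f]+)")
FREE_RE = re.compile(r"net_free: freed netns ([0-9a-f]+)")


def find_leaked_netns(log_text):
    """Return {address: alloc line} for netns still allocated at end of log."""
    live = {}
    for line in log_text.splitlines():
        m = ALLOC_RE.search(line)
        if m:
            # An address may be re-used after a free; the newest alloc wins.
            live[m.group(1)] = line.strip()
            continue
        m = FREE_RE.search(line)
        if m:
            # A free with no alloc in the log (netns created before the
            # log starts) is ignored.
            live.pop(m.group(1), None)
    return live


# Illustrative sample only, built from the log lines quoted above.
sample = """\
[  228.797533] net_free: freed netns ffff8800baab1880
[  229.134701] net_alloc: created netns ffff8800b99cc980
[  352.000520] net_alloc: created netns ffff8800baab1880
[  438.930538] net_free: freed netns ffff8800b99cc980
"""

for addr in find_leaked_netns(sample):
    print("still allocated:", addr)
```

A netns reported here is not necessarily leaked; its container may simply
still be running when the log was captured.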

Also, if you see this line in the logs:

[   22.149194] TCP: our netns is exiting

that indicates the patch to fix this problem detected the problem and
closed the hanging TCP socket.

If you can reproduce the issue with any of those test kernels, please
post one of the socket debug output sections that look like:

[  410.258505] unregister_netdevice: waiting for lo to become free. Usage count = 1
[  410.261385] netdev_wait_allrefs: waiting on sk ffff8800b9662800 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261390] netdev_wait_allrefs: waiting on sk ffff8800b9662000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261408] netdev_wait_allrefs: waiting on sk ffff8800b9661000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261411] netdev_wait_allrefs: waiting on sk ffff8800baeb4000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261416] netdev_wait_allrefs: waiting on sk ffff8800b9699680 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261420] netdev_wait_allrefs: waiting on sk ffff8800bb3743c0 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261423] netdev_wait_allrefs: waiting on sk ffff8800bb374780 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261426] netdev_wait_allrefs: waiting on sk ffff8800bb374b40 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261430] netdev_wait_allrefs: waiting on sk ffff8800bb374f00 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261433] netdev_wait_allrefs: waiting on sk ffff8800bb3752c0 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261436] netdev_wait_allrefs: waiting on sk ffff8800bb375680 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261439] netdev_wait_allrefs: waiting on sk ffff8800bb375a40 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261442] netdev_wait_allrefs: waiting on sk ffff8800bb375e00 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261446] netdev_wait_allrefs: waiting on sk ffff8800baeb1000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261449] netdev_wait_allrefs: waiting on sk ffff8800baeb2800 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261452] netdev_wait_allrefs: waiting on sk ffff8800b9f18000 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261456] netdev_wait_allrefs: waiting on sk ffff8800b9f18480 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261459] netdev_wait_allrefs: waiting on sk ffff8800b9f18900 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261462] netdev_wait_allrefs: waiting on sk ffff8800b9f18d80 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261465] netdev_wait_allrefs: waiting on sk ffff8800b9f19200 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261469] netdev_wait_allrefs: waiting on sk ffff8800b9f19680 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261472] netdev_wait_allrefs: waiting on sk ffff8800b9f19b00 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261475] netdev_wait_allrefs: waiting on sk ffff8800b9f19f80 family 10 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261478] netdev_wait_allrefs: waiting on sk ffff8800baeb7000 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261510] netdev_wait_allrefs: waiting on sk ffff8800ba4c0000 family 2 state 4 creator generic_ip_connect+0x3a4/0x540 [cifs]


The most important line is likely the last one, for the socket that isn't
in TCP state 7 (closed).  In the above example, the last socket is in TCP
state 4 (fin_wait1), and was created by the cifs driver.
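
If the debug section is long, the interesting sockets can be picked out
with a small script. This is only a sketch under some assumptions: it
reads the state field as a Linux TCP state number (as done above), it
decodes only the three address families that appear in the example, and
the function name and the shortened sample are just for illustration.
The regex tolerates entries that a mail client has wrapped onto two
lines.

```python
import re

# Linux address family numbers seen in the example output above.
FAMILIES = {2: "AF_INET", 10: "AF_INET6", 16: "AF_NETLINK"}

# Linux TCP state numbers (include/net/tcp_states.h).
TCP_STATES = {
    1: "ESTABLISHED", 2: "SYN_SENT", 3: "SYN_RECV", 4: "FIN_WAIT1",
    5: "FIN_WAIT2", 6: "TIME_WAIT", 7: "CLOSE", 8: "CLOSE_WAIT",
    9: "LAST_ACK", 10: "LISTEN", 11: "CLOSING", 12: "NEW_SYN_RECV",
}
TCP_CLOSE = 7

# \s+ between fields also matches a line break inside a wrapped entry.
SK_RE = re.compile(
    r"netdev_wait_allrefs: waiting on sk ([0-9a-f]+)\s+family\s+(\d+)\s+"
    r"state\s+(\d+)\s+creator\s+(\S+)(?:[ \t]+\[(\w+)\])?"
)


def unclosed_sockets(log_text):
    """Return one dict per reported socket whose state is not 7 (closed)."""
    found = []
    for m in SK_RE.finditer(log_text):
        addr, family, state, creator, module = m.groups()
        family, state = int(family), int(state)
        if state == TCP_CLOSE:
            continue
        found.append({
            "sk": addr,
            "family": FAMILIES.get(family, str(family)),
            "state": TCP_STATES.get(state, str(state)),
            "creator": creator,
            "module": module,  # None when the creator is not in a module
        })
    return found


# Shortened sample taken from the example output above.
sample = """\
[  410.261385] netdev_wait_allrefs: waiting on sk ffff8800b9662800 family 16 state 7 creator __netlink_kernel_create+0x6d/0x260
[  410.261416] netdev_wait_allrefs: waiting on sk ffff8800b9699680 family 2 state 7 creator inet_ctl_sock_create+0x35/0x80
[  410.261510] netdev_wait_allrefs: waiting on sk ffff8800ba4c0000 family 2 state 4 creator generic_ip_connect+0x3a4/0x540 [cifs]
"""

for sk in unclosed_sockets(sample):
    print(sk["sk"], sk["family"], sk["state"], sk["creator"], sk["module"])
```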

Also, if the new test kernel does fix the problem for you, please let me
know about that as well.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1711407

Title:
  unregister_netdevice: waiting for lo to become free

Status in linux package in Ubuntu:
  In Progress
Status in linux source package in Trusty:
  In Progress
Status in linux source package in Xenial:
  In Progress
Status in linux source package in Zesty:
  In Progress
Status in linux source package in Artful:
  In Progress
Status in linux source package in Bionic:
  In Progress

Bug description:
  This is a "continuation" of bug 1403152, as that bug has been marked
  "fix released" and recent reports of failure may (or may not) be a new
  bug.  Please report any further occurrences of the problem here
  instead of in that bug.

  --

  [Impact]

  When shutting down and starting containers the container network
  namespace may experience a dst reference counting leak which results
  in this message repeated in the logs:

      unregister_netdevice: waiting for lo to become free. Usage count = 1

  This can cause issues when trying to create a new network namespace
  and thus block a user from creating new containers.

  [Test Case]

  See comment 16; a reproducer is provided at
  https://github.com/fho/docker-samba-loop

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1711407/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
