Public bug reported:

I am running a single nova-network node as gateway, and have about 20 KVM instances spread over 4 compute nodes (one of them is also the controller node). Everything runs Ubuntu 12.04 LTS.

From time to time one or another instance WILL lose connectivity. That is, it still has its IP address (DHCP lease times raised to 7 days), but no communication is possible in either direction.

This looks very much like some kind of networking problem, but what exactly stopped working?

I connected to the failing KVM instance via VNC and checked its interface, which looks pretty normal (like those of the other, working instances).

On the hypervisor, I see the following state:

root@colossus09:~# ifconfig
br100     Link encap:Ethernet  HWaddr 00:25:90:49:d9:04
          inet6 addr: fe80::2c48:74ff:fe22:a6cb/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:85845 errors:0 dropped:15 overruns:0 frame:0
          TX packets:7463 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:2906526 (2.9 MB)  TX bytes:641770 (641.7 KB)

eth0      Link encap:Ethernet  HWaddr 00:25:90:49:d9:04
          inet addr:10.10.30.189  Bcast:10.10.31.255  Mask:255.255.224.0
          inet6 addr: fe80::225:90ff:fe49:d904/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1359563761 errors:0 dropped:174156 overruns:2 frame:0
          TX packets:1222020947 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1111996716949 (1.1 TB)  TX bytes:673176161112 (673.1 GB)
          Memory:fafe0000-fb000000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:4628041 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4628041 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1060925632 (1.0 GB)  TX bytes:1060925632 (1.0 GB)

vlan100   Link encap:Ethernet  HWaddr 00:25:90:49:d9:04
          inet6 addr: fe80::225:90ff:fe49:d904/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:909059394 errors:0 dropped:0 overruns:0 frame:0
          TX packets:907044613 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1053993706102 (1.0 TB)  TX bytes:641297608033 (641.2 GB)

vnet0     Link encap:Ethernet  HWaddr fe:16:3e:3e:f4:58
          inet6 addr: fe80::fc16:3eff:fe3e:f458/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:62963968 errors:0 dropped:0 overruns:0 frame:0
          TX packets:61960786 errors:0 dropped:0 overruns:1 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:52542425624 (52.5 GB)  TX bytes:84912733569 (84.9 GB)

vnet1     Link encap:Ethernet  HWaddr fe:16:3e:01:ec:81
          inet6 addr: fe80::fc16:3eff:fe01:ec81/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1280 errors:0 dropped:0 overruns:0 frame:0
          TX packets:56964 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:110032 (110.0 KB)  TX bytes:2461222 (2.4 MB)

vnet2     Link encap:Ethernet  HWaddr fe:16:3e:3c:46:1b
          inet6 addr: fe80::fc16:3eff:fe3c:461b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:34725792 errors:0 dropped:0 overruns:0 frame:0
          TX packets:35909449 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500
          RX bytes:2321718823 (2.3 GB)  TX bytes:10039460160 (10.0 GB)

vnet0 is almost certainly the device belonging to the failing KVM instance.

root@colossus09:~# ps afx | grep /kvm
 5080 pts/12   S+     0:00  \_ grep --color=auto /kvm
 1811 ?        Sl   848:32 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 16384 -smp 8,sockets=8,cores=1,threads=1 -name instance-00000036 -uuid 6dee1800-6e1e-42dd-abe9-8d8efa752bc5 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000036.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/var/lib/nova/instances/instance-00000036/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/var/lib/nova/instances/instance-00000036/disk.local,if=none,id=drive-virtio-disk1,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x5,drive=drive-virtio-disk1,id=virtio-disk1 -netdev tap,fd=21,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:3c:46:1b,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-00000036/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:2 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
 2275 ?        Sl  4235:21 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 32768 -smp 20,sockets=20,cores=1,threads=1 -name instance-00000011 -uuid 48e3db02-a8ec-4140-8faa-d1f1f101ef29 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-00000011.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/var/lib/nova/instances/instance-00000011/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=17,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:3e:f4:58,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-00000011/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:0 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
 2667 ?        Sl    28:37 /usr/bin/kvm -S -M pc-1.0 -enable-kvm -m 512 -smp 1,sockets=1,cores=1,threads=1 -name instance-0000000f -uuid cb9aed4b-5daa-4c1c-85a6-9101adddde8d -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/instance-0000000f.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -drive file=/var/lib/nova/instances/instance-0000000f/disk,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=16,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:01:ec:81,bus=pci.0,addr=0x3 -chardev file,id=charserial0,path=/var/lib/nova/instances/instance-0000000f/console.log -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -usb -device usb-tablet,id=input0 -vnc 0.0.0.0:1 -k en-us -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

root@colossus09:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.10.30.4      0.0.0.0         UG    100    0        0 eth0
10.10.0.0       0.0.0.0         255.255.224.0   U     0      0        0 eth0

root@colossus09:~# brctl show
bridge name     bridge id               STP enabled     interfaces
br100           8000.00259049d904       no              vlan100
                                                        vnet0
                                                        vnet1
                                                        vnet2

root@colossus09:~# dmesg | grep vnet0 | tail -n 5
[827452.395730] br100: port 2(vnet0) entering disabled state
[827468.595961] device vnet0 entered promiscuous mode
[827468.661699] br100: port 2(vnet0) entering forwarding state
[827468.661705] br100: port 2(vnet0) entering forwarding state
[827479.315601] vnet0: no IPv6 routers present

root@colossus09:~# brctl showmacs br100
port no mac addr                is local?       ageing timer
  1     00:25:90:2b:63:de       no                22.36
  1     00:25:90:49:bf:ce       no                21.91
  1     00:25:90:49:bf:e2       no                22.56
  1     00:25:90:49:d9:04       yes                0.00
  3     fa:16:3e:01:ec:81       no               107.35
  1     fa:16:3e:14:b4:16       no                21.90
  1     fa:16:3e:28:e0:ab       no                21.74
  1     fa:16:3e:2b:be:38       no                21.65
  1     fa:16:3e:31:92:53       no                21.78
  1     fa:16:3e:3b:74:7a       no                21.92
  4     fa:16:3e:3c:46:1b       no                 0.00
  1     fa:16:3e:3d:ff:f3       no                 0.00
  2     fa:16:3e:3e:f4:58       no                 0.77
  1     fa:16:3e:42:8f:59       no                22.36
  1     fa:16:3e:43:bb:04       no                21.92
  1     fa:16:3e:47:65:c8       no                21.99
  1     fa:16:3e:4f:e6:4c       no                21.78
  1     fa:16:3e:57:a7:e6       no                22.50
  1     fa:16:3e:59:8d:93       no                 0.50
  1     fa:16:3e:64:fc:b5       no                21.71
  1     fa:16:3e:67:dc:73       no                22.27
  1     fa:16:3e:72:7f:3d       no                21.80
  1     fa:16:3e:7f:8c:5c       no                22.03
  3     fe:16:3e:01:ec:81       yes                0.00
  4     fe:16:3e:3c:46:1b       yes                0.00
  2     fe:16:3e:3e:f4:58       yes                0.00

I then added an IP address to br100 so that I could test directly with ping.

root@colossus09:~# ip addr add 10.10.40.90/21 dev br100
root@colossus09:~# ip route flush table cache
root@colossus09:~# ping -c1 10.10.40.17
PING 10.10.40.17 (10.10.40.17) 56(84) bytes of data.
64 bytes from 10.10.40.17: icmp_req=1 ttl=64 time=0.533 ms

--- 10.10.40.17 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.533/0.533/0.533/0.000 ms
root@colossus09:~# ping -c1 10.10.40.9
PING 10.10.40.9 (10.10.40.9) 56(84) bytes of data.

--- 10.10.40.9 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms

The first ping, to 10.10.40.17 (a KVM instance directly on this host, on vnet1 or vnet2), works. I then pinged the failing KVM instance at 10.10.40.9, which times out.
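
As a cross-check of the guess that vnet0 belongs to the failing instance, the MAC addresses in the output above can be paired up. This is only a minimal sketch: it assumes libvirt's convention that the host-side tap device carries the guest's MAC address with the first octet set to fe, which is what the ifconfig and brctl showmacs output suggests. The two tables are copied by hand from the ifconfig and ps output, and the function name tap_to_instance is made up for illustration.

```python
# Sketch: pair each host-side tap device (vnetN) with the instance it serves.
# Assumption: the tap MAC equals the guest MAC except for the first octet
# (fe on the host side, fa inside the guest).

# HWaddr values from the ifconfig output on the hypervisor.
TAP_MACS = {
    "vnet0": "fe:16:3e:3e:f4:58",
    "vnet1": "fe:16:3e:01:ec:81",
    "vnet2": "fe:16:3e:3c:46:1b",
}

# mac= values from the kvm command lines in the ps output.
GUEST_MACS = {
    "instance-00000036": "fa:16:3e:3c:46:1b",
    "instance-00000011": "fa:16:3e:3e:f4:58",
    "instance-0000000f": "fa:16:3e:01:ec:81",
}


def mac_suffix(mac):
    """Return the MAC address without its first octet, lower-cased."""
    return mac.lower().split(":", 1)[1]


def tap_to_instance(tap_macs, guest_macs):
    """Map each tap device name to the instance whose MAC suffix matches."""
    by_suffix = {mac_suffix(mac): name for name, mac in guest_macs.items()}
    return {tap: by_suffix.get(mac_suffix(mac)) for tap, mac in tap_macs.items()}


if __name__ == "__main__":
    for tap, instance in sorted(tap_to_instance(TAP_MACS, GUEST_MACS).items()):
        print(tap, "->", instance)
```

With the values above, vnet0 pairs with instance-00000011. If the vnet0 guess is right, that is the instance holding 10.10.40.9.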

root@colossus09:~# tcpdump -c 4 -n -i vnet0
tcpdump: WARNING: vnet0: no IPv4 address assigned
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vnet0, link-type EN10MB (Ethernet), capture size 65535 bytes
10:44:19.514639 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:20.514110 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:21.902920 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
10:44:22.902762 ARP, Request who-has 10.10.40.1 tell 10.10.40.9, length 28
4 packets captured
4 packets received by filter

The above shows (I think) that 10.10.40.9 wants to know the MAC address of 10.10.40.1 but no one answers, though I might be misinterpreting this. At least, someone is not answering. I can see the same ARP requests via tcpdump when inside the KVM instance (via VNC).

What can I do to *fix* this? For me, this incident is major, since we just cannot add more production instances until we have fixed this. :-(

Best regards,
Christian.

** Affects: nova
     Importance: Undecided
         Status: New

** Affects: ubuntu
     Importance: Undecided
         Status: New

** Also affects: ubuntu
   Importance: Undecided
       Status: New

-- 
You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1016848

Title:
  KVM instance stops communicating after some time

To manage notifications about this bug go to:
https://bugs.launchpad.net/nova/+bug/1016848/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs