Just had another crash, 7 days after my previous email. Exact same symptoms, this time with the latest version from CZ repository: 1.6.2-3~bpo8+1.
bird6 stuck on recvmsg using 100% CPU, getting EAGAIN in an infinite loop:

# strace -p 23020
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(7, 0x7ffc45ae0ab0, 0) = -1 EAGAIN (Resource temporarily unavailable)
[...]

None of this happened in 1.5.0.

What can I do to help troubleshoot this? This is a major regression and it's making me seriously concerned about both my edge routers using the same version of Bird.

On 12/02/2016 06:46 PM, Israel G. Lugo wrote:
> Hello,
>
> I am getting some random crashes in bird6, running on Debian, version
> 1.6.2-1~bpo8+1 from your http://bird.network.cz/debian/ repository.
>
> I've got a single OSPF instance with 74 routes, one eBGP session
> receiving a default route, and one iBGP session with another Bird
> router, which sends me its own default.
>
> What happens is that, from time to time, bird6 becomes stuck in an
> infinite loop doing recvmsg() on a netlink socket, and IPv6 routes are
> lost. The interval seems random; it's been 3 days, and it's also been 2
> weeks.
>
>
> gk1 # strace -p 11465
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> recvmsg(7, 0x7ffe8cfecb70, 0) = -1 EAGAIN (Resource temporarily unavailable)
> [...]
>
> File descriptor 7 is a netlink socket:
>
> gk1 # lsof -p 11465
> COMMAND   PID USER   FD    TYPE             DEVICE SIZE/OFF      NODE NAME
> bird6   11465 bird  cwd     DIR              253,0     4096         2 /
> bird6   11465 bird  rtd     DIR              253,0     4096         2 /
> bird6   11465 bird  txt     REG              253,0   540648    787381 /usr/sbin/bird6
> bird6   11465 bird  mem     REG              253,0    47712    659204 /lib/x86_64-linux-gnu/libnss_files-2.19.so
> bird6   11465 bird  mem     REG              253,0    43592    659208 /lib/x86_64-linux-gnu/libnss_nis-2.19.so
> bird6   11465 bird  mem     REG              253,0    89104    659199 /lib/x86_64-linux-gnu/libnsl-2.19.so
> bird6   11465 bird  mem     REG              253,0    31632    659200 /lib/x86_64-linux-gnu/libnss_compat-2.19.so
> bird6   11465 bird  mem     REG              253,0  1738176    659160 /lib/x86_64-linux-gnu/libc-2.19.so
> bird6   11465 bird  mem     REG              253,0   137440    655379 /lib/x86_64-linux-gnu/libpthread-2.19.so
> bird6   11465 bird  mem     REG              253,0   140928    655799 /lib/x86_64-linux-gnu/ld-2.19.so
> bird6   11465 bird   0u     CHR                1,3      0t0      1028 /dev/null
> bird6   11465 bird   1u     CHR                1,3      0t0      1028 /dev/null
> bird6   11465 bird   2u     CHR                1,3      0t0      1028 /dev/null
> bird6   11465 bird   3u    unix 0xffff8803269f7c00      0t0 127941139 socket
> bird6   11465 bird   4u    unix 0xffff8803269f7480      0t0 127941145 /run/bird/bird6.ctl
> bird6   11465 bird   5u netlink                         0t0 127906248 ROUTE
> bird6   11465 bird   6u netlink                         0t0 127906249 ROUTE
> bird6   11465 bird   7u netlink                         0t0 127906250 ROUTE
> bird6   11465 bird   8u    IPv6          127906251      0t0       TCP *:bgp (LISTEN)
> bird6   11465 bird   9u    raw6                         0t0 127906252 00000000000000000000000000000000:0059->00000000000000000000000000000000:0000 st=07
> bird6   11465 bird  10u    IPv6          127994711      0t0       TCP e0.gk1:bgp->e0.gk2:39074 (CLOSE_WAIT)
> bird6   11465 bird  11u    IPv6          127965176      0t0       TCP [2001:w:y:x::133]:58268->[2001:w:y:x::1]:bgp (CLOSE_WAIT)
>
> Unfortunately I didn't find any debug symbols for this package, so all I
> could get from gdb was the following:
>
> (gdb) bt
> #0  0x00007f5ad1705e80 in __recvmsg_nocancel () at ../sysdeps/unix/syscall-template.S:81
> #1  0x00007f5ad1b90428 in ?? ()
> #2  0x00007f5ad1b8956b in ?? ()
> #3  0x00007f5ad1b8a06b in ?? ()
> #4  0x00007f5ad1b3f0c7 in ?? ()
> #5  0x00007f5ad136db45 in __libc_start_main (main=0x7f5ad1b3eb10,
>     argc=5, argv=0x7ffe8cfece28, init=<optimized out>, fini=<optimized out>,
>     rtld_fini=<optimized out>, stack_end=0x7ffe8cfece18)
>     at libc-start.c:287
> #6  0x00007f5ad1b3f3ec in ?? ()
> (gdb) info r
> rax     0xfffffffffffffff5  -11
> rbx     0x7f5ad32aefe0      140028066590688
> rcx     0xffffffffffffffff  -1
> rdx     0x0                 0
> rsi     0x7ffe8cfecb70      140731263929200
> rdi     0x7                 7
> rbp     0x7f5ad1dba270      0x7f5ad1dba270
> rsp     0x7ffe8cfecb18      0x7ffe8cfecb18
> r8      0x7f5ad32aefe0      140028066590688
> r9      0x0                 0
> r10     0x1                 1
> r11     0x246               582
> r12     0x0                 0
> r13     0x7f5ad32c7f60      140028066692960
> r14     0x100               256
> r15     0x0                 0
> rip     0x7f5ad1705e80      0x7f5ad1705e80 <__recvmsg_nocancel+7>
> eflags  0x246               [ PF ZF IF ]
> cs      0x33                51
> ss      0x2b                43
> ds      0x0                 0
> es      0x0                 0
> fs      0x0                 0
> gs      0x0                 0
>
>
> Unfortunately, I did not have debug on when this crashed. I had it on
> for several days, but either I was "lucky" or the debug prevented the
> crash somehow. I was having several MB worth of debug logs every day, so
> I ended up disabling debug.
>
> I'm not 100% sure that this was installed from your CZ repository, it
> may have been from Debian backports. But I'm 95% sure it came from CZ.
> In any case the MD5 is as follows:
>
> 56e48e8e5a1380b384f1758df2077e53  bird_1.6.2-1~bpo8+1_amd64.deb
>
> I have now upgraded to 1.6.2-3~bpo8+1, from your CZ repository.
>
> I can provide the configuration file off-list, if that helps.
>
> Regards,
>
> Israel
>
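
For anyone trying to reproduce the symptom in isolation: the strace signature above (recvmsg() returning EAGAIN again and again) is what a non-blocking socket gives when it is read while no data is queued. The sketch below only illustrates that general behaviour. It is not BIRD's code and makes no claim about where the bug is. It assumes Linux, uses an AF_UNIX datagram socketpair as a stand-in for the netlink socket, and the helper name drain() is made up for the example.

```python
import errno
import select
import socket


def drain(sock, attempts):
    """Call recvmsg() `attempts` times on a non-blocking socket.

    Returns a list of ("data", payload) or ("error", errno) tuples,
    one per call, so the caller can see what each attempt produced.
    """
    outcomes = []
    for _ in range(attempts):
        try:
            data, _ancdata, _flags, _addr = sock.recvmsg(4096)
            outcomes.append(("data", data))
        except BlockingIOError as exc:
            # Nothing queued: the kernel reports EAGAIN instead of blocking.
            outcomes.append(("error", exc.errno))
    return outcomes


# Stand-in for the netlink socket (assumption: any non-blocking datagram
# socket behaves the same way with respect to EAGAIN).
reader, writer = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
reader.setblocking(False)

# Reading with nothing queued: every call fails with EAGAIN, which is
# what a retry loop without a readiness check would spin on.
empty_results = drain(reader, 3)

# A readiness check (select with zero timeout) says "not readable" here.
ready_before = bool(select.select([reader], [], [], 0)[0])

# Once a message is queued, select reports the socket as readable and
# the next recvmsg() succeeds; the call after that is EAGAIN again.
writer.send(b"ping")
ready_after = bool(select.select([reader], [], [], 1.0)[0])
after_results = drain(reader, 2)

reader.close()
writer.close()

print(empty_results, ready_before, ready_after, after_results)
```

The point of the sketch is only that EAGAIN by itself is normal for a non-blocking socket. The 100% CPU comes from calling recvmsg() again immediately instead of returning to the event loop and waiting for the descriptor to become readable.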
