** Description changed: + [Description] - I have a pair of DHCP serevrs running in a cluster on ubuntu 20.04, All worked perfectly until recently, when they started stopping with code=killed, status=6/ABRT. - This is being fixed by + isc-dhcp-server uses libisc-export (coming from bin9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket + in a pending to send state, a timer or a subsequent write request can be scheduled into this socket and the !sock->pending_send assertion + will be raised when trying to write again at the time data hasn't been flushed entirely and the pending_send flag hasn't been reset to 0 state. - https://bugs.launchpad.net/bugs/1870729 + If this race condition happens, the following stacktrace will be + hit: - However now one stops after a few hours with the following errors. One - can stay on line but not both. + (gdb) info threads + Id Target Id Frame + * 1 Thread 0x7fb4ddecb700 (LWP 3170) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 + 2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex@entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52 + 3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364 + 4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183 + + (gdb) frame 2 + #2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist, + cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52 + (gdb) bt + #1 0x00007fb4deaa7859 in __GI_abort () at abort.c:79 + #2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist, + cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52 + #3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041 + #4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=11, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4054 + #5 process_fds (writefds=<optimized out>, readfds=0x7fb4de6d1090, maxfd=13, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4211 + #6 watcher (uap=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4397 + #7 0x00007fb4dea68609 in start_thread (arg=<optimized out>) at pthread_create.c:477 + #8 0x00007fb4deba4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 + + (gdb) frame 3 + #3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041 + 4041 in ../../../../lib/isc/unix/socket.c + (gdb) p sock->pending_send + $2 = 1 + + [TEST CASE] + + 1) Install isc-dhcp-server in 2 focal machine(s). + 2) Configure peer/cluster mode as follows: + Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/ + Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/ + 2) Run dhcpd as follows in both machine(s) + + # dhcpd -f -d -4 -cf /etc/dhcp/dhcpd.conf --no-pid ens4 + + 3) Leave the cluster running for a long (2h) period until the crash/race + condition is reproduced. + [REGRESSION POTENTIAL] - Syslog shows - Apr 10 17:20:15 dhcp-primary sh[6828]: ../../../../lib/isc/unix/socket.c:3361: INSIST(!sock->pending_send) failed, back trace - Apr 10 17:20:15 dhcp-primary sh[6828]: #0 0x7fbe78702a4a in ?? - Apr 10 17:20:15 dhcp-primary sh[6828]: #1 0x7fbe78702980 in ?? - Apr 10 17:20:15 dhcp-primary sh[6828]: #2 0x7fbe7873e7e1 in ?? - Apr 10 17:20:15 dhcp-primary sh[6828]: #3 0x7fbe784e5609 in ?? - Apr 10 17:20:15 dhcp-primary sh[6828]: #4 0x7fbe78621103 in ?? - - - nothing in kern.log - - - apport.log shows - ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: called for pid 6828, signal 6, core limit 0, dump mode 2 - ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: not creating core for pid with dump mode of 2 - ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: executable: /usr/sbin/dhcpd (command line "dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf") - ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: is_closing_session(): no DBUS_SESSION_BUS_ADDRESS in environment - ERROR: apport (pid 6850) Fri Apr 10 17:20:15 2020: wrote report /var/crash/_usr_sbin_dhcpd.0.crash - - - /var/crash/_usr_sbin_dhcpd.0.crash shows - - ProblemType: Crash - Architecture: amd64 - CrashCounter: 1 - Date: Fri Apr 10 17:20:15 2020 - DistroRelease: Ubuntu 20.04 - ExecutablePath: /usr/sbin/dhcpd - ExecutableTimestamp: 1586210315 - ProcCmdline: dhcpd -user dhcpd -group dhcpd -f -4 -pf /run/dhcp-server/dhcpd.pid -cf /etc/dhcp/dhcpd.conf - ProcEnviron: Error: [Errno 13] Permission denied: 'environ' - ProcMaps: Error: [Errno 13] Permission denied: 'maps' - ProcStatus: - Name: dhcpd - Umask: 0022 - State: D (disk sleep) - Tgid: 6828 - Ngid: 0 - Pid: 6828 - PPid: 1 - TracerPid: 0 - Uid: 113 113 113 113 - Gid: 118 118 118 118 - FDSize: 128 - Groups: - NStgid: 6828 - NSpid: 6828 - NSpgid: 6828 - NSsid: 6828 - VmPeak: 236244 kB - VmSize: 170764 kB - VmLck: 0 kB - VmPin: 0 kB - VmHWM: 12064 kB - VmRSS: 12064 kB - RssAnon: 5940 kB - RssFile: 6124 kB - RssShmem: 0 kB - VmData: 30792 kB - VmStk: 132 kB - VmExe: 592 kB - VmLib: 5424 kB - VmPTE: 76 kB - VmSwap: 0 kB - HugetlbPages: 0 kB - CoreDumping: 1 - THP_enabled: 1 - Threads: 4 - SigQ: 0/7609 - SigPnd: 0000000000000000 - ShdPnd: 0000000000000000 - SigBlk: 0000000000000000 - SigIgn: 0000000000001000 - SigCgt: 0000000180000000 - CapInh: 0000000000000000 - CapPrm: 0000000000000000 - CapEff: 0000000000000000 - CapBnd: 0000003fffffffff - CapAmb: 0000000000000000 - NoNewPrivs: 0 - Seccomp: 0 - Speculation_Store_Bypass: thread vulnerable - Cpus_allowed: 3 - Cpus_allowed_list: 0-1 - Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,0000 - 0000,00000000,00000000,00000001 - Mems_allowed_list: 0 - voluntary_ctxt_switches: 111 - nonvoluntary_ctxt_switches: 144 - Signal: 6 - Uname: Linux 5.4.0-21-generic x86_64 - UserGroups: + * The fix will prevent the assertion to happen in the dispatch_send + path, later versions of isch-dhcp upstream lack this logic and entirely removed the existence of this flag. Therefore, removing the need for this + assertion at process_fd shouldn't be problematic.
** Description changed: [Description] - isc-dhcp-server uses libisc-export (coming from bin9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket + isc-dhcp-server uses libisc-export (coming from bind9-libs package) for handling the socket event(s) when configured in peer mode (master/secondary). It's possible that a sequence of messages dispatched by the master that requires acknowledgment from its peers holds a socket in a pending to send state, a timer or a subsequent write request can be scheduled into this socket and the !sock->pending_send assertion will be raised when trying to write again at the time data hasn't been flushed entirely and the pending_send flag hasn't been reset to 0 state. If this race condition happens, the following stacktrace will be hit: (gdb) info threads - Id Target Id Frame + Id Target Id Frame * 1 Thread 0x7fb4ddecb700 (LWP 3170) __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50 - 2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex@entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52 - 3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364 - 4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183 + 2 Thread 0x7fb4dd6ca700 (LWP 3171) __lll_lock_wait (futex=futex@entry=0x7fb4de6d2028, private=0) at lowlevellock.c:52 + 3 Thread 0x7fb4de6cc700 (LWP 3169) futex_wake (private=<optimized out>, processes_to_wake=1, futex_word=<optimized out>) at ../sysdeps/nptl/futex-internal.h:364 + 4 Thread 0x7fb4de74f740 (LWP 3148) futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb4de6cd0d0) at ../sysdeps/nptl/futex-internal.h:183 (gdb) frame 2 #2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist, - cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52 + cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52 (gdb) bt #1 0x00007fb4deaa7859 in __GI_abort () at abort.c:79 #2 0x00007fb4dec85985 in isc_assertion_failed (file=file@entry=0x7fb4decd8878 "../../../../lib/isc/unix/socket.c", line=line@entry=3361, type=type@entry=isc_assertiontype_insist, - cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52 + cond=cond@entry=0x7fb4decda033 "!sock->pending_send") at ../../../lib/isc/assertions.c:52 #3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041 #4 process_fd (writeable=<optimized out>, readable=<optimized out>, fd=11, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4054 #5 process_fds (writefds=<optimized out>, readfds=0x7fb4de6d1090, maxfd=13, manager=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4211 #6 watcher (uap=0x7fb4de6d0010) at ../../../../lib/isc/unix/socket.c:4397 #7 0x00007fb4dea68609 in start_thread (arg=<optimized out>) at pthread_create.c:477 #8 0x00007fb4deba4103 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (gdb) frame 3 #3 0x00007fb4decc17e1 in dispatch_send (sock=0x7fb4de6d4990) at ../../../../lib/isc/unix/socket.c:4041 4041 in ../../../../lib/isc/unix/socket.c (gdb) p sock->pending_send $2 = 1 [TEST CASE] 1) Install isc-dhcp-server in 2 focal machine(s). 2) Configure peer/cluster mode as follows: - Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/ - Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/ + Primary configuration: https://pastebin.ubuntu.com/p/XYj648MghK/ + Secondary configuration: https://pastebin.ubuntu.com/p/PYkcshZCWK/ 2) Run dhcpd as follows in both machine(s) # dhcpd -f -d -4 -cf /etc/dhcp/dhcpd.conf --no-pid ens4 3) Leave the cluster running for a long (2h) period until the crash/race condition is reproduced. - [REGRESSION POTENTIAL] * The fix will prevent the assertion to happen in the dispatch_send path, later versions of isch-dhcp upstream lack this logic and entirely removed the existence of this flag. Therefore, removing the need for this assertion at process_fd shouldn't be problematic. -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1872118 Title: [SRU] DHCP Cluster crashes after a few hours To manage notifications about this bug go to: https://bugs.launchpad.net/dhcp/+bug/1872118/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs