On Wed, Oct 04, 2023 at 11:07:24PM +0200, Kirill Miazine wrote:
> • Vitaliy Makkoveev [2023-10-04 22:03]:
> > On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > > If it happens again, could you send an 'ps axlww | grep ifconfig'
> > > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > > >
> > > > > $ ps axlww
> > > > >   UID   PID  PPID CPU PRI  NI  VSZ  RSS WCHAN  STAT TT     TIME COMMAND
> > > >
> > > > Here it happened again:
> > > >
> > > >     0 75339 23922   0  10   0  360  296 wg_ifq D+U  p0  0:00.00 ifconfig wg1 destroy
> > >
> > > wg_peer_destroy()
> > > ...
> > > 	NET_LOCK();
> > > 	while (!ifq_empty(&sc->sc_if.if_snd)) {
> > > 		NET_UNLOCK();
> > > 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > > 		NET_LOCK();
> > > 	}
> > > 	NET_UNLOCK();
> > >
> > > This net lock dance looks fishy.  And the sleep has a timeout of one
> > > millisecond.  But that may be per packet.  So if you have a long
> > > queue or the queue refills somehow, it will take forever.
> > >
> > > I think the difference in the usage is constant traffic that keeps
> > > the send queue full.  The timeout hides the problem when there are
> > > only a few packets.
> >
> > This should ensure wg_qstart() will not dereference the dying `peer'.
> > It looks crappy and could potentially block forever, but it should
> > work.  However, the net lock is unnecessary here: the netlocked
> > wg_output() could fill `if_snd' while the net lock is released before
> > tsleep(), so it serializes nothing and only stops packet processing.
> >
> > Kirill, does this diff help?
>
> nope, same hang.
>
> tested on a fresh Vultr VM with -current and patch below.  VM got added
> to my normal WG network, and VM was accessed by SSH over that WG network.
> then:
>
> # ifconfig wg1 down	(from ssh -- connection to ssh session disappears)
> # ifconfig wg1 delete	(from console)
> # ifconfig wg1 destroy	(from console -- command hangs)
>
> interestingly, destroy works fine from ssh when commands are entered in
> a tmux session and executed immediately after each other:
>
> # ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy
>
> looks like a timing issue.
Looks like a packet is stuck in `if_snd'.  Hypothetically this hack should
help.  Please note, even if it works, I don't want to commit it.  Someone
should introduce a reference counter to wg_peer and remove this crap from
wg_peer_destroy().

Index: sys/net/if_wg.c
===================================================================
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c	26 Sep 2023 15:16:44 -0000	1.31
+++ sys/net/if_wg.c	4 Oct 2023 21:21:40 -0000
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
 	noise_remote_clear(&peer->p_remote);
 
-	NET_LOCK();
-	while (!ifq_empty(&sc->sc_if.if_snd)) {
-		NET_UNLOCK();
+	while (!ifq_empty(&sc->sc_if.if_snd))
 		tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-		NET_LOCK();
-	}
-	NET_UNLOCK();
 
 	taskq_barrier(wg_crypt_taskq);
 	taskq_barrier(net_tq(sc->sc_if.if_index));
@@ -2580,6 +2575,7 @@ wg_down(struct wg_softc *sc)
 	wg_unbind(sc);
 	rw_exit_read(&sc->sc_lock);
 	NET_LOCK();
+	ifq_purge(&sc->sc_if.if_snd);
 }
 
 int
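As an aside, the "reference counter to wg_peer" idea suggested above could be sketched roughly as below.  This is a hypothetical userspace illustration only: the names (struct peer, peer_create, peer_ref, peer_unref) are invented for this example and do not exist in if_wg.c, and an in-kernel version would more likely build on OpenBSD's refcnt(9) API than on C11 atomics.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical sketch of a reference-counted peer.  Each queued packet
 * (and the destroying thread itself) would hold one reference, and the
 * final teardown runs only when the last reference drops, so no
 * ifq_empty() busy-wait loop would be needed in wg_peer_destroy().
 */
struct peer {
	atomic_int	p_refcnt;	/* references held on this peer */
	bool		p_freed;	/* stand-in for the real teardown */
};

struct peer *
peer_create(void)
{
	struct peer *p = calloc(1, sizeof(*p));

	if (p != NULL)
		atomic_init(&p->p_refcnt, 1);	/* creator's reference */
	return p;
}

void
peer_ref(struct peer *p)
{
	atomic_fetch_add(&p->p_refcnt, 1);	/* e.g. a packet enqueued */
}

void
peer_unref(struct peer *p)
{
	/* atomic_fetch_sub() returns the old value: 1 means last reference. */
	if (atomic_fetch_sub(&p->p_refcnt, 1) == 1)
		p->p_freed = true;
}
```

Under this scheme wg_qstart() would take a reference before touching the peer and drop it when done, and wg_peer_destroy() would simply drop the creator's reference instead of spinning on the send queue.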