On Wed, Oct 04, 2023 at 11:07:24PM +0200, Kirill Miazine wrote:
> 
> 
> • Vitaliy Makkoveev [2023-10-04 22:03]:
> > On Wed, Oct 04, 2023 at 09:13:59PM +0200, Alexander Bluhm wrote:
> > > On Wed, Oct 04, 2023 at 08:42:48PM +0200, Kirill Miazine wrote:
> > > > > If it happns again, could you send an 'ps axlww | grep ifconifg'
> > > > > output?  Then we see the wait channel where it hangs in the kernel.
> > > > > 
> > > > > $ ps axlww
> > > > >     UID   PID  PPID CPU PRI  NI   VSZ   RSS WCHAN   STAT   TT       
> > > > > TIME COMMAND
> > > > 
> > > > Here it happened again:
> > > > 
> > > >       0 75339 23922   0  10   0   360   296 wg_ifq  D+U    p0    0:00.00
> > > > ifconfig wg1 destroy
> > > 
> > > wg_peer_destroy()
> > >   ...
> > >          NET_LOCK();
> > >          while (!ifq_empty(&sc->sc_if.if_snd)) {
> > >                  NET_UNLOCK();
> > >                  tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
> > >                  NET_LOCK();
> > >          }
> > >          NET_UNLOCK();
> > > 
> > > This net lock dance looks fishy.  And the sleep has a timeout of 1
> > > milli second.  But that is may be per packet.  So if you have a
> > > long queue or the queue refills somehow, it will take forever.
> > > 
> > > I think the difference in the usage is constant traffic that keeps
> > > the send queue full.  The timeout hides the problem when there are
> > > only a few packets.
> > > 
> > 
> > This should ensure wg_qstart() will not dereference dying `peer'. Looks
> > crappy and potentially could block forever, but should work. However
> > netlock it unnecessary here. netlocked wg_output() could fill `if_snd'
> > while netlock released before tsleep(), so it serializes nothing but
> > stops packets processing.
> > 
> > Kirill, does this diff help?
> 
> nope, same hang.
> 
> tested on a fresh Vultr VM with -current and patch below. VM got added to my
> normal WG network, and VM was accessed by SSH over that WG network.
> 
> then:
> 
>  # ifconfig wg1 down (from ssh -- connection to ssh session disappears)
>  # ifconfig wg1 delete (from console)
>  # ifconfig wg1 destroy" (from console -- command hangs)
> 
> interestingly, destroy works fine from ssh when commands are entered in a
> tmux session and executed immediately after each other:
> 
>   # ifconfig wg1 down; ifconfig wg1 delete; ifconfig wg1 destroy
> 
> looks like a timing issue.
> 

Looks like packet stook in `if_snd'. Hypothetically this hack should
help. Please note, even it works, I don't want to commit it. Someone
should introduce reference counter to wg_peer and remove this crap from
wg_peer_destroy().

Index: sys/net/if_wg.c
===================================================================
RCS file: /cvs/src/sys/net/if_wg.c,v
retrieving revision 1.31
diff -u -p -r1.31 if_wg.c
--- sys/net/if_wg.c     26 Sep 2023 15:16:44 -0000      1.31
+++ sys/net/if_wg.c     4 Oct 2023 21:21:40 -0000
@@ -507,13 +507,8 @@ wg_peer_destroy(struct wg_peer *peer)
 
        noise_remote_clear(&peer->p_remote);
 
-       NET_LOCK();
-       while (!ifq_empty(&sc->sc_if.if_snd)) {
-               NET_UNLOCK();
+       while (!ifq_empty(&sc->sc_if.if_snd))
                tsleep_nsec(sc, PWAIT, "wg_ifq", 1000);
-               NET_LOCK();
-       }
-       NET_UNLOCK();
 
        taskq_barrier(wg_crypt_taskq);
        taskq_barrier(net_tq(sc->sc_if.if_index));
@@ -2580,6 +2575,7 @@ wg_down(struct wg_softc *sc)
        wg_unbind(sc);
        rw_exit_read(&sc->sc_lock);
        NET_LOCK();
+       ifq_purge(&sc->sc_if.if_snd);
 }
 
 int

Reply via email to