On 21/04/20(Tue) 15:54, Jan Klemkow wrote:
> Hi,
> 
> The following diff fixes a deadlock in em(4) when the system runs out
> of mbuf(9)s.
> 
> Tested with -current on amd64 with:
> em0 at pci0 dev 25 function 0 "Intel I217-LM" rev 0x05: msi,
> mac 0x1e phy 0xb, address d0:50:99:c1:67:94
> 
> When the system runs out of mbufs, the receive ring of the em NIC
> also runs out of mbufs.  When the driver pulls an mbuf from the NIC's
> receive ring in em_rxeof, it calls em_rxrefill to refill the ring.
> If it is not possible to refill the ring and the receive ring is
> empty, it sets a timeout to retry the refill later, until new mbufs
> are available.
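For readers following along, the retry path under discussion, as it
stands before the diff (abridged from em_rxrefill() in if_em.c r1.349;
the local variable setup is reproduced from memory and the rest of the
function is elided):

	void
	em_rxrefill(void *arg)
	{
		struct em_queue *que = arg;
		struct em_softc *sc = que->sc;

		/* Try to put fresh mbufs onto the receive ring. */
		if (em_rxfill(que))
			/* Some slots were filled: bump the ring's tail
			 * register so the NIC sees the new descriptors. */
			E1000_WRITE_REG(&sc->hw, RDT(que->me),
			    que->rx.sc_rx_desc_head);
		else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
			/* Nothing was filled and the ring is empty:
			 * retry from timeout context on the next tick. */
			timeout_add(&que->rx_refill, 1);
	}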
So the current logic assumes that as long as the ring contains
descriptors, em_rxrefill() failures aren't fatal and it is not
necessary to schedule a refill.  Why?  Unless we understand this it is
difficult to reason about a fix.  Have you looked at other drivers?
Is it a common pattern?

> In my case, the function em_rxeof is unable to pull the last one or
> two mbufs from the receive ring, because the descriptor status never
> gets the "descriptor done" flag (E1000_RXD_STAT_DD) set.  Thus, the
> ring has one or two remaining mbufs, and the timeout to put new ones
> in is never set.  When new mbufs become available later, the em
> driver stays stuck in that situation.

Are you saying there's always some descriptor on the ring?  Did you
investigate why?

> The diff always sets the timeout to put new mbufs into the receive
> ring when it has fewer mbufs than the "low water mark".

This changes the logic explained above.  With it the fatal condition
changes from 0 to `lwm'.  Another possible fix would be to always
schedule the timeout as soon as em_rxfill() fails.

I'm not advocating for a particular change, I'm suggesting we
understand the current logic and its implications.

This bug reminds me of sthen@'s change to be able to reliably use NFS
on a ports machine.  I wonder if they have something in common.

> A side effect is: when the mbufs in use (alive) drop below the "low
> water mark", the "current water mark" rises up to the value of the
> "high water mark".  When new mbufs become available, the number of
> mbufs in use also rises to that value, because the "current water
> mark" only goes down during net livelocks.
> 
> I'm sure this patch is not the best solution.  Maybe someone else has
> a better idea how to avoid this situation?!
> 
> Bye,
> Jan
> 
> Index: dev/pci/if_em.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> retrieving revision 1.349
> diff -u -p -r1.349 if_em.c
> --- dev/pci/if_em.c	24 Mar 2020 09:30:06 -0000	1.349
> +++ dev/pci/if_em.c	21 Apr 2020 12:02:28 -0000
> @@ -2809,7 +2809,8 @@ em_rxrefill(void *arg)
>  
>  	if (em_rxfill(que))
>  		E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
> -	else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
> +	else if (if_rxr_inuse(&que->rx.sc_rx_ring) <
> +	    if_rxr_lwm(&que->rx.sc_rx_ring))
>  		timeout_add(&que->rx_refill, 1);
>  }
> 
> Index: net/if_var.h
> ===================================================================
> RCS file: /cvs/src/sys/net/if_var.h,v
> retrieving revision 1.104
> diff -u -p -r1.104 if_var.h
> --- net/if_var.h	12 Apr 2020 07:02:43 -0000	1.104
> +++ net/if_var.h	21 Apr 2020 11:58:17 -0000
> @@ -389,6 +389,7 @@ u_int if_rxr_get(struct if_rxring *, u_i
>  #define if_rxr_put(_r, _c)	do { (_r)->rxr_alive -= (_c); } while (0)
>  #define if_rxr_inuse(_r)	((_r)->rxr_alive)
>  #define if_rxr_cwm(_r)		((_r)->rxr_cwm)
> +#define if_rxr_lwm(_r)		((_r)->rxr_lwm)
> 
>  int if_rxr_info_ioctl(struct if_rxrinfo *, u_int, struct if_rxring_info *);
>  int if_rxr_ioctl(struct if_rxrinfo *, const char *, u_int,
> 
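For comparison, the other fix mentioned above, rescheduling whenever
em_rxfill() fails regardless of how many descriptors remain on the
ring, would look roughly like this in em_rxrefill() (an untested
sketch, not a tested patch):

	if (em_rxfill(que))
		E1000_WRITE_REG(&sc->hw, RDT(que->me),
		    que->rx.sc_rx_desc_head);
	else
		/*
		 * em_rxfill() posted nothing; retry unconditionally on
		 * the next tick instead of only when the ring is empty.
		 */
		timeout_add(&que->rx_refill, 1);

One wrinkle with that variant: em_rxfill() can also post nothing when
the ring is already full, so an unconditional retry would keep the
timeout armed even when no refill is needed.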