On Mon, Apr 27, 2020 at 10:42:52AM +0200, Martin Pieuchot wrote:
> On 21/04/20(Tue) 15:54, Jan Klemkow wrote:
> > The following diff fixes a deadlock in em(4) when the system runs
> > out of mbuf(9)s.
> >
> > Tested with -current on amd64 with:
> > em0 at pci0 dev 25 function 0 "Intel I217-LM" rev 0x05: msi, mac 0x1e phy
> > 0xb, address d0:50:99:c1:67:94
> >
> > When the system runs out of mbufs, the receive ring of the em(4)
> > NIC also runs out of mbufs. When the driver pulls an mbuf off the
> > NIC's receive ring in em_rxeof(), it calls em_rxrefill() to refill
> > the ring. If it is not possible to refill the ring and the receive
> > ring is empty, it sets a timeout to retry the refill later, until
> > new mbufs are available.
>
> So the current logic assumes that as long as the ring contains
> descriptors, em_rxrefill() failures aren't fatal and it is not
> necessary to schedule a refill. Why?

I'm not sure. I just figured out what I described below.
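
To make the logic in question concrete, here is my reading of the
current em_rxrefill(), reconstructed from the context lines of the
diff at the end of this mail; the local variable declarations are my
assumption, not copied from the tree:

void
em_rxrefill(void *arg)
{
	struct em_queue *que = arg;	/* assumed declarations */
	struct em_softc *sc = que->sc;

	if (em_rxfill(que)) {
		/* Refill succeeded: advance the ring's tail pointer. */
		E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
	} else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0) {
		/*
		 * Retry only if the refill failed *and* no descriptor
		 * is alive at all. A failed refill with one or two
		 * descriptors still on the ring schedules nothing.
		 */
		timeout_add(&que->rx_refill, 1);
	}
}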

> Unless we understand this it is difficult to reason about a fix.
> Have you looked at other drivers, is it a common pattern?

Drivers like iavf(4) and ixl(4) just set the timeout when the alive
counter (what if_rxr_inuse() returns) is 0. Other drivers like ix(4)
and bnx(4) set the timeout whenever the refill was not possible. I
adjusted the patch to the behavior of ix(4) (see below).

> > In my case, the function em_rxeof() is unable to pull the last one
> > or two mbufs from the receive ring, because the descriptor status
> > hasn't set the "descriptor done" flag (E1000_RXD_STAT_DD). Thus,
> > the ring has one or two remaining mbufs, and the timeout to put new
> > ones in is never set. When new mbufs become available later, the
> > em(4) driver stays in that situation.
>
> Are you saying there's always some descriptor on the ring? Did you
> investigate why?

Yes, with printf-debugging: I see that we return from em_rxeof()
because the status bit E1000_RXD_STAT_DD is not set. And systat(8)
shows one or two remaining mbufs in the ALIVE column.

> > The diff always sets the timeout to put new mbufs in the receive
> > ring when it has fewer mbufs than the "low water mark".
>
> This changes the logic explained above. With it the fatal condition
> changes from 0 to `lwm'. Another possible fix would be to always
> schedule a timeout as soon as em_rxrefill() fails.

I adjusted the patch to this behavior, as I mentioned above.

> I'm not advocating for a particular change, I'm suggesting we
> understand the current logic and its implications.

The logic looks to me like this: the NIC should first fill the last
mbuf (descriptor), then we set a timeout to refill new mbufs
(descriptors). It seems to me that one or two mbufs may be too few to
operate correctly. But as far as I understand Intel's documentation,
it should be fine to deal with such a low number of mbufs. I couldn't
find any text saying that the firmware needs a minimum amount of DMA
memory.

> This bug reminds me of sthen@'s change to be able to use NFS stably
> on a ports machine. I wonder if they have something in common.
>
> > A side effect is: when the number of mbufs in use (alive) is below
> > the "low water mark", the "current water mark" rises up to the
> > value of the "high water mark". When new mbufs are available, the
> > number of mbufs in use also rises to that value, because the
> > "current water mark" only sinks during net livelocks.

Thanks,
Jan

Index: dev/pci/if_em.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
retrieving revision 1.351
diff -u -p -r1.351 if_em.c
--- dev/pci/if_em.c	26 Apr 2020 20:49:56 -0000	1.351
+++ dev/pci/if_em.c	8 May 2020 08:49:40 -0000
@@ -2833,7 +2833,7 @@ em_rxrefill(void *arg)
 
 	if (em_rxfill(que))
 		E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
-	else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
+	else
 		timeout_add(&que->rx_refill, 1);
 }
 
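
For comparison, here is the same sketch with the diff applied (again,
the declarations are assumed): the only change is that a failed refill
now always reschedules itself, in the style of ix(4) and bnx(4),
instead of only when the ring is completely empty.

void
em_rxrefill(void *arg)
{
	struct em_queue *que = arg;	/* assumed declarations */
	struct em_softc *sc = que->sc;

	if (em_rxfill(que)) {
		/* Refill succeeded: advance the ring's tail pointer. */
		E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
	} else {
		/*
		 * Retry on the next tick even if one or two descriptors
		 * are still alive, so the ring can recover once mbufs
		 * become available again.
		 */
		timeout_add(&que->rx_refill, 1);
	}
}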