On Mon, Apr 27, 2020 at 10:42:52AM +0200, Martin Pieuchot wrote:
> On 21/04/20(Tue) 15:54, Jan Klemkow wrote:
> > The following diff fixes a deadlock in em(4) when the system runs
> > out of mbuf(9)s.
> > 
> > Tested with current on amd64 with:
> > em0 at pci0 dev 25 function 0 "Intel I217-LM" rev 0x05: msi, mac 0x1e phy 
> > 0xb, address d0:50:99:c1:67:94
> > 
> > When the system runs out of mbufs, the receive ring of the em(4)
> > NIC also runs out of mbufs.  Whenever the driver pulls an mbuf from
> > the NIC's receive ring in em_rxeof(), it calls em_rxrefill() to
> > refill the ring.  If it is not possible to refill the ring and the
> > receive ring is empty, it sets a timeout to retry the refill later,
> > until new mbufs are available.
> 
> So the current logic assumes that, as long as the ring contains
> descriptors, em_rxrefill() failures aren't fatal and it is not
> necessary to schedule a refill.  Why?

I'm not sure.  I only figured out what I describe below.

> Unless we understand this, it is difficult to reason about a fix.
> Have you looked at other drivers?  Is it a common pattern?

Drivers like iavf(4) and ixl(4) only set the timeout when the alive
counter (what if_rxr_inuse() returns) is 0.  Other drivers like ix(4)
and bnx(4) arm the timeout whenever the refill was not possible.
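
To illustrate the difference, here is a minimal sketch of the two
patterns (hypothetical function and structure names, not the actual
driver code):

/*
 * Sketch of the two refill-timeout strategies (hypothetical names).
 */

/* iavf(4)/ixl(4) style: only rearm when the ring is completely empty. */
void
xx_rxrefill_when_empty(struct xx_queue *q)
{
	if (xx_rxfill(q))
		xx_write_tail(q);		/* hand new slots to the NIC */
	else if (if_rxr_inuse(&q->rx_ring) == 0)
		timeout_add(&q->rx_refill, 1);	/* retry on the next tick */
}

/* ix(4)/bnx(4) style: rearm whenever the refill did not succeed. */
void
xx_rxrefill_on_failure(struct xx_queue *q)
{
	if (xx_rxfill(q))
		xx_write_tail(q);
	else
		timeout_add(&q->rx_refill, 1);
}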

I adjusted the patch to match the behavior of ix(4) (see below).

> > In my case, the function em_rxeof() is unable to pull the last one
> > or two mbufs from the receive ring, because the descriptor status
> > never gets the "descriptor done" flag (E1000_RXD_STAT_DD) set.
> > Thus, the ring keeps one or two remaining mbufs, and the timeout to
> > put new ones in is never set.  Even when new mbufs become available
> > later, the em(4) driver stays stuck in that situation.
> 
> Are you saying there's always some descriptor on the ring?  Did you
> investigate why?

Yes, via printf debugging: I see that we return from em_rxeof()
because the status bit E1000_RXD_STAT_DD is not set, and systat(8)
shows one or two remaining mbufs in the ALIVE column.
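
Roughly, the spot in em_rxeof() that I instrumented looks like this
(a simplified sketch; the descriptor and ring names only approximate
if_em.c and are not verbatim driver code):

/*
 * Simplified sketch of the "descriptor done" check in em_rxeof()
 * with the debug printf I used.
 */
static int
em_rxd_done(struct em_queue *que, struct em_rx_desc *desc)
{
	if (!ISSET(desc->status, E1000_RXD_STAT_DD)) {
		/* The NIC hasn't completed this descriptor: stop pulling. */
		printf("em_rxeof: DD not set, alive=%u\n",
		    if_rxr_inuse(&que->rx.sc_rx_ring));
		return (0);
	}
	return (1);
}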

> > The diff always sets the timeout to put new mbufs into the receive
> > ring when it has fewer mbufs than the "low water mark".
> 
> This changes the logic explained above.  With it the fatal condition
> changes from 0 to `lwm'.  Another possible fix would be to always
> schedule a timeout as soon as em_rxrefill() fails.

I adjusted the patch to this behavior, as I mentioned above.

> I'm not advocating for a particular change, I'm suggesting we understand
> the current logic and its implications.

The logic looks to me like this: the NIC should first fill the last
mbuf (descriptor), and only then do we set a timeout to refill the
ring with new mbufs (descriptors).  It seems to me that one or two
mbufs may be too few for the hardware to operate correctly.

But as far as I understand Intel's documentation, it should be fine
to deal with such a low number of mbufs.  I couldn't find any text
saying that the firmware needs a minimum amount of DMA memory.

> This bug reminds me of sthen@'s change to be able to use NFS reliably
> on a ports machine.  I wonder if they have something in common.
> 
> > A side effect is: when the mbufs in use (alive) drop below the
> > "low water mark", the "current water mark" rises up to the value of
> > the "high water mark".  When new mbufs are available, the number of
> > mbufs in use also rises to that value, because the "current water
> > mark" only sinks during net livelocks.
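
For reference, the usual if_rxr(9) fill pattern in which this
adjustment happens looks roughly like this (a minimal sketch with
hypothetical driver names; the water mark bookkeeping itself lives
inside if_rxr_get(9)):

/*
 * Minimal sketch of an if_rxr(9)-based fill loop (hypothetical
 * names).  if_rxr_get() returns how many descriptors we may fill,
 * bounded by the current water mark, and adjusts the water marks
 * as described above.
 */
int
xx_rxfill(struct xx_queue *q)
{
	u_int slots;
	int post = 0;

	for (slots = if_rxr_get(&q->rx_ring, XX_NRXDESC);
	    slots > 0; slots--) {
		if (xx_get_buf(q) != 0)	/* mbuf allocation may fail */
			break;
		post = 1;
	}
	if_rxr_put(&q->rx_ring, slots);	/* give back unfilled slots */

	return (post);
}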

Thanks,
Jan

Index: dev/pci/if_em.c
===================================================================
RCS file: /cvs/src/sys/dev/pci/if_em.c,v
retrieving revision 1.351
diff -u -p -r1.351 if_em.c
--- dev/pci/if_em.c     26 Apr 2020 20:49:56 -0000      1.351
+++ dev/pci/if_em.c     8 May 2020 08:49:40 -0000
@@ -2833,7 +2833,7 @@ em_rxrefill(void *arg)
 
        if (em_rxfill(que))
                E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
-       else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
+       else
                timeout_add(&que->rx_refill, 1);
 }
