On 21/04/20(Tue) 15:54, Jan Klemkow wrote:
> Hi,
> 
> The following diff fixes a deadlock in em(4) when the system runs out
> of mbuf(9)s.
> 
> Tested with current on amd64 with:
> em0 at pci0 dev 25 function 0 "Intel I217-LM" rev 0x05: msi,
> mac 0x1e phy 0xb, address d0:50:99:c1:67:94
> 
> When the system runs out of mbufs, the receive ring of the em(4) NIC
> also runs out of mbufs.  When the driver pulls an mbuf off the NIC's
> receive ring in em_rxeof, it calls em_rxrefill to refill the ring.  If
> it is not possible to refill the ring and the receive ring is empty,
> it sets a timeout to retry the refill later, until new mbufs are
> available.

So the current logic assumes that, as long as the ring still contains
descriptors, em_rxrefill() failures aren't fatal and it is not
necessary to schedule a refill.  Why?
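
For reference, the relevant lines in em_rxrefill() before the diff
(visible as context in the patch below) read:

	if (em_rxfill(que))
		E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
	else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
		timeout_add(&que->rx_refill, 1);

That is, a refill retry is only scheduled when em_rxfill() posted
nothing *and* the ring is completely empty.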

Unless we understand this, it is difficult to reason about a fix.
Have you looked at other drivers; is this a common pattern?

> In my case, the function em_rxeof is unable to pull the last one or
> two mbufs from the receive ring, because the descriptor status never
> has the "descriptor done" flag (E1000_RXD_STAT_DD) set.  Thus, the
> ring keeps one or two remaining mbufs, and the timeout to put in new
> ones is never scheduled.  Even when new mbufs become available later,
> the em driver stays stuck in that state.

Are you saying there's always some descriptor on the ring?  Did you
investigate why?
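
For illustration, the receive path stops at the first descriptor the
hardware hasn't completed, roughly along these lines.  (This is a
schematic of the usual rxeof pattern, not the literal em_rxeof() code;
`cons', `rx_desc_ring' and `nslots' are placeholders.)

	while (if_rxr_inuse(&que->rx.sc_rx_ring) > 0) {
		desc = &rx_desc_ring[cons];
		if (!ISSET(desc->status, E1000_RXD_STAT_DD))
			break;	/* hardware not done with this slot */
		/* hand the mbuf up the stack, if_rxr_put() the slot */
		cons = (cons + 1) % nslots;
	}

If DD never gets set on those last descriptors, if_rxr_inuse() stays
above 0 and, with the current logic, no refill timeout is ever armed.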

> The diff always sets the timeout to put new mbufs into the receive
> ring when it holds fewer mbufs than the "low water mark".

This changes the logic explained above.  With it, the fatal condition
changes from 0 to `lwm'.  Another possible fix would be to always
schedule a timeout as soon as em_rxrefill() fails.
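
In diff terms that alternative would be (untested sketch):

	if (em_rxfill(que))
		E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
	else
		timeout_add(&que->rx_refill, 1);

i.e. drop the if_rxr_inuse() check entirely and retry whenever
em_rxfill() couldn't post a single descriptor.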

I'm not advocating for a particular change, I'm suggesting we understand
the current logic and its implications.

This bug reminds me of sthen@'s change to make NFS usable reliably on
a ports machine.  I wonder if the two have something in common.

> A side effect is: when the number of mbufs in use (alive) drops below
> the "low water mark", the "current water mark" rises to the value of
> the "high water mark".  When new mbufs become available, the number of
> mbufs in use also rises to that value, because the "current water
> mark" only sinks during net livelocks.
> 
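Context for the water mark terms: per the if_var.h hunk below, struct
if_rxring carries roughly these counters (rxr_hwm is inferred from the
prose above; its exact name here is an assumption):

	struct if_rxring {
		...
		u_int	rxr_alive;	/* mbufs currently on the ring */
		u_int	rxr_cwm;	/* current water mark: fill target */
		u_int	rxr_lwm;	/* low water mark: minimum target */
		u_int	rxr_hwm;	/* high water mark: maximum target */
	};
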
> I'm sure this patch is not the best solution.  Maybe someone else has
> a better idea to avoid this situation?!
> 
> Bye,
> Jan
> 
> Index: dev/pci/if_em.c
> ===================================================================
> RCS file: /cvs/src/sys/dev/pci/if_em.c,v
> retrieving revision 1.349
> diff -u -p -r1.349 if_em.c
> --- dev/pci/if_em.c   24 Mar 2020 09:30:06 -0000      1.349
> +++ dev/pci/if_em.c   21 Apr 2020 12:02:28 -0000
> @@ -2809,7 +2809,8 @@ em_rxrefill(void *arg)
>  
>       if (em_rxfill(que))
>               E1000_WRITE_REG(&sc->hw, RDT(que->me), que->rx.sc_rx_desc_head);
> -     else if (if_rxr_inuse(&que->rx.sc_rx_ring) == 0)
> +     else if (if_rxr_inuse(&que->rx.sc_rx_ring) <
> +         if_rxr_lwm(&que->rx.sc_rx_ring))
>               timeout_add(&que->rx_refill, 1);
>  }
>  
> Index: net/if_var.h
> ===================================================================
> RCS file: /cvs/src/sys/net/if_var.h,v
> retrieving revision 1.104
> diff -u -p -r1.104 if_var.h
> --- net/if_var.h      12 Apr 2020 07:02:43 -0000      1.104
> +++ net/if_var.h      21 Apr 2020 11:58:17 -0000
> @@ -389,6 +389,7 @@ u_int     if_rxr_get(struct if_rxring *, u_i
>  #define if_rxr_put(_r, _c)   do { (_r)->rxr_alive -= (_c); } while (0)
>  #define if_rxr_inuse(_r)     ((_r)->rxr_alive)
>  #define if_rxr_cwm(_r)               ((_r)->rxr_cwm)
> +#define if_rxr_lwm(_r)               ((_r)->rxr_lwm)
>  
>  int  if_rxr_info_ioctl(struct if_rxrinfo *, u_int, struct if_rxring_info *);
>  int  if_rxr_ioctl(struct if_rxrinfo *, const char *, u_int,
> 
