(I accidentally dropped netdev on my earlier message... here is Eric's response, which also didn't go to the group)
---------- Forwarded message ---------
From: Eric Dumazet <eric.duma...@gmail.com>
Date: Mon, Sep 30, 2019 at 6:53 PM
Subject: Re: BUG: sk_backlog.len can overestimate
To: John Ousterhout <ous...@cs.stanford.edu>

On 9/30/19 5:41 PM, John Ousterhout wrote:
> On Mon, Sep 30, 2019 at 5:14 PM Eric Dumazet <eric.duma...@gmail.com> wrote:
>>
>> On 9/30/19 4:58 PM, John Ousterhout wrote:
>>> As of 4.16.10, it appears to me that sk->sk_backlog_len does not
>>> provide an accurate estimate of backlog length; this reduces the
>>> usefulness of the "limit" argument to sk_add_backlog.
>>>
>>> The problem is that, under heavy load, sk->sk_backlog_len can grow
>>> arbitrarily large, even though the actual amount of data in the
>>> backlog is small. This happens because __release_sock doesn't reset
>>> the backlog length until it gets completely caught up. Under heavy
>>> load, new packets can be arriving continuously into the backlog
>>> (which increases sk_backlog.len) while other packets are being
>>> serviced. This can go on forever, so sk_backlog.len never gets reset
>>> and it can become arbitrarily large.
>>
>> Certainly not.
>>
>> It can not grow arbitrarily large, unless a backport gone wrong maybe.
>
> Can you help me understand what would limit the growth of this value?
> Suppose that new packets are arriving as quickly as they are
> processed. Every time __release_sock calls sk_backlog_rcv, a new
> packet arrives during the call, which is added to the backlog,
> incrementing sk_backlog.len. However, sk_backlog.len doesn't get
> decreased when sk_backlog_rcv completes, since the backlog hasn't
> emptied (as you said, it's not "safe"). As a result, sk_backlog.len
> has increased, but the actual backlog length is unchanged (one packet
> was added, one was removed). Why can't this process repeat
> indefinitely, until eventually sk_backlog.len reaches whatever limit
> the transport specifies when it invokes sk_add_backlog? At this point
> packets will be dropped by the transport even though the backlog isn't
> actually very large.

The process is bounded by socket sk_rcvbuf + sk_sndbuf:

bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
{
        u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
        ...
        if (unlikely(sk_add_backlog(sk, skb, limit))) {
                ...
                __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPBACKLOGDROP);
                ...
        }

Once the limit is reached, sk_backlog.len won't be touched, unless
__release_sock() has processed the whole queue.

>
>>> Because of this, the "limit" argument to sk_add_backlog may not be
>>> useful, since it could result in packets being discarded even though
>>> the backlog is not very large.
>>>
>>
>> You will have to study git log/history for the details, the limit _is_ useful,
>> and we reset the limit in __release_sock() only when _safe_.
>>
>> Assuming you talk about TCP, then I suggest you use a more recent kernel.
>>
>> linux-5.0 got coalescing in the backlog queue, which helped quite a bit.
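
To make the arithmetic in the exchange above concrete, here is a minimal stand-alone
user-space sketch of the bookkeeping being debated. It is not kernel code: the names
(backlog_len, queued, LIMIT, PKT_CHARGE) are invented for the example, with LIMIT
standing in for the sk_rcvbuf + sk_sndbuf cap that tcp_add_backlog() passes to
sk_add_backlog(), and the one-in/one-out loop standing in for packets arriving while
__release_sock() drains the queue.

/*
 * Stand-alone sketch of the accounting discussed above.  All names are
 * made up for illustration; only the roles correspond to kernel fields:
 * backlog_len plays sk->sk_backlog.len, LIMIT plays the
 * sk_rcvbuf + sk_sndbuf cap used by tcp_add_backlog().
 */
#include <stdio.h>

#define LIMIT       64  /* stand-in for sk_rcvbuf + sk_sndbuf */
#define PKT_CHARGE   1  /* each packet charges one unit       */

int main(void)
{
        int backlog_len = PKT_CHARGE; /* accounted length (sk_backlog.len)   */
        int queued = 1;               /* packets actually sitting in backlog */
        int round;

        for (round = 0; ; round++) {
                /* A new packet arrives while the previous one is still
                 * being processed, i.e. while the drain loop is running. */
                if (backlog_len + PKT_CHARGE > LIMIT) {
                        printf("round %d: drop; accounted len = %d, real queue = %d\n",
                               round, backlog_len, queued);
                        break;
                }
                backlog_len += PKT_CHARGE;
                queued++;

                /* One packet finishes processing.  The accounted length is
                 * only reset when the queue is completely empty, which never
                 * happens under this steady one-in/one-out load. */
                queued--;
                if (queued == 0)
                        backlog_len = 0;
        }
        return 0;
}

Compiled and run, this prints a drop with accounted len = 64 but a real queue of a
single packet: the accounted length is indeed bounded by the limit, as Eric says, yet
the drop happens while the backlog itself is nearly empty, which is the overestimate
the original report describes.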