On Wed, 2017-08-23 at 15:49 -0700, Florian Fainelli wrote:
> On 08/23/2017 03:26 PM, Eric Dumazet wrote:
> > On Wed, 2017-08-23 at 13:02 -0700, Florian Fainelli wrote:
> >> Hi,
> >>
> >> On Broadcom STB chips using bcmsysport.c and bcm_sf2.c we have an out of
> >> band HW mechanism (not using per-flow pause frames) where we can have
> >> the integrated network switch backpressure the CPU Ethernet controller,
> >> which translates into completing TX packet interrupts at the appropriate
> >> pace and therefore gets flow control applied end-to-end from the host CPU
> >> port towards any downstream port. At least that is the premise and this
> >> works reasonably well.
> >>
> >> This has a few drawbacks in that each of the bcmsysport TX queues needs
> >> to semi-statically map to its switch port output queue such that the
> >> switch can calculate buffer occupancy and report congestion status,
> >> which prompted this email [1], but this is tangential and is a policy,
> >> not a mechanism, issue.
> >>
> >> [1]: https://www.spinics.net/lists/netdev/msg448153.html
> >>
> >> This is useful when your CPU / integrated switch links up at 1Gbits/sec
> >> internally and tries to push 1Gbits/sec worth of UDP traffic to e.g. a
> >> downstream port linking at 100Mbits/sec, which could happen depending on
> >> what you have connected to this device.
> >>
> >> Now the problem that I am facing is the following:
> >>
> >> - net.core.wmem_default = 160KB (default value)
> >> - using iperf -b 800M -u towards an iperf UDP server with the physical
> >> link to that server established at 100Mbits/sec
> >> - iperf does synchronous write(2) AFAICT, so this gives it flow control
> >> - using the default duration of 10s, you can barely see any packet loss
> >> from one run to another
> >> - the longer the run, the more packet loss you are going to see,
> >> usually in the range of ~0.15% tops
> >>
> >> The transmit flow looks like this:
> >>
> >> gphy (net/dsa/slave.c::dsa_slave_xmit, IFF_NO_QUEUE device)
> >> -> eth0 (drivers/net/ethernet/broadcom/bcmsysport.c, "regular" network
> >> device)
> >>
> >> I can clearly see that the network stack pushed N UDP packets (the Udp
> >> and Ip counters in /proc/net/snmp concur); however, what the driver
> >> transmitted and what the switch transmitted is N - M, and M matches the
> >> packet loss reported by the UDP server. I don't measure any SndbufErrors,
> >> which is not making sense yet.
> >>
> >> If I reduce the default socket buffer size to, say, 10x less than 160KB
> >> (16KB), then I either don't see any packet loss at 100Mbits/sec for 5
> >> minutes or more, or just very, very little, down to 0.001%. Now if I
> >> repeat the experiment with the physical link at 10Mbits/sec, same thing:
> >> the 16KB wmem_default setting is no longer working and we need to lower
> >> the socket write buffer size again.
> >>
> >> So what I am wondering is:
> >>
> >> - do I have an obvious flow control problem in my network driver that
> >> usually does not lead to packet loss, but may sometimes happen?
> >>
> >> - why would lowering the socket write size appear to mask or solve
> >> this problem?
> >>
> >> I can consistently reproduce this across several kernel versions, 4.1,
> >> 4.9 and latest net-next, and can therefore also test patches.
> >>
> >> Thanks for reading thus far!
> >
> > Have you checked qdisc counters ? Maybe drops happen there.
>
> CONFIG_NET_SCHED is actually disabled in this kernel configuration.
>
> But even with that enabled, I don't see any drops being reported at the
> qdisc level, see below.
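(For reference, when qdiscs are in use, the per-qdisc drop counters can be
inspected with something like

tc -s qdisc show dev eth0

where eth0 is just an example interface name; each qdisc reports a "dropped"
count in its statistics. Driver-level counters, e.g. from ethtool -S eth0,
are another place worth comparing against.)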
You might try

perf record -a -g -e skb:kfree_skb sleep 10
perf report
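The skb:kfree_skb tracepoint fires when a packet is freed via kfree_skb()
(i.e. dropped), as opposed to consume_skb() for packets that were sent
normally, so the call graphs in the report should point at the code path
discarding the packets. To browse the result on a console, something like

perf report --stdio

or perf script can be used to dump the individual events with their stack
traces.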