On Mon, 28 Nov 2016 11:52:38 +0100 Paolo Abeni <pab...@redhat.com> wrote:
> Hi Jesper,
>
> On Fri, 2016-11-25 at 18:37 +0100, Jesper Dangaard Brouer wrote:
> > > The measured performance delta is as follows:
> > >
> > >                     before        after
> > >                     (Kpps)        (Kpps)
> > >
> > > udp flood[1]           570         1800(+215%)
> > > max tput[2]           1850         3500(+89%)
> > > single queue[3]       1850         1630(-11%)
> > >
> > > [1] line rate flood using multiple 64 bytes packets and multiple flows
> >
> > Is [1] sending multiple flows into a single UDP-sink?
>
> Yes, in the test scenario [1] there are multiple UDP flows using 16
> different rx queues on the receiver host, and a single user space
> reader.
>
> > > [2] like [1], but using the minimum number of flows to saturate the
> > >     user space sink, that is 1 flow for the old kernel and 3 for the
> > >     patched one. the tput increases since the contention on the rx
> > >     lock is low.
> > > [3] like [1] but using a single flow with both old and new kernel.
> > >     All the packets land on the same rx queue and there is a single
> > >     ksoftirqd instance running
> >
> > It is important to know whether ksoftirqd and the UDP-sink run on the
> > same CPU.
>
> No pinning is enforced. The scheduler moves the user space process to a
> different cpu with respect to the ksoftirqd kernel thread.

This floating userspace process can cause a high variation between test
runs. On my system, the performance drops to approx 600Kpps when
ksoftirqd and udp_sink share the same CPU.
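As a side note, the pinning can also be enforced from inside the sink
process itself instead of via taskset. A minimal Python sketch of
explicit CPU affinity on Linux; in a real run you would pick a CPU
*different* from the one servicing the NIC rx queue, CPU 0 is used here
only so the sketch runs on any box:

```python
import os

# Pin the current process (think: udp_sink) to a single CPU.  To avoid
# the ~600Kpps degenerate case above, this CPU should differ from the
# one running ksoftirqd for the rx queue.
os.sched_setaffinity(0, {0})

# Read the mask back to verify the pinning took effect.
print(os.sched_getaffinity(0))   # {0}
```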
Quick run with your patches applied:

Sender: pktgen with big packets:

 ./pktgen_sample03_burst_single_flow.sh -i mlx5p2 -d 198.18.50.1 \
     -m 7c:fe:90:c7:b1:cf -t1 -b128 -s 1472

Forced CPU0 for both ksoftirqd and udp_sink:

 # taskset -c 0 ./udp_sink --count $((10**7)) --port 9 --repeat 1
                                 ns       pps        cycles
 recvMmsg/32  run: 0 10000000    1667.93  599547.16  6685
 recvmsg      run: 0 10000000    1810.70  552273.39  7257
 read         run: 0 10000000    1634.72  611723.95  6552
 recvfrom     run: 0 10000000    1585.06  630891.39  6353

> > > The regression in the single queue scenario is actually due to the
> > > improved performance of the recvmmsg() syscall: the user space
> > > process is now significantly faster than the ksoftirqd process, so
> > > that the latter often needs to wake up the user space process.
> >
> > When measuring these things, make sure that we/you measure both the
> > packets actually received in the userspace UDP-sink, and also measure
> > packets RX processed by ksoftirq (and I often also look at what HW got
> > delivered). Sometimes, when userspace is too slow, the kernel can/will
> > drop packets.
> >
> > It is actually quite easily verified with cmdline:
> >
> >  nstat > /dev/null && sleep 1 && nstat
> >
> > For HW measurements I use the tool ethtool_stats.pl:
> >  https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl
>
> We collected the UDP stats for all the three scenarios; we have lots of
> drops in test [1] and few, by design, in test [2]. In test [3], with
> the patched kernel, the drops are 0: ksoftirqd is way slower than the
> user space sink.
>
> > > Since ksoftirqd is the bottleneck in such a scenario, overall this
> > > causes a tput reduction. In a real use case, where the udp sink is
> > > performing some actual processing of the received data, such a
> > > regression is unlikely to really have an effect.
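Side note on reading the udp_sink output: the columns are mutually
consistent. pps is simply 1e9 divided by the per-packet ns, and
cycles/ns gives the CPU clock in GHz (about 4 GHz on this box; that
figure is inferred from the numbers, not measured separately). A quick
Python sanity check against the recvMmsg/32 row above:

```python
# Figures taken from the recvMmsg/32 row of the udp_sink run above.
ns_per_pkt = 1667.93      # nanoseconds spent per packet
pps        = 599547.16    # packets per second reported
cycles     = 6685         # CPU cycles spent per packet

# pps is derived from the per-packet time: one second / ns_per_pkt.
assert abs(1e9 / ns_per_pkt - pps) / pps < 1e-3

# cycles / ns approximates the CPU clock in GHz (~4 GHz here).
print(round(cycles / ns_per_pkt, 2))   # 4.01
```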
> > My experience is that the performance of RX UDP is affected by:
> >  * if socket is connected or not (yes, RX side also)
> >  * state of /proc/sys/net/ipv4/ip_early_demux
> >
> > You don't need to run with all the combinations, but it would be nice
> > if you specify what config you have based your measurements on (and
> > keep them stable across your runs).
> >
> > I've actually implemented the "--connect" option to my udp_sink
> > program[1] today, but I've not pushed it yet, if you are interested.
>
> The reported numbers are all gathered with unconnected sockets and
> early demux enabled.
>
> We also used connected sockets for test [3], with relatively little
> difference (the tput increased for both unpatched and patched kernel,
> and the difference was roughly the same).

When I use connected sockets (RX side) and ip_early_demux enabled, I do
see a performance boost for recvmmsg.

With these patches applied, forced ksoftirqd on CPU0 and udp_sink on
CPU2, pktgen single flow sending size 1472 bytes:

 $ sysctl net/ipv4/ip_early_demux
 net.ipv4.ip_early_demux = 1

 $ grep -H . /proc/sys/net/core/{r,w}mem_max
 /proc/sys/net/core/rmem_max:1048576
 /proc/sys/net/core/wmem_max:1048576

 # taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1
 #                               ns       pps         cycles
 recvMmsg/32  run: 0 10000000    462.51   2162095.23  1853
 recvmsg      run: 0 10000000    536.47   1864041.75  2150
 read         run: 0 10000000    492.01   2032460.71  1972
 recvfrom     run: 0 10000000    553.94   1805262.84  2220

 # taskset -c 2 ./udp_sink --count $((10**7)) --port 9 --repeat 1 --connect
 #                               ns       pps         cycles
 recvMmsg/32  run: 0 10000000    405.15   2468225.03  1623
 recvmsg      run: 0 10000000    548.23   1824049.58  2197
 read         run: 0 10000000    489.76   2041825.27  1962
 recvfrom     run: 0 10000000    466.18   2145091.77  1868

My theory is that with a connected RX socket the ksoftirqd gets faster
(no fib_lookup) and is no longer the bottleneck. This is confirmed by
the nstat output below.
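For readers unfamiliar with the trick: connect() on the *receiving* UDP
socket just fixes the remote address/port on the socket, which is what
lets early demux match an incoming packet straight to the socket and
skip the route lookup. A minimal self-contained Python illustration
over loopback (addresses and ports are arbitrary; the performance
effect naturally only shows up on a real NIC path):

```python
import socket

# Receiver socket: bind to a local address first ...
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))

# Sender socket, bound so we know its address/port.
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.bind(("127.0.0.1", 0))

# ... then connect() the receiving socket to the expected sender.
# This pins the remote 4-tuple on the RX socket (and makes it drop
# datagrams arriving from any other source).
rx.connect(tx.getsockname())

tx.sendto(b"ping", rx.getsockname())
data = rx.recv(16)
print(data)   # b'ping'
```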
Below: unconnected

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    2143944            0.0
 IpInDelivers                    2143945            0.0
 UdpInDatagrams                  2143944            0.0
 IpExtInOctets                   3125889306         0.0
 IpExtInNoECTPkts                2143956            0.0

Below: connected

 $ nstat > /dev/null && sleep 1 && nstat
 #kernel
 IpInReceives                    2925155            0.0
 IpInDelivers                    2925156            0.0
 UdpInDatagrams                  2440925            0.0
 UdpInErrors                     484230             0.0
 UdpRcvbufErrors                 484230             0.0
 IpExtInOctets                   4264896402         0.0
 IpExtInNoECTPkts                2925170            0.0

This is a 50Gbit/s link, and IpInReceives corresponds to approx
35Gbit/s.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer