On Thu, 17 Nov 2016 08:21:19 -0800 Eric Dumazet <eric.duma...@gmail.com> wrote:
> On Thu, 2016-11-17 at 15:57 +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 17 Nov 2016 06:17:38 -0800
> > Eric Dumazet <eric.duma...@gmail.com> wrote:
> >
> > > On Thu, 2016-11-17 at 14:42 +0100, Jesper Dangaard Brouer wrote:
> > >
> > > > I can see that qdisc layer does not activate xmit_more in this case.
> > >
> > > Sure. Not enough pressure from the sender(s).
> > >
> > > The bottleneck is not the NIC or qdisc in your case, meaning that BQL
> > > limit is kept at a small value.
> > >
> > > (BTW not all NIC have expensive doorbells)
> >
> > I believe this NIC mlx5 (50G edition) does.
> >
> > I'm seeing UDP TX of 1656017.55 pps, which is per packet:
> >  2414 cycles(tsc) 603.86 ns
> >
> > Perf top shows (with my own udp_flood, that avoids __ip_select_ident):
> >
> >  Samples: 56K of event 'cycles', Event count (approx.): 51613832267
> >    Overhead  Command    Shared Object     Symbol
> >  +    8.92%  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
> >     - _raw_spin_lock
> >        + 90.78% __dev_queue_xmit
> >        +  7.83% dev_queue_xmit
> >        +  1.30% ___slab_alloc
> >  +    5.59%  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
>
> Does TX completion happens on same cpu ?
>
> >  +    4.77%  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
> >  +    4.09%  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
>
> Why fib_table_lookup() is used with connected UDP flow ???

I don't know. I would be interested in hints on how to avoid this!
I also see it with netperf, and my udp_flood code is here:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c

> >  +    4.00%  swapper    [mlx5_core]       [k] mlx5e_poll_tx_cq
> >  +    3.11%  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
>
> Same here, this is suspect.

It is the function calling fib_table_lookup(), and it is called by
ip_route_output_flow().

> >  +    2.49%  swapper    [kernel.vmlinux]  [k] __slab_free
> >
> > In this setup the spinlock in __dev_queue_xmit should be uncongested.
> > An uncongested spin_lock+unlock cost 32 cycles(tsc) 8.198 ns on this system.
> >
> > But 8.92% of the time is spend on it, which corresponds to a cost of 215
> > cycles (2414*0.0892). This cost is too high, thus something else is
> > going on... I claim this mysterious extra cost is the tailptr/doorbell.
>
> Well, with no pressure, doorbell is triggered for each packet.
>
> Since we can not predict that your application is going to send yet
> another packet one usec later, we can not avoid this.

The point is that I can see a socket Send-Q forming, thus we do know that
the application has something to send. Thus, there is a possibility for
non-opportunistic bulking. Allowing/implementing bulk enqueue from the
socket layer into the qdisc layer should be fairly simple (and the rest
of the xmit_more machinery is already in place).
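To illustrate what the driver side already supports: a minimal sketch of an
ndo_start_xmit() that defers the doorbell based on skb->xmit_more. This is
not the mlx5 code; my_tx_ring, my_pick_ring(), my_post_descriptor() and
my_ring_doorbell() are made-up placeholders for whatever a given driver uses
to post descriptors and ring its doorbell:

  #include <linux/netdevice.h>
  #include <linux/skbuff.h>

  static netdev_tx_t my_ndo_start_xmit(struct sk_buff *skb,
                                       struct net_device *dev)
  {
          struct netdev_queue *txq;
          struct my_tx_ring *ring = my_pick_ring(dev, skb);  /* placeholder */

          txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

          /* Post the TX descriptor, but do not touch the NIC doorbell yet */
          my_post_descriptor(ring, skb);                     /* placeholder */

          /* Only pay for the tailptr/doorbell write when the stack signals
           * that no more packets follow immediately, or the queue stops.
           */
          if (!skb->xmit_more || netif_xmit_stopped(txq))
                  my_ring_doorbell(ring);                    /* placeholder */

          return NETDEV_TX_OK;
  }

With bulk enqueue from the socket layer, the qdisc bulk dequeue could hand
the driver several skbs with skb->xmit_more set, so only the last skb in the
bulk would pay the doorbell cost.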
> Note that with the patches I am working on (busypolling extentions),
> we could decide to let the busypoller doing the doorbells, say one every
> 10 usec. (or after ~16 packets were queued)

Sounds interesting! But that is an opportunistic approach.

> But mlx4 uses two different NAPI for TX and RX, maybe mlx5 has the same
> strategy .

It is possible that TX completions were happening on another CPU
(but I don't think so for mlx5).

To rule that out, I reran the experiment, making sure to pin everything
to CPU-0, and the results are the same.

 sudo ethtool -L mlx5p2 combined 1
 sudo sh -c '\
 for x in /proc/irq/*/mlx5*/../smp_affinity; do \
   echo 01 > $x; grep . -H $x; done \
 '

 $ taskset -c 0 ./udp_flood --sendto 198.18.50.1 --count $((10**9))

Perf report validating the CPU in use:

 $ perf report -g --no-children --sort cpu,comm,dso,symbol --stdio --call-graph none

 # Overhead  CPU  Command    Shared Object     Symbol
 # ........  ...  .........  ................  .................................
 #
      9.97%  000  udp_flood  [kernel.vmlinux]  [k] _raw_spin_lock
      4.37%  000  udp_flood  [kernel.vmlinux]  [k] fib_table_lookup
      3.97%  000  udp_flood  [mlx5_core]       [k] mlx5e_sq_xmit
      3.06%  000  udp_flood  [kernel.vmlinux]  [k] __ip_route_output_key_hash
      2.51%  000  udp_flood  [mlx5_core]       [k] mlx5e_poll_tx_cq
      2.48%  000  udp_flood  [kernel.vmlinux]  [k] copy_user_enhanced_fast_string
      2.47%  000  udp_flood  [kernel.vmlinux]  [k] entry_SYSCALL_64
      2.42%  000  udp_flood  [kernel.vmlinux]  [k] udp_sendmsg
      2.39%  000  udp_flood  [kernel.vmlinux]  [k] __ip_append_data.isra.47
      2.29%  000  udp_flood  [kernel.vmlinux]  [k] sock_def_write_space
      2.19%  000  udp_flood  [mlx5_core]       [k] mlx5e_get_cqe
      1.95%  000  udp_flood  [kernel.vmlinux]  [k] __ip_make_skb
      1.94%  000  udp_flood  [kernel.vmlinux]  [k] __alloc_skb
      1.90%  000  udp_flood  [kernel.vmlinux]  [k] sock_wfree
      1.85%  000  udp_flood  [kernel.vmlinux]  [k] skb_set_owner_w
      1.62%  000  udp_flood  [kernel.vmlinux]  [k] ip_finish_output2
      1.61%  000  udp_flood  [kernel.vmlinux]  [k] kfree
      1.54%  000  udp_flood  [kernel.vmlinux]  [k] entry_SYSCALL_64_fastpath
      1.35%  000  udp_flood  libc-2.17.so      [.] __sendmsg_nocancel
      1.26%  000  udp_flood  [kernel.vmlinux]  [k] __kmalloc_node_track_caller
      1.24%  000  udp_flood  [kernel.vmlinux]  [k] __rcu_read_unlock
      1.23%  000  udp_flood  [kernel.vmlinux]  [k] __local_bh_enable_ip
      1.22%  000  udp_flood  [kernel.vmlinux]  [k] ip_idents_reserve

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer