On Fri, 16 Mar 2018 14:05:00 -0700 Matthew Wilcox <wi...@infradead.org> wrote:

> I understand your concern about the cacheline bouncing between the
> freeing and allocating CPUs.  Is cross-CPU freeing a frequent
> occurrence?  From looking at its current usage, it seemed like the
> allocation and freeing were usually on the same CPU.

While we/the-network-stack in many cases try to alloc and free on the
same CPU, in practical default setups it will be a common case to
alloc and free on different CPUs.  The scheduler moves processes
between CPUs, and irqbalance changes which CPU does the DMA
TX completion (in case of forwarding).

I usually pin/align the NIC IRQs manually (via proc smp_affinity_list)
and manually pin/taskset the userspace process (and make sure to test
both the local and remote alloc/free cases when benchmarking).

I used to recommend that people pin the RX userspace process to the
NAPI RX CPU, but based on my benchmarking I no longer do that.  At
least for UDP (after Paolo Abeni's optimizations) there is a
significant performance advantage to running the UDP receiver on
another CPU (in the range from 800Kpps to 2200Kpps).  (Plus, it avoids
the softirq starvation problem.)

Mellanox even has a perf tuning tool that explicitly moves the DMA
TX-completion IRQ to run on another CPU than RX.  Thus, I assume they
have evidence/benchmarks that show this as an advantage.

More recently I implemented XDP cpumap redirect, which explicitly
moves the raw page/frame to be handled on a remote CPU.  This is
mostly to move another MM alloc/free overhead away from the RX-CPU,
namely the SKB alloc/free overhead.

I'm working on an XDP return frame API, but for now, performance
depends on the page_frag recycle tricks (although, for the sake of
accuracy, it doesn't directly depend on the page_frag_cache API, but
on similar pagecnt_bias tricks).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
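
To give an idea of what the cpumap redirect mentioned above looks like
from the BPF side, here is a minimal sketch of such an XDP program.
The map size, the hard-coded target CPU and the program/section names
are illustrative assumptions, not copied from the kernel's samples/bpf
code; user space still has to load the program and populate the cpumap
slot with a queue size before frames will flow.

/* Minimal sketch: redirect received frames to a cpumap entry so the
 * rest of the processing happens on a remote CPU.
 * Compile with clang -target bpf; needs libbpf headers. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 64);		/* illustrative size */
	__type(key, __u32);
	__type(value, __u32);			/* per-CPU queue size */
} cpu_map SEC(".maps");

SEC("xdp")
int xdp_redirect_to_cpu(struct xdp_md *ctx)
{
	__u32 target_cpu = 2;	/* illustrative: this cpumap slot must
				 * have been populated from user space */

	return bpf_redirect_map(&cpu_map, target_cpu, 0);
}

char _license[] SEC("license") = "GPL";

The effect matching the description above: bpf_redirect_map() only
enqueues the raw frame towards the remote CPU, and the SKB is first
allocated over there, so the RX-CPU never pays the SKB alloc/free cost.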
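
As for the pagecnt_bias trick, the bare shape of it is: take a large
batch of page references up front, hand out fragments while only
decrementing a local non-atomic bias counter, and only touch the shared
atomic refcount when the page is exhausted.  Below is a simplified
userspace sketch of that idea; the names, constants and the malloc
backing are made up for illustration, and it is not the kernel's
page_frag_cache code.

#include <stdatomic.h>
#include <stdlib.h>

#define FRAG_PAGE_SIZE	4096u
#define FRAG_BIAS_MAX	(1u << 16)

/* Stand-in for struct page: a shared refcount plus the memory itself. */
struct frag_page {
	atomic_uint refcount;
	char data[FRAG_PAGE_SIZE];
};

struct frag_cache {
	struct frag_page *page;	/* current backing page */
	unsigned int bias;	/* references already "paid for" locally */
	unsigned int offset;	/* bump-allocation offset into data[] */
};

/* Drop @refs references; the last reference frees the page. */
static void frag_page_put(struct frag_page *pg, unsigned int refs)
{
	if (atomic_fetch_sub(&pg->refcount, refs) == refs)
		free(pg);
}

static struct frag_page *frag_cache_refill(struct frag_cache *c)
{
	struct frag_page *pg = malloc(sizeof(*pg));

	if (!pg)
		return NULL;
	/* ONE atomic store pays for FRAG_BIAS_MAX future fragments. */
	atomic_store(&pg->refcount, FRAG_BIAS_MAX);
	c->page = pg;
	c->bias = FRAG_BIAS_MAX;
	c->offset = 0;
	return pg;
}

/* Returns a fragment and, via @pgp, the page it lives in, so the
 * consumer can later drop its one reference with frag_page_put(pg, 1). */
void *frag_alloc(struct frag_cache *c, unsigned int size,
		 struct frag_page **pgp)
{
	void *frag;

	if (size > FRAG_PAGE_SIZE)
		return NULL;

	if (!c->page && !frag_cache_refill(c))
		return NULL;

	if (c->offset + size > FRAG_PAGE_SIZE) {
		/* Page used up: hand the unspent bias back in ONE atomic
		 * op and start over on a fresh page.  (With these numbers
		 * the bias always outlives the page: at most 4096 one-byte
		 * frags vs. 65536 bias references.) */
		frag_page_put(c->page, c->bias);
		c->page = NULL;
		if (!frag_cache_refill(c))
			return NULL;
	}

	c->bias--;			/* fast path: non-atomic, cache-hot */
	*pgp = c->page;
	frag = c->page->data + c->offset;
	c->offset += size;
	return frag;
}

A consumer gets the owning page back via *pgp and later drops its
single reference with frag_page_put(pg, 1); the allocation fast path
itself is only a non-atomic decrement plus a pointer bump, which is
where the recycle trick wins back the cross-CPU refcounting cost.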
> I understand your concern about the cacheline bouncing between the > freeing and allocating CPUs. Is cross-CPU freeing a frequent > occurrence? From looking at its current usage, it seemed like the > allocation and freeing were usually on the same CPU. While we/the-network-stack in many cases try to alloc and free on the same CPU. Then, in practical default setups it will be common case to alloc and free on different CPUs. The scheduler moves processes between CPUs, and irqbalance change which CPU does the DMA TX completion (in case of forwarding). I usually pin/align the NIC IRQs manually (via proc smp_affinity_list) and manually pin/taskset the userspace process (and makes sure to test both local/remote alloc/free cases when benchmarking). I used to recommend people to pin the RX userspace process to the NAPI RX CPU, but based on my benchmarking I no longer do that. At least for UDP (after Paolo Abeni's optimizations) then there is a significant performance advantage of running UDP receiver on another CPU (in the range from 800Kpps to 2200Kpps). (Plus it avoids the softirq starvation problem). Mellanox even have a perf tuning tool, that explicit moves the DMA TX-completion IRQ to run on another CPU than RX. Thus, I assume that they have evidence/benchmarks that show this as an advantage. More recently I implemented XDP cpumap redirect. Which explicitly moves the raw page/frame to be handled on a remote CPU. Mostly to move another MM alloc/free overhead away from the RX-CPU, which is the SKB alloc/free overhead. I'm working on a XDP return frame API, but for now, performance depend on the page_frag recycle tricks (although for the sake of accuracy it doesn't directly depend on page_frag_cache API, but similar pagecnt_bias tricks). -- Best regards, Jesper Dangaard Brouer MSc.CS, Principal Kernel Engineer at Red Hat LinkedIn: http://www.linkedin.com/in/brouer