Hi - I know this might be an old topic, so bear with me. What we are facing is that applications send small packets from hundreds of threads, and the contention on the spin lock taken in __dev_xmit_skb increases the latency of dev_queue_xmit significantly. We're building a network QoS solution that uses HTB to avoid interference between different applications. In this case, however, when some applications send massive numbers of small packets in parallel, the application we want to protect sees its throughput suffer, because it does synchronous network communication from multiple threads and its throughput is sensitive to the increased per-call latency.
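For context, my (possibly incomplete) understanding of where the serialization happens is roughly the following. This is only a simplified paraphrase of __dev_xmit_skb in net/core/dev.c, not the exact code from our kernel, and the details differ between versions:

/*
 * Simplified paraphrase of the qdisc enqueue path in net/core/dev.c.
 * One root lock covers the whole qdisc tree, so every sender thread on
 * this device serializes here no matter which HTB class its packet is
 * classified into.
 */
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 struct net_device *dev,
                                 struct netdev_queue *txq)
{
        spinlock_t *root_lock = qdisc_lock(q);
        int rc;

        spin_lock(root_lock);                    /* the _spin_lock in the profile below */
        rc = q->enqueue(skb, q) & NET_XMIT_MASK; /* classify + queue into HTB */
        qdisc_run(q);                            /* dequeue/shaping, still under root_lock */
        spin_unlock(root_lock);

        return rc;
}

With hundreds of threads sending tiny packets, almost all of the cost of dev_queue_xmit ends up being the wait for root_lock rather than the enqueue/dequeue work itself.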
Here is the profiling from perf:

    - 67.57% iperf  [kernel.kallsyms]  [k] _spin_lock
       - 99.94% dev_queue_xmit
            96.91% _spin_lock
          - 2.62% __qdisc_run
             - 98.98% sch_direct_xmit
                  99.98% _spin_lock
               1.01% _spin_lock

As far as I understand, the design of TC is to keep the locking scheme simple and to minimize the work done in __qdisc_run, so that throughput is not affected, especially with large packets. However, if the scenario is that the classes in the queueing discipline only carry shaping limits, there is no inherent dependency between the different classes. The only synchronization point should be when a packet is dequeued from the qdisc and enqueued to the transmit queue of the device.

My question is: is it worth investing in avoiding this lock contention by partitioning the queue/lock, so that this scenario can be handled with lower latency? I have surely oversimplified a lot of details, since I'm not familiar with the TC implementation at this point. I just want your input on whether this is a worthwhile effort, or whether there is something fundamental that I'm not aware of. If it is simply a matter of a fair amount of additional work, I would also appreciate help outlining what that work would involve. I would also appreciate any information on the latest status of this work: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

To make the partitioning idea a bit more concrete, I've put a very rough sketch below my signature.

Thanks,
Ke Ma
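Below is a purely hypothetical sketch of the kind of partitioning I have in mind. None of these names exist in the kernel, and I'm sure I'm glossing over classification, backlog accounting, ownership of qdisc_run and much more. The idea is only that sender threads stage packets on per-CPU queues, and a single context at a time drains them into the real HTB tree and runs the shaping work:

/* Hypothetical, invented types and names -- just to illustrate the idea. */
struct pcpu_stage {
        struct sk_buff_head skbs;           /* per-CPU staging queue */
};

struct partitioned_root {
        struct pcpu_stage __percpu *stage;  /* producers: one per CPU */
        struct Qdisc *htb;                  /* the real shaping tree */
        spinlock_t drain_lock;              /* only the drainer takes this */
};

/* Producer side: a sender thread only touches its own CPU's queue, so
 * hundreds of threads no longer contend on one root lock just to hand a
 * packet over.  (Local protection against preemption/softirq would still
 * be needed; omitted here.)
 */
static int partitioned_enqueue(struct sk_buff *skb, struct partitioned_root *pr)
{
        struct pcpu_stage *s = this_cpu_ptr(pr->stage);

        __skb_queue_tail(&s->skbs, skb);
        return NET_XMIT_SUCCESS;
}

/* Consumer side: one context drains the staged packets into HTB and runs
 * the dequeue/shaping work, so the expensive part stays serialized while
 * the hot per-packet enqueue path does not.
 */
static void partitioned_drain(struct partitioned_root *pr)
{
        int cpu;

        spin_lock(&pr->drain_lock);
        for_each_possible_cpu(cpu) {
                struct pcpu_stage *s = per_cpu_ptr(pr->stage, cpu);
                struct sk_buff *skb;

                while ((skb = __skb_dequeue(&s->skbs)) != NULL)
                        pr->htb->enqueue(skb, pr->htb);
        }
        __qdisc_run(pr->htb);
        spin_unlock(&pr->drain_lock);
}

Whether anything like this can coexist with the way sch_direct_xmit juggles the qdisc root lock and the device TX lock is exactly the part I'm unsure about.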