On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0...@gmail.com> wrote:
I know this might be an old topic so bear with me – what we are facing
is that applications are sending small packets using hundreds of
threads, so the contention on the spin lock in __dev_xmit_skb increases
the latency of dev_queue_xmit significantly. We're building a network
QoS solution using HTB to avoid interference between different
applications.
Yes, as you have noticed with HTB there is a single qdisc lock, and
congestion obviously happens :-)
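To make that concrete, here is a minimal example of the kind of setup
that hits this bottleneck (device name and rates are just placeholders):
a single root HTB on a multiqueue NIC means every CPU transmitting on
the device serializes on the one root qdisc lock taken in
__dev_xmit_skb.

  # single root HTB: all TX queues/CPUs share one qdisc root lock
  tc qdisc add dev eth0 root handle 1: htb default 10
  tc class add dev eth0 parent 1: classid 1:10 htb rate 1gbit ceil 1gbit
  tc class add dev eth0 parent 1: classid 1:20 htb rate 500mbit ceil 1gbit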
It is possible with different tricks to make it scale. I believe
Google is using a variant of HTB, and it scales for them. They have
not open-sourced their modifications to HTB (which likely also involve
a great deal of setup tricks).
If your purpose is to limit traffic/bandwidth per "cloud" instance,
then you can just use another TC setup structure, like MQ with a
separate HTB instance assigned to each MQ queue (where the MQ queues
are bound to the CPU/HW queues). You have to figure out the details of
this setup yourself, but a rough sketch follows below...
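A rough, untested sketch of what I mean (device name, queue count and
rates are placeholders; note that tc class/handle IDs are hexadecimal):

  # assumes eth0 exposes 8 HW TX queues
  tc qdisc add dev eth0 root handle 1: mq
  # one independent HTB instance per MQ class / HW queue, each with its
  # own qdisc lock
  for i in 1 2 3 4 5 6 7 8; do
      tc qdisc add dev eth0 parent 1:$i handle ${i}0: htb default 10
      tc class add dev eth0 parent ${i}0: classid ${i}0:10 htb rate 100mbit ceil 100mbit
  done

Note the catch: the shaping limit is then per HW queue, so the
aggregate rate depends on how flows are spread across the queues --
that is part of the "setup tricks" you would have to work out.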
But in this case, when some applications send massive amounts of small
packets in parallel, the application to be protected still gets its
throughput affected (because it's doing synchronous network
communication using multiple threads, and its throughput is sensitive
to the increased latency).
Here is the profiling from perf:
  -  67.57%  iperf  [kernel.kallsyms]  [k] _spin_lock
     - 99.94% dev_queue_xmit
        - 96.91% _spin_lock
        - 2.62% __qdisc_run
           - 98.98% sch_direct_xmit
              - 99.98% _spin_lock
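(A profile like this can be collected with something along the lines of
the following while the iperf threads are running; exact options may
differ:

  perf record -a -g -- sleep 10   # sample all CPUs with call graphs for 10s
  perf report                     # here _spin_lock via dev_queue_xmit dominates
)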
As far as I understand, the design of TC is to simplify the locking
scheme and minimize the work done in __qdisc_run so that throughput
isn't affected, especially with large packets. However, if the classes
in the queueing discipline only have shaping limits, there isn't really
a necessary correlation between the different classes. The only
synchronization point should be when a packet is dequeued from the
qdisc queue and enqueued to the transmit queue of the device. My
question is: is it worth investing in avoiding the lock contention by
partitioning the queue/lock, so that this scenario is handled with
lower latency?
Yes, there is a lot to gain, but it is not easy ;-)
I have probably oversimplified a lot of details since I'm not familiar
with the TC implementation at this point – I just want your input on
whether this is a worthwhile effort, or whether there is something
fundamental that I'm not aware of. If it is just a matter of quite a
bit of additional work, I would also appreciate help outlining the
required work.
I would also appreciate any information about the latest status of
this work: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf
This article seems to be very low quality... spelling errors, only 5
pages, no real code, etc.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer