Currently, txq/qdisc selection is based on the flow hash, so packets from the same flow keep the order in which they entered the qdisc/txq, which avoids the out-of-order problem.
To improve the concurrency of the QoS algorithm, we plan to have multiple per-cpu queues for a single TC class and to busy-poll from a per-class thread to drain these queues. If we can do this frequently enough, the out-of-order situation caused by this polling thread should not be that bad. In more detail: in the send path we introduce per-cpu, per-class queues, so that packets from the same class and the same core are enqueued to the same place. A per-class thread then polls the queues belonging to its class on all CPUs and aggregates them into another per-class queue. This can effectively reduce contention, but it inevitably introduces a potential out-of-order issue. Any concerns or suggestions about working in this direction?
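
To make the ordering concern concrete, here is a minimal single-threaded Python simulation of the scheme described above. It is only a sketch, not kernel code: the names (`ClassQueues`, `drain_once`, `count_reordered`), the two-CPU setup, the `budget` parameter and the fixed CPU visiting order are all assumptions made for illustration. It shows that a flow which stays on one CPU keeps its order, while a flow whose sender moves to another CPU between two packets can be reordered if the poller does not run in between.

```python
from collections import deque

NUM_CPUS = 2  # hypothetical CPU count for this sketch


class ClassQueues:
    """Per-CPU queues plus one aggregate queue for a single TC class."""

    def __init__(self, num_cpus=NUM_CPUS):
        self.percpu = [deque() for _ in range(num_cpus)]
        self.aggregate = deque()

    def enqueue(self, cpu, pkt):
        # Send path: only the queue owned by the sending CPU is touched.
        self.percpu[cpu].append(pkt)

    def drain_once(self, budget=64):
        # One pass of the per-class poller (assumed policy): visit CPUs in
        # index order and move up to `budget` packets from each queue.
        moved = 0
        for q in self.percpu:
            for _ in range(min(budget, len(q))):
                self.aggregate.append(q.popleft())
                moved += 1
        return moved


def count_reordered(pkts):
    """Count packets seen after a higher sequence number of the same flow."""
    highest = {}
    reordered = 0
    for flow, seq in pkts:
        if seq < highest.get(flow, -1):
            reordered += 1
        else:
            highest[flow] = seq
    return reordered


def migrated_flow(drain_between):
    """flowA sends seq 0 from cpu 1, then seq 1 from cpu 0."""
    cq = ClassQueues()
    cq.enqueue(1, ("flowA", 0))
    if drain_between:
        cq.drain_once()  # poller runs before the second packet is queued
    cq.enqueue(0, ("flowA", 1))
    cq.drain_once()
    return list(cq.aggregate)


def pinned_flows():
    """Each flow stays on its own CPU, so per-flow order is preserved."""
    cq = ClassQueues()
    cq.enqueue(0, ("flowA", 0))
    cq.enqueue(1, ("flowB", 0))
    cq.enqueue(0, ("flowA", 1))
    cq.drain_once()
    return list(cq.aggregate)
```

In this model, `migrated_flow(drain_between=False)` yields seq 1 before seq 0, whereas `migrated_flow(drain_between=True)` and `pinned_flows()` keep per-flow order. This matches the intuition that frequent polling narrows the reordering window but does not remove it.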
