On Mon, 2017-11-13 at 14:47 -0800, Alexander Duyck wrote:
> On Mon, Nov 13, 2017 at 10:17 AM, Michael Ma <make0...@gmail.com> wrote:
> > 2017-11-12 16:14 GMT-08:00 Stephen Hemminger <step...@networkplumber.org>:
> >> On Sun, 12 Nov 2017 13:43:13 -0800
> >> Michael Ma <make0...@gmail.com> wrote:
> >>
> >>> Any comments? We plan to implement this as a qdisc and appreciate any
> >>> early feedback.
> >>>
> >>> Thanks,
> >>> Michael
> >>>
> >>> > On Nov 9, 2017, at 5:20 PM, Michael Ma <make0...@gmail.com> wrote:
> >>> >
> >>> > Currently txq/qdisc selection is based on flow hash, so packets from
> >>> > the same flow keep their order when they enter the qdisc/txq, which
> >>> > avoids the out-of-order problem.
> >>> >
> >>> > To improve the concurrency of the QoS algorithm, we plan to have
> >>> > multiple per-cpu queues for a single TC class and do busy polling
> >>> > from a per-class thread to drain these queues. If we can do this
> >>> > frequently enough, the out-of-order situation in this polling thread
> >>> > should not be that bad.
> >>> >
> >>> > To give more details - in the send path we introduce per-cpu
> >>> > per-class queues so that packets from the same class and same core
> >>> > are enqueued to the same place. Then a per-class thread polls the
> >>> > queues belonging to its class from all the cpus and aggregates them
> >>> > into another per-class queue. This can effectively reduce contention
> >>> > but inevitably introduces a potential out-of-order issue.
> >>> >
> >>> > Any concern/suggestion for working towards this direction?
> >>
> >> In general, there are no meta design discussions in Linux development.
> >> Several developers have tried to do lockless
> >> qdisc and similar things in the past.
> >>
> >> The devil is in the details, show us the code.
> >
> > Thanks for the response, Stephen. The code is fairly straightforward;
> > we have a per-cpu per-class queue defined as this:
> >
> > struct bandwidth_group
> > {
> >     struct skb_list queues[MAX_CPU_COUNT];
> >     struct skb_list drain;
> > };
> >
> > The "drain" queue is used to aggregate the per-cpu queues belonging to
> > the same class. In the enqueue function, we determine the cpu where the
> > packet is processed and enqueue it to the corresponding per-cpu queue:
> >
> > int cpu;
> > struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
> >
> > cpu = get_cpu();
> > skb_list_append(&bwg->queues[cpu], skb);
> > put_cpu();
> >
> > Here we don't check the flow of the packet, so if there is task
> > migration, or multiple threads sending packets through the same flow,
> > we can theoretically have packets enqueued to different queues and
> > aggregated to the "drain" queue out of order.
> >
> > Also, AFAIK there is no lockless htb-like qdisc implementation
> > currently; however, if there is already a similar effort ongoing,
> > please let me know.
>
> The question I would have is how this would differ from using XPS w/
> mqprio? Would this be a classful qdisc like HTB or a classless one
> like mqprio?
>
> From what I can tell, XPS would be able to get you your per-cpu
> functionality; the benefit of it, though, is that it avoids
> out-of-order issues for sockets originating on the local system. The
> only thing I see as an issue right now is that the rate limiting with
> mqprio is assumed to be handled in hardware via mechanisms such as
> DCB.
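For concreteness, here is a minimal, self-contained sketch of the enqueue
path described in the quoted fragment. skb_list, skb_list_append(),
bw_rx_groups, bwgid and the class count are hypothetical names taken or
extrapolated from the quote, not existing kernel APIs; the sketch backs
skb_list with the kernel's sk_buff_head so each per-cpu queue carries its
own lock, and includes the put_cpu() needed to re-enable preemption:

#include <linux/skbuff.h>
#include <linux/smp.h>

#define MAX_CPU_COUNT	NR_CPUS
#define NUM_CLASSES	16	/* assumption: one bandwidth_group per TC class */

struct skb_list {
	struct sk_buff_head head;	/* internally spinlock-protected */
};

struct bandwidth_group {
	struct skb_list queues[MAX_CPU_COUNT];	/* per-cpu ingress queues */
	struct skb_list drain;			/* per-class aggregate queue */
};

static struct bandwidth_group bw_rx_groups[NUM_CLASSES];

static void skb_list_append(struct skb_list *l, struct sk_buff *skb)
{
	skb_queue_tail(&l->head, skb);
}

static void bwg_init(struct bandwidth_group *bwg)
{
	int cpu;

	for (cpu = 0; cpu < MAX_CPU_COUNT; cpu++)
		skb_queue_head_init(&bwg->queues[cpu].head);
	skb_queue_head_init(&bwg->drain.head);
}

/* Enqueue skb on the local cpu's queue for class bwgid. */
static void bwg_enqueue(struct sk_buff *skb, int bwgid)
{
	struct bandwidth_group *bwg = &bw_rx_groups[bwgid];
	int cpu;

	cpu = get_cpu();	/* disables preemption: cpu id stays valid */
	skb_list_append(&bwg->queues[cpu], skb);
	put_cpu();
}

Because only the local cpu produces into queues[cpu] and only the class's
drain thread consumes from it, each sk_buff_head lock is contended by at
most two parties, which is where the claimed contention reduction would
come from.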
I think one of the key points was: "do busy polling from a per-class thread to drain these queues." I mentioned this idea for the TX path in: https://netdevconf.org/2.1/slides/apr6/dumazet-BUSY-POLLING-Netdev-2.1.pdf
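A rough sketch of such a per-class busy-polling drain thread, building on
the hypothetical bandwidth_group layout sketched above (bwg_drain_thread
and the splice-into-drain policy are assumptions, not code from the
slides): the thread sweeps every cpu's queue and moves packets into the
class-wide "drain" queue, preserving the order within each per-cpu queue.

#include <linux/cpumask.h>
#include <linux/kthread.h>
#include <linux/sched.h>
#include <linux/skbuff.h>

static int bwg_drain_thread(void *arg)
{
	struct bandwidth_group *bwg = arg;

	while (!kthread_should_stop()) {
		bool busy = false;
		int cpu;

		for_each_possible_cpu(cpu) {
			struct sk_buff *skb;

			/* Move this cpu's backlog into the aggregate queue. */
			while ((skb = skb_dequeue(&bwg->queues[cpu].head)) != NULL) {
				skb_queue_tail(&bwg->drain.head, skb);
				busy = true;
			}
		}

		if (!busy)
			cond_resched();	/* yield instead of spinning when idle */
	}
	return 0;
}

One such thread would be started per class, e.g.
kthread_run(bwg_drain_thread, &bw_rx_groups[i], "bwg-drain/%d", i);
the more often it completes a full sweep, the smaller the cross-cpu
reordering window that the original posting is worried about.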