Thanks a lot for your reply! It was really helpful. I have a few comments; please see below.
On 2020-06-24 23:21, Samudrala, Sridhar wrote:
>
>
> On 6/17/2020 6:15 AM, Maxim Mikityanskiy wrote:
>> Hi,
>>
>> I discovered the Intel ADQ feature [1] that allows boosting
>> performance by picking dedicated queues for application traffic. We
>> did some research, and I got some level of understanding of how it
>> works, but I have some questions, and I hope you could answer them.
>>
>> 1. SO_INCOMING_NAPI_ID usage. In my understanding, every connection
>> has a key (sk_napi_id) that is unique to the NAPI where this
>> connection is handled, and the application uses that key to choose a
>> handler thread from the thread pool. If we have a one-to-one
>> relationship between application threads and NAPI IDs of connections,
>> each application thread will handle only traffic from a single NAPI.
>> Is my understanding correct?
>
> Yes. It is correct and recommended with the current implementation.
>
>>
>> 1.1. I wonder how the application thread gets scheduled on the same
>> core that NAPI runs at. It currently only works with busy_poll, so
>> when the application initiates busy polling (calls epoll), does the
>> Linux scheduler move the thread to the right CPU? Do we have to have
>> a strict one-to-one relationship between threads and NAPIs, or can
>> one thread handle multiple NAPIs? When the data arrives, does the
>> scheduler run the application thread on the same CPU that NAPI ran
>> on?
>
> The app thread can do busypoll from any core, and there is no
> requirement that the scheduler needs to move the thread to a specific
> CPU.
>
> If the NAPI processing happens via interrupts, the scheduler could
> move the app thread to the same CPU that NAPI ran on.
>
>>
>> 1.2. I see that SO_INCOMING_NAPI_ID is tightly coupled with
>> busy_poll. It is enabled only if CONFIG_NET_RX_BUSY_POLL is set. Is
>> there a real reason why it can't be used without busy_poll? In other
>> words, if we modify the kernel to drop this requirement, will the
>> kernel still schedule the application thread on the same CPU as NAPI
>> when busy_poll is not used?
>
> It should be OK to remove this restriction, but it requires enabling
> this in skb_mark_napi_id() and sk_mark_napi_id() too.
>
>>
>> 2. Can you compare ADQ to aRFS+XPS? aRFS provides a way to steer
>> traffic to the application's CPU in an automatic fashion, and
>> xps_rxqs can be used to transmit from the corresponding queues. This
>> setup doesn't need manual configuration of TCs and is not limited to
>> 4 applications. The difference of ADQ is that (in my understanding)
>> it moves the application to the RX CPU, while aRFS steers the
>> traffic to the RX queue handled by the application's CPU. Is there
>> any advantage of ADQ over aRFS that I failed to find?
>
> aRFS+XPS ties app thread to a cpu,

Well, not exactly. To pin the app thread to a CPU, one uses
taskset/sched_setaffinity, while aRFS+XPS picks a queue that
corresponds to that CPU.

> whereas ADQ ties app thread to a napi
> id which in turn ties to a queue(s)

So, basically, both technologies result in making NAPI and the app run
on the same CPU. The difference that I see is that ADQ forces NAPI
processing (in busy polling) on the app's CPU, while aRFS steers the
traffic to a queue whose NAPI runs on the app's CPU. The effect is the
same, but ADQ requires busy polling. Is my understanding correct?
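To make the model I have in mind (point 1 and the per-thread NAPI
mapping) concrete, here is a minimal sketch of the application side.
The NAPI-ID-to-CPU mapping is my assumption; I suppose in practice it
would be derived from the IRQ affinity of the queues in the set:

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <sys/socket.h>

#ifndef SO_INCOMING_NAPI_ID
#define SO_INCOMING_NAPI_ID 56  /* value from asm-generic/socket.h */
#endif

/* NAPI ID of the queue that delivered traffic for this socket,
 * or 0 if nothing has been received on it yet.
 */
unsigned int incoming_napi_id(int fd)
{
        unsigned int napi_id = 0;
        socklen_t len = sizeof(napi_id);

        if (getsockopt(fd, SOL_SOCKET, SO_INCOMING_NAPI_ID, &napi_id, &len))
                return 0;
        return napi_id;
}

/* Pin the calling worker thread to the CPU that serves its NAPI,
 * e.g. taken from the IRQ affinity of the corresponding queue
 * (how to pick this CPU is exactly what I'm asking about in 1.1).
 */
int pin_worker_to_cpu(int cpu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

The idea is that the accept loop calls incoming_napi_id() on each new
connection and hands the socket to the worker thread dedicated to that
NAPI ID, so each worker ends up busy polling a single NAPI.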
> ADQ also provides 2 levels of filtering compared to aRFS+XPS. The
> first level of filtering selects a queue-set associated with the
> application, and the second level filter or RSS will select a queue
> within that queue set associated with an app thread.

This difference looks important. So, ADQ reserves a dedicated set of
queues solely for the application's use.

> The current interface to configure ADQ limits us to support up to 16
> application specific queue sets (TC_MAX_QUEUE)

From the commit message:

https://patchwork.ozlabs.org/project/netdev/patch/20180214174539.11392-5-jeffrey.t.kirs...@intel.com/

I got that i40e supports up to 4 groups. Has this limitation been
lifted, or are you saying that 16 is the limitation of mqprio, while
the driver may support fewer? Or is it different for different Intel
drivers?

>
>
>>
>> 3. At [1], you mention that ADQ can be used to create separate RSS
>> sets. Could you elaborate about the API used? Does the tc mqprio
>> configuration also affect RSS? Can it be turned on/off?
>
> Yes. tc mqprio allows creating queue-sets per application, and the
> driver configures RSS per queue-set.
>
>>
>> 4. How is tc flower used in the context of ADQ? Does the user need
>> to reflect the configuration in both the mqprio qdisc (for TX) and
>> tc flower (for RX)? It looks like tc flower maps incoming traffic to
>> TCs, but what is the mechanism of mapping TCs to RX queues?
>
> tc mqprio is used to map TCs to RX queues

OK, I got how the configuration works now, thanks! Though I'm not sure
mqprio is the best API to configure the RX side. I thought it was
supposed to configure the TX queues. It looks more like a hack to me.
The ethtool RSS context API (look for "context" in man ethtool) seems
more appropriate for this purpose on the RX side.

Thanks,
Max

> tc flower is used to configure the first level of filter to redirect
> packets to a queue set associated with an application.
>
>>
>> I really hope you will be able to shed more light on this feature to
>> increase my awareness on how to use it and to compare it with aRFS.
>
> Hope this helps, and we will go over in more detail in our netdev
> session.
>
>>
>> Thanks,
>> Max
>>
>> [1]:
>> https://netdevconf.info/0x14/session.html?talk-ADQ-for-system-level-network-io-performance-improvements
>>
>>
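P.S. In case it helps to pinpoint what I meant in question 4, this is
the kind of configuration I had in mind, pieced together from the
mqprio/flower man pages and the ADQ talk. The interface name, queue
counts, address and port below are just placeholders, and I may well
be missing driver-specific steps (e.g. private flags that have to be
enabled first):

  # TC0: 4 queues at offset 0, TC1: 4 queues at offset 4; "mode channel"
  # asks the driver to create a dedicated queue set per TC.
  tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 1 1 1 1 \
          queues 4@0 4@4 hw 1 mode channel

  # First-level filter: steer the application's flows to TC1, i.e. to
  # its queue set; RSS then spreads them over the queues of that set.
  tc qdisc add dev eth0 clsact
  tc filter add dev eth0 protocol ip ingress prio 1 flower \
          dst_ip 192.168.1.1/32 ip_proto tcp dst_port 6379 \
          skip_sw hw_tc 1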