> -----Original Message-----
> From: Van Haaren, Harry
> Sent: Tuesday, May 3, 2022 8:38 PM
> To: Ilya Maximets <i.maxim...@ovn.org>; Richardson, Bruce <bruce.richard...@intel.com>
> Cc: Mcnamara, John <john.mcnam...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Maxime Coquelin <maxime.coque...@redhat.com>; Morten Brørup <m...@smartsharesystems.com>; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-d...@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>
> > -----Original Message-----
> > From: Ilya Maximets <i.maxim...@ovn.org>
> > Sent: Thursday, April 28, 2022 2:00 PM
> > To: Richardson, Bruce <bruce.richard...@intel.com>
> > Cc: i.maxim...@ovn.org; Mcnamara, John <john.mcnam...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-d...@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> > Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >
> > On 4/27/22 22:34, Bruce Richardson wrote:
> > > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> > >> On 4/20/22 18:41, Mcnamara, John wrote:
> > >>>> -----Original Message-----
> > >>>> From: Ilya Maximets <i.maxim...@ovn.org>
> > >>>> Sent: Friday, April 8, 2022 10:58 AM
> > >>>> To: Hu, Jiayu <jiayu...@intel.com>; Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce <bruce.richard...@intel.com>
> > >>>> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-d...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> > >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> > >>>>>
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Ilya Maximets <i.maxim...@ovn.org>
> > >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> > >>>>>> To: Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce <bruce.richard...@intel.com>
> > >>>>>> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> > >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>>>
> > >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> > >>>>>>> Hi Harry,
> > >>>>>>>
> > >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> > >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> > >>>>>>>>
> > >>>>>>>> Top-posting an overview of the discussion as replies to the thread become slower: perhaps it is a good time to review and plan for next steps?
> > >>>>>>>>
> > >>>>>>>> From my perspective, those most vocal in the thread seem to be in favour of the clean rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling the async DMA completions.  If there are any concerns opposing upstreaming of this method, please indicate this promptly, and we can continue technical discussions here now.
> > >>>>>>>
> > >>>>>>> Wasn't there some discussion about handling the Virtio completions with the DMA engine?  With that, we wouldn't need the deferral of work.
> > >>>>>>
> > >>>>>> +1
> > >>>>>>
> > >>>>>> With the virtio completions handled by the DMA itself, the vhost port turns almost into a real HW NIC.  With that we will not need any extra manipulations from the OVS side, i.e. no need to defer any work while maintaining a clear split between rx and tx operations.
> > >>>>>
> > >>>>> First, making DMA do the 2B copy would sacrifice performance, and I think we all agree on that.
> > >>>>
> > >>>> I do not agree with that.  Yes, a 2B copy by DMA will likely be slower than one done by the CPU; however, the CPU goes away for dozens or even hundreds of thousands of cycles to process a new packet batch or service other ports, hence the DMA will likely complete the transmission faster than waiting for the CPU thread to come back to that task.  In any case, this has to be tested.
> > >>>>
> > >>>>> Second, this method comes with an issue of ordering.  For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD thread1 enqueues 20 packets to vring0.  If PMD thread0 and thread1 have their own dedicated DMA devices dma0 and dma1, the flag/index update for the first 10 packets is done by dma0, and the flag/index update for the remaining 20 packets is done by dma1.  But there is no ordering guarantee among different DMA devices, so the flag/index update may be wrong.  If PMD threads don't have dedicated DMA devices, meaning DMA devices are shared among threads, we need a lock and have to pay for lock contention in the data-path.  Or we can allocate DMA devices to vrings dynamically to avoid DMA sharing among threads.  But what's the overhead of the allocation mechanism?  Who does it?  Any thoughts?
> > >>>>
> > >>>> 1. DMA completion was discussed in the context of per-queue allocation, so there is no re-ordering in this case.
> > >>>>
> > >>>> 2. Overhead can be minimal if the allocated device can stick to the queue for a reasonable amount of time without re-allocation on every send.  You may look at the XPS implementation in lib/dpif-netdev.c in OVS for an example of such a mechanism.  For sure it can not be the same, but ideas can be re-used.
> > >>>>
> > >>>> 3. Locking doesn't mean contention if resources are allocated/distributed thoughtfully.
> > >>>>
> > >>>> 4. Allocation can be done by either OVS or the vhost library itself; I'd vote for doing that inside the vhost library, so any DPDK application and the vhost ethdev can use it without re-inventing it from scratch.
> > >>>> It also should be simpler from the API point of view if allocation and usage are in the same place.  But I don't have a strong opinion here for now, since no real code examples exist, so it's hard to evaluate what they could look like.
> > >>>>
> > >>>> But I feel like we're starting to run in circles here, as I already said most of that before.
> > >>>
> > >>
> > >> Hi, John.
> > >>
> > >> Just reading this email as I was on PTO for the last 1.5 weeks and didn't get through all the emails yet.
> > >>
> > >>> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
> > >>
> > >> I guess there is a typo in the date here.  It seems to be the 26th, not the 29th.
> > >>
> > >>> It is not feasible to do a real world implementation/POC of every design proposal.
> > >>
> > >> FWIW, I think it makes sense to PoC and test options that are going to be simply unavailable going forward if not explored now.  Especially because we don't have any good solutions anyway ("Deferral of Work" is an architecturally wrong solution for OVS).
> > >>
> > >
> > > Hi Ilya,
> > >
> > > For those of us who haven't spent a long time working on OVS, can you perhaps explain a bit more as to why it is architecturally wrong?  From my experience with DPDK, use of any lookaside accelerator, not just DMA but any crypto, compression or otherwise, requires asynchronous operation, and therefore some form of setting work aside temporarily to do other tasks.
> >
> > OVS doesn't use any lookaside accelerators and doesn't have any infrastructure for them.
> >
> > Let me create a DPDK analogy of what is proposed for OVS.
> >
> > DPDK has an ethdev API that abstracts different device drivers for the application.  This API has a rte_eth_tx_burst() function that is supposed to send packets through a particular network interface.
> >
> > Imagine now that there is a network card that is not capable of sending packets right away and requires the application to come back later to finish the operation.  That is an obvious problem, because rte_eth_tx_burst() doesn't require any extra actions and doesn't take ownership of packets that weren't consumed.
> >
> > The proposed solution for this problem is to change the ethdev API:
> >
> > 1. Allow rte_eth_tx_burst() to return -EINPROGRESS, which effectively means that the packets were acknowledged, but not actually sent yet.
> >
> > 2. Require the application to call the new rte_eth_process_async() function sometime later, until it no longer returns -EINPROGRESS, in case the original rte_eth_tx_burst() call returned -EINPROGRESS.
> >
> > The main reason why this proposal is questionable:
> >
> > Only one specific device requires this special handling; all other devices are capable of sending packets right away.  However, every DPDK application would now have to implement some kind of "Deferral of Work" mechanism in order to be compliant with the updated DPDK ethdev API.
> >
> > Will DPDK make this API change?  I have no voice in DPDK API design decisions, but I'd argue against it.
> >
> > Interestingly, that's not really an imaginary proposal.  That is exactly the change required to the DPDK ethdev API in order to add vhost async support to the vhost ethdev driver.
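> >
> > To make the analogy concrete, a rough sketch of what an application loop would have to do under that imaginary API (illustrative only: the real rte_eth_tx_burst() returns the number of packets accepted and never returns -EINPROGRESS, and rte_eth_process_async() does not exist in DPDK; the helper names here are hypothetical):
> >
> >     /* Hypothetical "deferral of work" bookkeeping forced on the application. */
> >     int ret = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
> >     if (ret == -EINPROGRESS) {
> >         /* Packets were acknowledged but not actually sent yet, so the
> >          * application must remember to come back to this queue later. */
> >         defer_list_add(&deferred, port_id, queue_id);     /* hypothetical */
> >     }
> >
> >     /* Somewhere later in the main loop, for every deferred entry: */
> >     if (rte_eth_process_async(port_id, queue_id) != -EINPROGRESS) {
> >         defer_list_remove(&deferred, port_id, queue_id);  /* hypothetical */
> >     }
> >
> > Every application using the ethdev API would need some version of this bookkeeping, even though only one device type would ever return -EINPROGRESS.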
> >
> > Going back to OVS:
> >
> > An oversimplified architecture of OVS has 3 layers (top to bottom):
> >
> > 1. OFproto - the layer that handles OpenFlow.
> > 2. Datapath Interface - packet processing.
> > 3. Netdev - abstraction on top of all the different port types.
> >
> > Each layer has its own API that allows different implementations of the same layer to be used interchangeably without any modifications to higher layers.  That's what APIs and encapsulation are for.
> >
> > So, the Netdev layer has its own API, and this API is actually very similar to DPDK's ethdev API, simply because they serve the same purpose - abstraction on top of different network interfaces.  Besides different types of DPDK ports, there are also several types of native Linux, BSD and Windows ports, and a variety of different tunnel ports.
> >
> > The Datapath Interface layer is the "application" from the ethdev analogy above.
> >
> > What is proposed by the "Deferral of Work" solution is to make pretty much the same API change that I described, but to the netdev layer API inside OVS, and to introduce fairly complex (and questionable, but I'm not going into that right now) machinery into the Datapath Interface layer to handle that API change.
> >
> > So, exactly the same problem exists here:
> >
> > If the API change is needed only for a single port type in a very specific hardware environment, why do we need to change the common API and rework a lot of the code in the upper layers to accommodate that API change, while it makes no practical sense for any other port types or more generic hardware setups?  And similar changes will have to be made in any other DPDK application that is not bound to specific hardware but wants to support vhost async.
> >
> > The right solution, IMO, is to make vhost async behave like any other physical NIC, since it is essentially a physical NIC now (we're not using DMA directly, it's a combined vhost+DMA solution), instead of propagating the quirks of a single device into a common API.
> >
> > And going back to DPDK, this implementation doesn't allow the use of vhost async in DPDK's own vhost ethdev driver.
> >
> > My initial reply to the "Deferral of Work" RFC with pretty much the same concerns:
> > https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferri...@intel.com/#2751799
> >
> > Best regards, Ilya Maximets.
>
> Hi Ilya,
>
> Thanks for replying in more detail; understanding your perspective here helps to communicate the benefits and drawbacks of the various solutions.  Agreed, the OFproto/Dpif/Netdev abstraction layers are strong abstractions in OVS, and in general they serve their purpose.
>
> A key difference between OVS's usage of DPDK ethdev TX and vhost TX is that the performance of each is very different: as you know, sending a 1500-byte packet over a physical NIC, or via vhost into a guest, has a very different CPU cycle cost.  Typically DPDK Tx takes ~5% of CPU cycles, while vhost copies are often ~30%, but can be > 50% for certain packet sizes/configurations.
>
> Let's view the performance of the above example from the perspective of an actual deployment: OVS is very often deployed to provide an accelerated packet interface to a guest/VM via vhost/virtio.  Surely improving the performance of this primary use-case is a valid reason to consider changes and improvements to an internal abstraction layer in OVS?
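>
> For concreteness, the shape of the abstraction question looks roughly like this (a simplified sketch with illustrative names only; it is not the actual lib/netdev-provider.h definition nor the code from the proposed patches):
>
>     struct netdev_class {
>         /* Today: send() either transmits the batch or drops it, and the
>          * caller in dpif-netdev never has to come back to finish the job. */
>         int (*send)(struct netdev *netdev, int qid,
>                     struct dp_packet_batch *batch, bool concurrent_txq);
>
>         /* The debated addition: an optional completion hook.  Port types
>          * that transmit synchronously (physical NICs, tunnels, etc.) would
>          * leave it NULL; an async vhost port would need the datapath thread
>          * to keep calling it until earlier sends have fully completed. */
>         int (*process_async)(struct netdev *netdev, int qid);  /* hypothetical */
>     };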
>
> Today DPDK tx and vhost tx are called via the same netdev abstraction, but we must ask the questions:
>     - Is the netdev abstraction really the best it can be?
>     - Does adding an optional "async" feature to the abstraction improve performance significantly?  (a positive from including it?)
>     - Does adding the optional async feature cause actual degradation in DPIF implementations that don't support/use it?  (a negative from including it?)
>
> Of course strong abstractions are valuable, and of course changing them requires careful thought.  But let's be clear - it is probably fair to say that OVS is not deployed because it has good abstractions internally.  It is deployed because it is useful and serves the needs of an end-user.  And part of the end-user's needs is performance.
>
> The suggestion of integrating the "Defer Work" method of exposing async in the OVS datapath is well thought out, and a clean way of handling async work in a per-thread manner at the application layer.  It is the most common way of integrating lookaside acceleration into software pipelines, and handling the async work at the application thread level is the only logical place where the programmer can reason about the tradeoffs for a specific use-case.  Adding "DMA acceleration to vhost" will inevitably lead to compromises in the DPDK implementation, and ones that might (or might not) work for OVS and other apps.
>
> As you know, there have been OVS Conference presentations [1][2], RFCs and POCs [3][4][5][6], and community calls [7][8][9] on the topic.  In the various presentations, the benefits of using application-level deferral of work are highlighted and compared to other implementations which have undesirable side-effects.  We haven't heard any objections that people won't use OVS if the netdev abstraction is changed.
>
> It seems there is a trade-off decision to be made:
>     A) Change/improve the netdev abstraction to allow for async accelerations, and pay the cost of added app-layer complexity.
>     B) Demand that DMA acceleration is pushed down into vhost & below (as the netdev abstraction is not going to be changed), resulting in sub-par and potentially unusable code for any given app, as lower layers cannot reason about app-level specifics.
>
> How can we, the OVS/DPDK developers and users, make a decision here?
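>
> As a rough illustration of what option A means for the application layer (illustrative names only, not the code from the linked patches): each datapath thread keeps a small list of outstanding async sends and drains it as part of its normal polling loop, so the thread itself decides when to pay the completion-handling cost:
>
>     /* Hypothetical per-thread deferral loop inside the PMD main loop. */
>     struct deferred_send *d, *next;
>     for (d = pmd->deferred_head; d != NULL; d = next) {
>         next = d->next;
>         /* Re-check an earlier async tx (e.g. vhost + DMA); if all copies
>          * have landed, release the packets and drop the entry. */
>         if (deferred_send_poll(d)) {          /* hypothetical helper */
>             deferred_list_remove(pmd, d);     /* hypothetical helper */
>         }
>     }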
Ping on this topic - there's an ask here to find the best way to move forward, so input is welcome from everyone, and specifically from OVS maintainers & tech leaders.

> Regards, -Harry
>
> [1] https://www.openvswitch.org/support/ovscon2020/#C3
> [2] https://www.openvswitch.org/support/ovscon2021/#T12
> [3] rawdev; https://patchwork.ozlabs.org/project/openvswitch/patch/20201023094845.35652-2-sunil.pa...@intel.com/
> [4] defer work; http://patchwork.ozlabs.org/project/openvswitch/list/?series=261267&state=*
> [5] v3; http://patchwork.ozlabs.org/project/openvswitch/patch/20220104125242.1064162-2-sunil.pa...@intel.com/
> [6] v4; http://patchwork.ozlabs.org/project/openvswitch/patch/20220321173640.326795-2-sunil.pa...@intel.com/
> [7] Slides session 1; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022.pdf
> [8] Slides session 2; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf
> [9] Slides session 3; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/ovs_datapath_design_2022%20session%203.pdf