> -----Original Message-----
> From: Van Haaren, Harry
> Sent: Tuesday, May 3, 2022 8:38 PM
> To: Ilya Maximets <i.maxim...@ovn.org>; Richardson, Bruce <bruce.richard...@intel.com>
> Cc: Mcnamara, John <john.mcnam...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Maxime Coquelin <maxime.coque...@redhat.com>; Morten Brørup <m...@smartsharesystems.com>; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-d...@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> Subject: RE: OVS DPDK DMA-Dev library/Design Discussion
>
> > -----Original Message-----
> > From: Ilya Maximets <i.maxim...@ovn.org>
> > Sent: Thursday, April 28, 2022 2:00 PM
> > To: Richardson, Bruce <bruce.richard...@intel.com>
> > Cc: i.maxim...@ovn.org; Mcnamara, John <john.mcnam...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-d...@openvswitch.org; dev@dpdk.org; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> > Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> >
> > On 4/27/22 22:34, Bruce Richardson wrote:
> > > On Mon, Apr 25, 2022 at 11:46:01PM +0200, Ilya Maximets wrote:
> > >> On 4/20/22 18:41, Mcnamara, John wrote:
> > >>>> -----Original Message-----
> > >>>> From: Ilya Maximets <i.maxim...@ovn.org>
> > >>>> Sent: Friday, April 8, 2022 10:58 AM
> > >>>> To: Hu, Jiayu <jiayu...@intel.com>; Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce <bruce.richard...@intel.com>
> > >>>> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-d...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> > >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>> On 4/8/22 09:13, Hu, Jiayu wrote:
> > >>>>>
> > >>>>>
> > >>>>>> -----Original Message-----
> > >>>>>> From: Ilya Maximets <i.maxim...@ovn.org>
> > >>>>>> Sent: Thursday, April 7, 2022 10:40 PM
> > >>>>>> To: Maxime Coquelin <maxime.coque...@redhat.com>; Van Haaren, Harry <harry.van.haa...@intel.com>; Morten Brørup <m...@smartsharesystems.com>; Richardson, Bruce <bruce.richard...@intel.com>
> > >>>>>> Cc: i.maxim...@ovn.org; Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; ovs-...@openvswitch.org; dev@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
> > >>>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>>>
> > >>>>>> On 4/7/22 16:25, Maxime Coquelin wrote:
> > >>>>>>> Hi Harry,
> > >>>>>>>
> > >>>>>>> On 4/7/22 16:04, Van Haaren, Harry wrote:
> > >>>>>>>> Hi OVS & DPDK, Maintainers & Community,
> > >>>>>>>>
> > >>>>>>>> Top-posting an overview of the discussion as replies to the thread become slower: perhaps it is a good time to review and plan for next steps?
> > >>>>>>>>
> > >>>>>>>> From my perspective, those most vocal in the thread seem to be in favour of the clean rx/tx split ("defer work"), with the tradeoff that the application must be aware of handling the async DMA completions.  If there are any concerns opposing upstreaming of this method, please indicate this promptly, and we can continue technical discussions here now.
> > >>>>>>>
> > >>>>>>> Wasn't there some discussion about handling the Virtio completions with the DMA engine?  With that, we wouldn't need the deferral of work.
> > >>>>>>
> > >>>>>> +1
> > >>>>>>
> > >>>>>> With the virtio completions handled by the DMA itself, the vhost port turns almost into a real HW NIC.  With that we will not need any extra manipulations from the OVS side, i.e. no need to defer any work while maintaining a clear split between rx and tx operations.
> > >>>>>
> > >>>>> First, making DMA do the 2B copy would sacrifice performance, and I think we all agree on that.
> > >>>>
> > >>>> I do not agree with that.  Yes, a 2B copy by DMA will likely be slower than one done by the CPU; however, the CPU goes away for dozens or even hundreds of thousands of cycles to process a new packet batch or service other ports, hence the DMA will likely complete the transmission faster than waiting for the CPU thread to come back to that task.  In any case, this has to be tested.
> > >>>>
> > >>>>> Second, this method comes with an issue of ordering.  For example, PMD thread0 enqueues 10 packets to vring0 first, then PMD thread1 enqueues 20 packets to vring0.  If PMD thread0 and thread1 have their own dedicated DMA devices dma0 and dma1, the flag/index update for the first 10 packets is done by dma0, and the flag/index update for the remaining 20 packets is done by dma1.  But there is no ordering guarantee among different DMA devices, so the flag/index update may be wrong.  If PMD threads don't have dedicated DMA devices, meaning DMA devices are shared among threads, we need a lock and have to pay for lock contention in the data-path.  Or we can allocate DMA devices to vrings dynamically to avoid DMA sharing among threads.  But what's the overhead of the allocation mechanism?  Who does it?  Any thoughts?
> > >>>>
> > >>>> 1. DMA completion was discussed in the context of per-queue allocation, so there is no re-ordering in this case.
> > >>>>
> > >>>> 2. Overhead can be minimal if the allocated device can stick to the queue for a reasonable amount of time without re-allocation on every send.  You may look at the XPS implementation in lib/dpif-netdev.c in OVS for an example of such a mechanism.  For sure it can not be the same, but ideas can be re-used.
> > >>>>
> > >>>> 3. Locking doesn't mean contention if resources are allocated/distributed thoughtfully.
> > >>>>
> > >>>> 4. Allocation can be done by either OVS or the vhost library itself; I'd vote for doing that inside the vhost library, so any DPDK application and the vhost ethdev can use it without re-inventing it from scratch.
> > >>>> It also should be simpler from the API point of view if allocation and usage are in the same place.  But I don't have a strong opinion here for now, since no real code examples exist, so it's hard to evaluate what they could look like.
> > >>>>
> > >>>> But I feel like we're starting to run in circles here, as I already said most of that before.
> > >>>
> > >>
> > >> Hi, John.
> > >>
> > >> Just reading this email as I was on PTO for the last 1.5 weeks and didn't get through all the emails yet.
> > >>
> > >>> This does seem to be going in circles, especially since there seemed to be technical alignment on the last public call on March 29th.
> > >>
> > >> I guess there is a typo in the date here.  It seems to be the 26th, not the 29th.
> > >>
> > >>> It is not feasible to do a real world implementation/POC of every design proposal.
> > >>
> > >> FWIW, I think it makes sense to PoC and test options that are going to be simply unavailable going forward if not explored now.  Especially because we don't have any good solutions anyway ("Deferral of Work" is an architecturally wrong solution for OVS).
> > >>
> > >
> > > Hi Ilya,
> > >
> > > For those of us who haven't spent a long time working on OVS, can you perhaps explain a bit more as to why it is architecturally wrong?  From my experience with DPDK, use of any lookaside accelerator, not just DMA but any crypto, compression or otherwise, requires asynchronous operation, and therefore some form of setting work aside temporarily to do other tasks.
> >
> > OVS doesn't use any lookaside accelerators and doesn't have any infrastructure for them.
> >
> > Let me create a DPDK analogy of what is proposed for OVS.
> >
> > DPDK has an ethdev API that abstracts different device drivers for the application.  This API has a rte_eth_tx_burst() function that is supposed to send packets through a particular network interface.
> >
> > Imagine now that there is a network card that is not capable of sending packets right away and requires the application to come back later to finish the operation.  That is an obvious problem, because rte_eth_tx_burst() doesn't require any extra actions and doesn't take ownership of packets that weren't consumed.
> >
> > The proposed solution for this problem is to change the ethdev API:
> >
> > 1. Allow rte_eth_tx_burst() to return -EINPROGRESS, which effectively means that the packets were acknowledged, but not actually sent yet.
> >
> > 2. Require the application to call the new rte_eth_process_async() function sometime later, until it no longer returns -EINPROGRESS, in case the original rte_eth_tx_burst() call returned -EINPROGRESS.
> >
> > The main reason why this proposal is questionable:
> >
> > Only one specific device requires this special handling; all other devices are capable of sending packets right away.  However, every DPDK application would now have to implement some kind of "Deferral of Work" mechanism in order to be compliant with the updated DPDK ethdev API.
> >
> > Will DPDK make this API change?  I have no voice in DPDK API design decisions, but I'd argue against it.
> >
> > Interestingly, that's not really an imaginary proposal.  That is exactly the change required to the DPDK ethdev API in order to add vhost async support to the vhost ethdev driver.
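> >
> > To make the analogy concrete, a rough sketch of what an application loop would have to do under that imaginary API (illustrative only: the real rte_eth_tx_burst() returns the number of packets accepted and never returns -EINPROGRESS, and rte_eth_process_async() does not exist in DPDK; the helper names here are hypothetical):
> >
> >     /* Hypothetical "deferral of work" bookkeeping forced on the application. */
> >     int ret = rte_eth_tx_burst(port_id, queue_id, pkts, nb_pkts);
> >     if (ret == -EINPROGRESS) {
> >         /* Packets were acknowledged but not actually sent yet, so the
> >          * application must remember to come back to this queue later. */
> >         defer_list_add(&deferred, port_id, queue_id);     /* hypothetical */
> >     }
> >
> >     /* Somewhere later in the main loop, for every deferred entry: */
> >     if (rte_eth_process_async(port_id, queue_id) != -EINPROGRESS) {
> >         defer_list_remove(&deferred, port_id, queue_id);  /* hypothetical */
> >     }
> >
> > Every application using the ethdev API would need some version of this bookkeeping, even though only one device type would ever return -EINPROGRESS.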
> >
> > Going back to OVS:
> >
> > An oversimplified architecture of OVS has 3 layers (top to bottom):
> >
> > 1. OFproto - the layer that handles OpenFlow.
> > 2. Datapath Interface - packet processing.
> > 3. Netdev - abstraction on top of all the different port types.
> >
> > Each layer has its own API that allows different implementations of the same layer to be used interchangeably without any modifications to higher layers.  That's what APIs and encapsulation are for.
> >
> > So, the Netdev layer has its own API, and this API is actually very similar to DPDK's ethdev API, simply because they serve the same purpose - abstraction on top of different network interfaces.  Besides different types of DPDK ports, there are also several types of native Linux, BSD and Windows ports, and a variety of different tunnel ports.
> >
> > The Datapath Interface layer is the "application" from the ethdev analogy above.
> >
> > What is proposed by the "Deferral of Work" solution is to make pretty much the same API change that I described, but to the netdev layer API inside OVS, and to introduce fairly complex (and questionable, but I'm not going into that right now) machinery into the Datapath Interface layer to handle that API change.
> >
> > So, exactly the same problem exists here:
> >
> > If the API change is needed only for a single port type in a very specific hardware environment, why do we need to change the common API and rework a lot of the code in the upper layers to accommodate that API change, while it makes no practical sense for any other port types or more generic hardware setups?  And similar changes will have to be made in any other DPDK application that is not bound to specific hardware but wants to support vhost async.
> >
> > The right solution, IMO, is to make vhost async behave like any other physical NIC, since it is essentially a physical NIC now (we're not using DMA directly, it's a combined vhost+DMA solution), instead of propagating the quirks of a single device into a common API.
> >
> > And going back to DPDK, this implementation doesn't allow the use of vhost async in DPDK's own vhost ethdev driver.
> >
> > My initial reply to the "Deferral of Work" RFC with pretty much the same concerns:
> > https://patchwork.ozlabs.org/project/openvswitch/patch/20210907111725.43672-2-cian.ferri...@intel.com/#2751799
> >
> > Best regards, Ilya Maximets.
>
> Hi Ilya,
>
> Thanks for replying in more detail; understanding your perspective here helps to communicate the benefits and drawbacks of the various solutions.  Agreed, the OFproto/Dpif/Netdev abstraction layers are strong abstractions in OVS, and in general they serve their purpose.
>
> A key difference between OVS's usage of DPDK ethdev TX and vhost TX is that the performance of each is very different: as you know, sending a 1500-byte packet over a physical NIC, or via vhost into a guest, has a very different CPU cycle cost.  Typically DPDK Tx takes ~5% of CPU cycles, while vhost copies are often ~30%, but can be > 50% for certain packet sizes/configurations.
>
> Let's view the performance of the above example from the perspective of an actual deployment: OVS is very often deployed to provide an accelerated packet interface to a guest/VM via vhost/virtio.  Surely improving the performance of this primary use-case is a valid reason to consider changes and improvements to an internal abstraction layer in OVS?
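>
> For concreteness, the shape of the abstraction question looks roughly like this (a simplified sketch with illustrative names only; it is not the actual lib/netdev-provider.h definition nor the code from the proposed patches):
>
>     struct netdev_class {
>         /* Today: send() either transmits the batch or drops it, and the
>          * caller in dpif-netdev never has to come back to finish the job. */
>         int (*send)(struct netdev *netdev, int qid,
>                     struct dp_packet_batch *batch, bool concurrent_txq);
>
>         /* The debated addition: an optional completion hook.  Port types
>          * that transmit synchronously (physical NICs, tunnels, etc.) would
>          * leave it NULL; an async vhost port would need the datapath thread
>          * to keep calling it until earlier sends have fully completed. */
>         int (*process_async)(struct netdev *netdev, int qid);  /* hypothetical */
>     };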
>
> Today DPDK tx and vhost tx are called via the same netdev abstraction, but we must ask the questions:
>     - Is the netdev abstraction really the best it can be?
>     - Does adding an optional "async" feature to the abstraction improve performance significantly?  (a positive from including it?)
>     - Does adding the optional async feature cause actual degradation in DPIF implementations that don't support/use it?  (a negative from including it?)
>
> Of course strong abstractions are valuable, and of course changing them requires careful thought.  But let's be clear - it is probably fair to say that OVS is not deployed because it has good abstractions internally.  It is deployed because it is useful and serves the needs of an end-user.  And part of the end-user's needs is performance.
>
> The suggestion of integrating the "Defer Work" method of exposing async in the OVS datapath is well thought out, and a clean way of handling async work in a per-thread manner at the application layer.  It is the most common way of integrating lookaside acceleration into software pipelines, and handling the async work at the application thread level is the only logical place where the programmer can reason about the tradeoffs for a specific use-case.  Adding "DMA acceleration to vhost" will inevitably lead to compromises in the DPDK implementation, and ones that might (or might not) work for OVS and other apps.
>
> As you know, there have been OVS Conference presentations [1][2], RFCs and POCs [3][4][5][6], and community calls [7][8][9] on the topic.  In the various presentations, the benefits of using application-level deferral of work are highlighted and compared to other implementations which have undesirable side-effects.  We haven't heard any objections that people won't use OVS if the netdev abstraction is changed.
>
> It seems there is a trade-off decision to be made:
>     A) Change/improve the netdev abstraction to allow for async accelerations, and pay the cost of added app-layer complexity.
>     B) Demand that DMA acceleration is pushed down into vhost & below (as the netdev abstraction is not going to be changed), resulting in sub-par and potentially unusable code for any given app, as lower layers cannot reason about app-level specifics.
>
> How can we, the OVS/DPDK developers and users, make a decision here?
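>
> As a rough illustration of what option A means for the application layer (illustrative names only, not the code from the linked patches): each datapath thread keeps a small list of outstanding async sends and drains it as part of its normal polling loop, so the thread itself decides when to pay the completion-handling cost:
>
>     /* Hypothetical per-thread deferral loop inside the PMD main loop. */
>     struct deferred_send *d, *next;
>     for (d = pmd->deferred_head; d != NULL; d = next) {
>         next = d->next;
>         /* Re-check an earlier async tx (e.g. vhost + DMA); if all copies
>          * have landed, release the packets and drop the entry. */
>         if (deferred_send_poll(d)) {          /* hypothetical helper */
>             deferred_list_remove(pmd, d);     /* hypothetical helper */
>         }
>     }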
Ping on this topic - there's an ask here to find the best way to move forward, so input is welcome from everyone, and specifically from OVS maintainers & tech leaders.

> Regards, -Harry
>
> [1] https://www.openvswitch.org/support/ovscon2020/#C3
> [2] https://www.openvswitch.org/support/ovscon2021/#T12
> [3] rawdev; https://patchwork.ozlabs.org/project/openvswitch/patch/20201023094845.35652-2-sunil.pa...@intel.com/
> [4] defer work; http://patchwork.ozlabs.org/project/openvswitch/list/?series=261267&state=*
> [5] v3; http://patchwork.ozlabs.org/project/openvswitch/patch/20220104125242.1064162-2-sunil.pa...@intel.com/
> [6] v4; http://patchwork.ozlabs.org/project/openvswitch/patch/20220321173640.326795-2-sunil.pa...@intel.com/
> [7] Slides session 1; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022.pdf
> [8] Slides session 2; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/OVS%20vhost%20async%20datapath%20design%202022%20session%202.pdf
> [9] Slides session 3; https://github.com/Sunil-Pai-G/OVS-DPDK-presentation-share/raw/main/ovs_datapath_design_2022%20session%203.pdf