On Tue, Jun 26, 2018 at 9:16 PM, John Hurley <john.hur...@netronome.com> wrote:
> On Tue, Jun 26, 2018 at 3:57 PM, Or Gerlitz <ogerl...@mellanox.com> wrote:
>>> -------- Forwarded Message --------
>>> Subject: [PATCH 0/6] offload Linux LAG devices to the TC datapath
>>> Date: Thu, 21 Jun 2018 14:35:55 +0100
>>> From: John Hurley <john.hur...@netronome.com>
>>> To: d...@openvswitch.org, r...@mellanox.com, g...@mellanox.com,
>>> pa...@mellanox.com, f...@sysclose.org, simon.hor...@netronome.com
>>> CC: John Hurley <john.hur...@netronome.com>
>>>
>>> This patchset extends OvS TC and the linux-netdev implementation to
>>> support the offloading of Linux Link Aggregation devices (LAG) and
>>> their slaves. TC blocks are used to provide this offload. Blocks, in
>>> TC, group together a series of qdiscs. If a filter is added to one of
>>> these qdiscs then it is applied to all of them. Similarly, if a packet
>>> is matched on one of the grouped qdiscs then the stats for the entire
>>> block are increased. The basis of the LAG offload is that the LAG
>>> master (attached to the OvS bridge) and the slaves that may exist
>>> outside of OvS are all added to the same TC block. OvS can then
>>> control the filters and collect the stats on the slaves via its
>>> interaction with the LAG master.
>>>
>>> The TC API is extended within OvS to allow the addition of a block id
>>> to ingress qdisc adds. Block ids are then assigned to each LAG master
>>> that is attached to the OvS bridge. The linux netdev netlink socket is
>>> used to monitor slave devices. If a LAG slave is found whose master is
>>> on the bridge then it is added to the same block as its master. If the
>>> underlying slaves belong to an offloadable device then the Linux LAG
>>> device can be offloaded to hardware.
>>
>> Guys (J/J/J),
>>
>> Doing this here because
>>
>> a. this has impact on the kernel side of things
>>
>> b. I am more of a netdev and not openvswitch citizen..
>>
>> Some comments:
>>
>> 1. this + Jakub's patch for the reply really make a great design
>>
>> 2. Re the egress side of things: some NIC HW can't just use the LAG
>> as the egress port destination of an ACL (tc rule), so the HW rule
>> needs to be duplicated to both HW ports. In that case, do you see
>> the HW driver doing the duplication (:() or can we somehow make it
>> happen from user-space?
>>
>
> Hi Or,
> I'm not sure how rule duplication would work for rules that egress to
> a LAG device.
> Perhaps this could be done for an active/backup mode where user-space
> adds a rule to one port and deletes it from the other as appropriate.
> For load-balancing modes where the egress port is selected based on a
> hash of packet fields, it would be a lot more complicated.
> OvS can do this with its own bonds as far as I'm aware but (if
> recirculation is turned off) it basically creates exact-match datapath
> entries for each packet flow.
> Perhaps I do not fully understand your question?
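As a concrete illustration of the shared-block mechanism the cover letter
above describes, here is a minimal sketch using the iproute2 CLI (shared
blocks need kernel/iproute2 >= 4.16); bond0, eth0/eth1 and the block id
are placeholder names, not taken from the patchset:

    # one shared ingress block per LAG master attached to the OvS bridge
    tc qdisc add dev bond0 ingress_block 1 ingress
    # slaves whose master sits on the bridge join the same block
    tc qdisc add dev eth0 ingress_block 1 ingress
    tc qdisc add dev eth1 ingress_block 1 ingress
    # a filter added once to the block applies on all three qdiscs
    tc filter add block 1 protocol ip flower dst_ip 198.51.100.1 action drop
    # stats are aggregated across the whole block
    tc filter show block 1
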
Hi John,

Some NICs don't support egress LAG hashing, but they can still provide
HW high-availability (HA) and load-balancing (LB). Specifically, here we
are referring to getting HA and LB for VF netdevs without any action on
their side, once we bond the uplink representors, apply your patch to
OvS, and some more.. So the use-case I am targeting (1) does it with a
kernel bond/team, (2) uses the LAG/802.3ad mode of bonding/teaming, and
needs to duplicate rules where the egress is the bond.

>
>> 3. For the case of overlay networks, e.g. an OVS-based vxlan tunnel,
>> the ingress (decap) rule is set on the vxlan device. Jakub, you
>> mentioned a possible kernel patch to the HW (nfp, mlx5) drivers to
>> have them bind to the tunnel device for ingress rules. If we have an
>> agreed way to identify uplink representors, can we do that from ovs
>> too? Does it matter if we are bonding + encapsulating or just
>> encapsulating? Note that under the encap scheme the bond is typically
>> not part of the OVS bridge.
>>
>
> If we have a way to bind the HW drivers to tunnel devs for ingress
> rules then this should work fine with OvS (possibly requiring a small
> patch - I'd need to check).
>
> In terms of bonding + encap, this probably needs to be handled in the
> HW itself for the same reason I mentioned in point 2.

So we have two cases where the stack/OvS can't do the work and the HW
driver needs to act; let's try to improve on and reduce that..
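For reference, a minimal sketch of setting up the use-case described
above; p0/p1 (the uplink representors), pf0vf0 (a VF representor) and
br0 are assumed placeholder names:

    # enable TC offload of OvS datapath flows
    ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
    # kernel LAG in 802.3ad mode over the two uplink representors
    ip link add bond0 type bond mode 802.3ad
    ip link set p0 down && ip link set p0 master bond0
    ip link set p1 down && ip link set p1 master bond0
    ip link set bond0 up
    # only the bond (LAG master) is attached to the OvS bridge;
    # the slaves stay outside of OvS
    ovs-vsctl add-br br0
    ovs-vsctl add-port br0 bond0
    ovs-vsctl add-port br0 pf0vf0
    # with this patchset applied, OvS assigns a shared ingress block to
    # bond0 and, via its netlink monitor, adds p0/p1 to the same block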