On 3/23/19 9:55 PM, Alexei Starovoitov wrote:
> On Sat, Mar 23, 2019 at 09:40:23PM -0400, David Miller wrote:
>> From: David Ahern <dsah...@kernel.org>
>> Date: Fri, 22 Mar 2019 06:06:09 -0700
>>
>>> From: David Ahern <dsah...@gmail.com>
>>>
>>> The number of stubs is growing and has nothing to do with addrconf.
>>> Move the definition of the stubs to a separate header file and update
>>> users. In the move, drop the vxlan specific comment before ipv6_stub.
>>>
>>> Code move only; no functional change intended.
>>>
>>> Signed-off-by: David Ahern <dsah...@gmail.com>
>>
>> Eric, I fully support David's overall plan to make separate nexthop
>> objects as it will significantly empower the stack to do more sensible
>> things when links flap etc.
>
> let's agree to disagree.
> 'link flaps' were not mentioned in the cover letter for:
> "net: Improve route scalability via support for nexthop objects"
>
> The _only_ value of 86 patches is to align linux kernel routing
> with switch ASICs, because cumulus is trying to reuse iproute2
> to manage them.
> It was broken model to begin with and it keeps complicating routing
> when linux is used as a host while not achieving the goal of iproute2
> for switches.
> Can anyone use off the shelf linux to manage trident/tomahawk switches? Nope.
> brcm sdk is still necessary.
> nexthop objects are essential to configure enterprise switches.
> Clearly cumulus customers don't like iproute2 style because it's missing
> this feature, so David's proposal is to add that to the kernel.
> Even after kernel and iproute2 understand nexthop id the kernel is still
> not going to be competitive with switching os. The linux kernel is an OS
> to run on the host cpu and to run on a control plane cpu of a switch.
> That is all great, but the reasons to push routing into the kernel
> of control plane cpu were weak. It's not using these routes.
> Such architecture allowed temporary reuse of bgp daemons, but it fails to
> scale.
> No need to push route to the kernel when kernel won't use them.
> Hence an alternative proposal:
> - introduce hooks at netlink layer and steal back and forth messages
> from your favorite daemon without populating the kernel
> - same for iproute2 netlink interaction
>

The use case here is not just Cumulus or switchdev, but ANY OS using the
Linux API and the kernel to configure and manage its networking state
[1], and that includes XDP based use cases [2] and routing on the
host [3].

This is not about iproute2 driving networking deployments. This is about
continuing to remove the 'fails to scale' notion which *forces* a NOS
architecture away from the kernel databases as the single source of
truth and away from the kernel's IPC/notification mechanisms. The
subsequent impact of that choice is that it negates the Linux ecosystem,
forcing a customization of all of the software running in the control
plane to work in some vendor's custom environment. You should read the
paper I wrote last summer [1].

This current patch set is not just about link flaps, but about improving
the overall scaling properties of managing the FIB. This is about
leveraging existing ideas about network models and their scalability
properties and bringing that efficiency to Linux.

With nexthops, the time to insert routes is near constant regardless of
the number of nexthops in the route. So the time to insert a single path
route and the time to insert a route with 2, 4, 8, 16, 32, … nexthops is
the same. That is a HUGE scalability improvement from a simple idea.

The “near” constant is because of the need to expand nexthop definitions
in the route notifications to userspace to enable legacy applications to
work with this new API. In time, a lever can be added to not expand the
definitions and let the RTA_NHID alone point to them, allowing companies
who know that there are no legacy apps that need the nexthop expansion
to gain the full scaling improvements.

This change also enables many other key features:
1. IPv4 multipath routes are not evicted just because 1 hop goes down.
2. IPv6 multipath routes with device only nexthops (e.g., tunnels).
3. IPv6 nexthop with IPv4 route (aka, RFC 5549) which enables a more
   natural BGP unnumbered.
4. Lower memory consumption for IPv6 FIB entries which, unlike IPv4,
   have no sharing at all.
5. Atomic update of nexthop definitions with a single replace command,
   as opposed to replacing the N routes using it.

The list goes on, but 2-5 of the above were in the cover letter I sent
on March 14.

I have spent a lot of time over the last few years not just working on
features like VRF and MPLS, but improving the scaling properties of
Linux and removing this 'fails to scale' notion you and others hold.
This current patch set is just another step in that path.

[1] https://www.files.netdevconf.org/d/f982086fdd6946d9b596/
[2] http://vger.kernel.org/lpc_net2018_talks/dsa-xdp-kernel-tables-paper.pdf
[3] https://netdevconf.org/1.2/slides/oct7/01_ahern_microservice_net_vrf_on_host.pdf
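
As a rough illustration of the near-constant insert time and of points 4
and 5 above, here is a toy Python model of routes that reference a
shared nexthop object by id. It is only a sketch of the idea, not the
kernel implementation: the Fib class, its method names, the nexthop id
and the addresses are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Nexthop:
    gateway: str
    dev: str


@dataclass
class Fib:
    """Toy FIB (hypothetical model): routes point at a nexthop object by id."""
    nexthops: dict = field(default_factory=dict)  # nhid -> tuple of Nexthop
    routes: dict = field(default_factory=dict)    # prefix -> nhid

    def nexthop_replace(self, nhid, paths):
        # One write, no matter how many routes reference this id.
        self.nexthops[nhid] = tuple(paths)

    def route_add(self, prefix, nhid):
        if nhid not in self.nexthops:
            raise KeyError(f"nexthop id {nhid} does not exist")
        # The route stores only the id, so the work done here does not
        # grow with the number of paths behind that id.
        self.routes[prefix] = nhid

    def lookup(self, prefix):
        # Exact-match lookup only; a real FIB does longest-prefix match.
        return self.nexthops[self.routes[prefix]]


fib = Fib()
fib.nexthop_replace(10, [Nexthop("192.0.2.1", "eth0"),
                         Nexthop("192.0.2.2", "eth1")])

# 1000 routes share the single nexthop object with id 10.
for i in range(1000):
    fib.route_add(f"10.0.{i // 256}.{i % 256}/32", 10)

# Replace the nexthop definition once; every route sees the new paths.
fib.nexthop_replace(10, [Nexthop("192.0.2.3", "eth2")])
```

In this model the 1000 routes hold one shared nexthop definition rather
than 1000 copies, and changing the paths is one update rather than 1000
route replaces.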