Andres, I'm not going to be at the sprint, but the problems described need a proper solution in MAAS and Juju, at least from the end-host perspective. Similar to how VLANs are supported natively in MAAS & Juju, L3 virtualization technologies like VRF should be as well. I hope the information below is enough to convey the use cases and the past experience in this field.
The concept is very similar to VLANs, but at L3, which is probably less familiar. A virtual L3 network spans many hosts and routers/L3 switches within a single organization instead of being tied to a given switch fabric, and either the same process or a group of processes on a host needs to (1) receive & respond and (2) send data using different L3 topologies. Instead of virtual broadcast domains you get virtual paths, because each virtual L3 network has its own routing topology. Good L2 analogies are Multiple Spanning Tree Protocol (MSTP) or PVST+, which were created to avoid blocking switch ports depending on the logical L2 topology of a VLAN or group of VLANs (this is hidden at L2, though - no end-host modifications are required).

The use cases I am talking about are not new - they just were not used as much in data center networks until a certain point. They have been used in service provider networks for multi-site L3 VPNs for many years (https://tools.ietf.org/html/rfc4364). There are still many deployments which rely on large L2 domains where those problems do not occur as much, because routing is done trivially via directly connected routes and ARP broadcasts (in most cases there is never a hop between a source and a destination host).

I may be wrong, but it seems to me that Network Spaces were originally designed with multi-homing in mind but with limited support for multi-L2 and routing (no judgement - VRFs are fairly new to the Linux kernel). They are not that far from supporting it, though, thanks to the recent upstream kernel work.

With leaf-spine you are building a complex L3 network with different virtual topologies for different purposes and different SLAs for various kinds of traffic (IOW, a multi-tenant network). This is a typical service provider scenario with different customers on a shared infrastructure: you need to build many parallel dedicated communication lines, but since the infrastructure is shared this is not possible physically. You still need to do load-sharing across links, use distinct paths for different kinds of traffic and apply other optimizations to make sure your physical links are utilized and clients get a certain quality of service and are separated from each other. In this case L3 VPNs are built not for clients (companies "x" and "y") but for different purposes: general purpose data, storage access or replication, management, public API traffic (originally this was done for voice/video/data, see the first two paragraphs of the "Background" section in https://www.google.ch/patents/US8457117).

I can describe this in many ways, i.e. we need:

* multi-point L3VPN between racks to simulate L3 virtual circuits/pseudowires for different types of traffic;
* virtual routing domains (VRFs);
* traffic and routing separation for multi-L2-segment networks;
* L3 network multi-tenancy.

This is definitely not new, although the service provider concepts may be less familiar:

1) Static routes + VLSM - DIY routing; doesn't scale and is difficult to manage when a deployment grows beyond the original VLSM design;
2) VRF-lite (VRF without MPLS) - separate address spaces and routing tables for different traffic on routers and, potentially, hosts, with interface-based selection of a VRF on a given network device (see the sketch right after this list);
3) MPLS - this is like VXLAN for virtual L3 networks. In a service provider network two MPLS labels are used: one for VRF identification and another one for next-hop router identification (in a data center network think of an internal or public API label, a storage access label, a storage replication label etc.).
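To make (2) concrete, here is a minimal VRF-lite sketch on a single Linux host; the device name, table id and addresses are illustrative, and it assumes the kernel/iproute2 bits discussed further down:

# ip link add vrf-blue type vrf table 100      # VRF device backed by routing table 100
# ip link set dev vrf-blue up
# ip link set dev eth1 master vrf-blue         # interface-based VRF selection: eth1's connected routes move to table 100
# ip route add default via 192.0.2.1 table 100 # per-VRF default gateway (192.0.2.1 reachable via eth1)
# ip route show table 100                      # the routing table seen by traffic in this VRF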
L3 VPNs of this kind have been used for years to separate traffic of different customers or, for example, general purpose data, voice and video for a single customer.

Containers do not solve this problem with a separate network namespace, because the same process or group of processes needs to use a different routing table "per purpose" (a short sketch contrasting network namespaces with VRFs follows at the end of this comment).

What I am asking for is not that difficult, because we are only concerned with end hosts (unless MAAS resides on a ToR or a leaf switch and we control the switch OS). I need building blocks to use either VRF-lite or full VRFs with MPLS in a sane way, while keeping routing complexity (BGP, MPLS etc.) in the data center provider network managed by other people.

Terminology-wise, I think changes are needed as well: https://github.com/CanonicalLtd/maas-docs/issues/737 - Routing Domain, L3VPN or VRF are common names for what we refer to as a Network Space, which is actually a virtual L3 network with its own complete address space, its own copies of routing tables and dedicated physical or logical host/router interfaces.

Examples:

* https://routingnull0.com/2015/12/14/mpls-l3vpns-part-2/ - case 4 here maps MPLS & L3VPN concepts to leaf-spine;
* http://packetlife.net/blog/2014/apr/15/deploying-datacenter-mpls-vpn-junos/ - leaf-spine + MPLS.

Analogies (not related to computer networking): https://paste.ubuntu.com/26227512/
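As promised above, a rough sketch of why a network namespace is not the answer here: a netns isolates a whole stack per process, whereas a VRF selects a routing table inside one stack. The daemon name and address are made up, the mgmt/storacc VRFs are the ones set up conceptually further below, and `ip vrf exec` needs the newer iproute2 and cgroup bits mentioned in the bug description:

# ip netns add mgmt-ns
# ip netns exec mgmt-ns some-daemon           # hypothetical daemon; the whole process is confined to that (here empty) stack
# ip vrf exec mgmt curl http://192.0.2.50/    # same stack, route lookup via the "mgmt" VRF table
# ip vrf exec storacc curl http://192.0.2.50/ # same binary and destination, storage-access table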
https://bugs.launchpad.net/bugs/1737428
Title: VRF support to solve routing problems associated with multi-homing
Status in juju: New
Status in MAAS: Incomplete
Status in linux package in Ubuntu: Incomplete

Bug description:

Problem description:

* a host is multi-homed if it has multiple network interfaces with L3 addresses configured (physical or virtual interfaces; natural for OpenStack regardless of IPv4/IPv6, and for IPv6 in general) - see 3.3.4 "Local Multihoming" and 3.3.4.2 "Multihoming Requirements" in https://tools.ietf.org/html/rfc1122#page-60;
* if all hosts that need to participate in L3 communication are located on the same L2 network, there is no need for a routing device to be present: ARP/NDP and auto-created directly connected routes are enough;
* multi-homing with hosts located on different L2 networks requires more intelligent routing:
  - "directly connected" routes are no longer enough to talk to all relevant hosts in the same network space;
  - a default gateway in the main routing table may not be the correct routing device that knows where to forward traffic (management network traffic goes to a management switch and router, other traffic goes to an L3 ToR switch but may go via different bonds);
  - even if a default gateway knows where to forward traffic, it may not be the intended physical path (storage replication traffic must go through a specific outgoing interface, not the same interface as storage access traffic, although both interfaces are connected to the same ToR);
  - there is no longer a single "default gateway", as applications need either per-logical-direction routers or to become routers themselves (if destination == X, forward to next-hop Y);
  - leaf-spine architecture is a good example of how multiple L2 networks force you to use spaces that have VLANs in different switch fabrics => one or more hops between hosts with interfaces associated with the same network space;
  - while network spaces implicitly require L3 reachability between each host that has a NIC associated with a network space, the current definition does not mention the routing infrastructure required for that. For a single L2 this problem is hidden by directly connected routes; for multi-L2, no solution is provided or discussed;
* existing solutions to multi-homing require routing table management on a given host: complex static routing rules or dynamic routing, e.g. running an OSPF or BGP daemon on a host (a sketch of such a per-host static scheme follows after this list);
* using static routes is rigid and requires network planning (i.e. working with network engineers, who may have varying degrees of experience, doing VLSM planning etc.);
* using dynamic routing requires a broader integration into an organization's L3 network infrastructure. Routing can be implemented differently across different organizations, and it is a security and operational burden to integrate with a company's routing infrastructure.

Summary: a mechanism is needed to associate an interface with a forwarding table (FIB) which has its own default gateway, and to make an application with a listen(2)ing socket(2) return connected sockets associated with different FIBs. In other words, applications need to implicitly get source/destination-based routing capabilities without the need to use static routing schemes or dynamic routing, and with minimal or no modifications to the applications themselves.

Goals:

* avoid turning individual hosts into routers;
* avoid complex static rules;
* better support multi-fabric deployments with minimum effort (Juju, charms, MAAS, applications, network infrastructure);
* reduce operational complexity (custom L3 infrastructure integration for each deployment);
* reduce delivery risks (L3 infrastructure and L3 department responsiveness vary);
* avoid any form of L2 stretching at the infrastructure level - this is inefficient for various reasons.

NOTE: https://cumulusnetworks.com/blog/vrf-for-linux/ - I recommend reading this post to understand the suggestions below.
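For contrast, this is roughly what the per-host static/policy-routing workaround mentioned above looks like today; every address, table id and interface name is illustrative and has to be planned and kept in sync by hand for each deployment:

# echo '100 repl' >> /etc/iproute2/rt_tables          # hand-allocated table id
# ip route add default via 10.10.5.1 dev bond1 table repl
# ip rule add from 10.10.5.0/24 lookup repl           # source-based selection of the table
# ip route add 10.20.30.0/24 via 10.10.4.1 dev bond0  # pin a destination prefix to a next hop

Multiply this by the number of traffic classes and hosts and it becomes clear why the goals above ask to avoid it.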
How to solve it? What does it mean for Juju to support VRF devices?

* enslave certain devices on provisioning, based on network space information (physical NICs, VLAN devices, bonds AND bridges created for containers must be considered) - VRF devices logically enslave devices similarly to bridges, but work differently (at L3, not L2);
* the above is per network namespace, so it will work equally well in a LXD container.

Conceptually:

# echo 'net.ipv4.tcp_l3mdev_accept = 1' >> /etc/sysctl.conf
# echo 'net.ipv4.udp_l3mdev_accept = 1' >> /etc/sysctl.conf
# sysctl -p
#
# # create additional routing tables
# cat >> /etc/iproute2/rt_tables.d/vrf.conf <<EOF
1 mgmt
10 pub
20 storacc
30 storrepl
EOF
#
# # add and bring up VRF devices
# ip link add mgmt type vrf table 1 && ip link set dev mgmt up
# ip link add pub type vrf table 10 && ip link set dev pub up
# ip link add storacc type vrf table 20 && ip link set dev storacc up
# ip link add storrepl type vrf table 30 && ip link set dev storrepl up
#
# # enslave actual devices to VRF devices (their connected routes move to the VRF tables)
# ip link set mgmtbr0 master mgmt
# ip link set pubbr0 master pub
# ip link set storaccbr0 master storacc
# ip link set storreplbr0 master storrepl
#
# # populate per-routing-table default gateways
# ip route add default via 192.168.0.1 table mgmt
# ip route add default via 172.16.0.1 table pub
# ip route add default via 10.10.4.1 table storacc
# ip route add default via 10.10.5.1 table storrepl
#
# # make your services use INADDR_ANY (0.0.0.0) for listening sockets in charms, if not done already

charm-related:

* (no-op) services with listening sockets on INADDR_ANY will not need any modifications, either on the charm side or at the application level - this is the cheapest way to solve the multi-homing problems;
* (later) more advanced functionality for applications that do not use INADDR_ANY but bind a listening socket to a specific address - this requires the `ip vrf exec` functionality in iproute2 or application modifications.

Notes:

* Let's follow rule number 6 (https://tools.ietf.org/html/rfc1925) and move routing problems to L3 departments. `juju deploy "router"` is a different scenario, which should reside on a model separate from IAAS;
* We are not turning hosts into routers with this - this is a way to move routing decisions to the next hop, which is available on a directly connected route. The problem we are solving here is N next hops instead of just one. Those hops can worry about administrative distance/different routing protocols, route costs/metrics, routing protocol peer authentication etc.;
* The Linux kernel functionality was mostly upstreamed in 4.4;
* This is Linux-only, while a unit agent can also run on Windows (nothing we can do here).

Implementation description:

1. Kernel 4.4 (GA xenial) - a quick way to verify these bits on a running host is sketched after this list:

* CONFIG_NET_VRF=m - present in xenial GA kernels: http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5172
* CONFIG_NET_L3_MASTER_DEV=y - present in xenial GA kernels: http://kernel.ubuntu.com/git/ubuntu/ubuntu-xenial.git/tree/debian.master/config/config.common.ubuntu?id=2c5158e82d497c5eb90d6e2b8aaf07d36cb175f6#n5109

Backports needed from 4.5 - required for VRF-unaware applications that use INADDR_ANY:

* 6dd9a14e92e54895e143f10fef4d0b9abe109aa9 (tcp_l3mdev_accept)
* 63a6fff353d01da5a22b72670c434bf12fa0e3b8 (udp_l3mdev_accept)

Required only for `ip vrf exec` (http://man7.org/linux/man-pages/man8/ip-vrf.8.html) - NOT required for baseline functionality:

* CGROUPS and CGROUP_BPF enabled - xenial HWE only (not HWE-edge).
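A quick sanity check for these prerequisites on a deployed host (exact output varies by kernel and iproute2 version; the two sysctls only exist once the backports above are present):

# grep -E 'CONFIG_NET_VRF|CONFIG_NET_L3_MASTER_DEV' /boot/config-"$(uname -r)"
# modprobe vrf && lsmod | grep -w vrf
# sysctl net.ipv4.tcp_l3mdev_accept net.ipv4.udp_l3mdev_accept
# ip vrf help    # prints usage only with iproute2 >= 4.10 (see the next item); older versions report an unknown object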
2. User space (iproute2)

iproute2 supports the `vrf` keyword in the version packaged with Ubuntu 16.04. More specific functionality like `ip vrf exec <vrf-name>` is available in later versions: https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/commit/?id=1949f82cdf62c074562f04acfbce40ada0aac7e0

git tag --contains=1949f82cdf62c074562f04acfbce40ada0aac7e0
v4.10.0
v4.11.0
...

3. MAAS - already hands over per-subnet default gateways:

https://github.com/maas/maas/blob/2.3.0/src/maasserver/models/node.py#L3325-L3360
https://github.com/maas/maas/blob/2.3.0/src/maasserver/api/machines.py#L363-L378

4. Juju and/or MAAS:

* create per-network-space routing tables (default gateways must be taken from the subnets in MAAS - subnets related to the same space will have different default gateways);
* create VRF devices relevant to network spaces;
* enslave interfaces to VRF devices (this includes Linux bridges created by Juju for containers).

5. Nothing else for baseline functionality, other than configuring software to use 0.0.0.0 (INADDR_ANY, "all interfaces") for listening sockets.

(future work) Configure software to use `ip vrf exec` even if it doesn't support VRFs directly, for the cases where INADDR_ANY is not used. See https://www.kernel.org/doc/Documentation/networking/vrf.txt; note that the setsockopt requirement is worked around via `ip vrf exec` in iproute2 (no need to rewrite every application):

"Applications that are to work within a VRF need to bind their socket to the VRF device:

    setsockopt(sd, SOL_SOCKET, SO_BINDTODEVICE, dev, strlen(dev)+1);

or to specify the output device using cmsg and IP_PKTINFO.

TCP & UDP services running in the default VRF context (ie., not bound to any VRF device) can work across ***all VRF domains*** by enabling the tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:

    sysctl -w net.ipv4.tcp_l3mdev_accept=1
    sysctl -w net.ipv4.udp_l3mdev_accept=1"

http://man7.org/linux/man-pages/man8/ip-vrf.8.html: "This ip-vrf command is a helper to run a command against a specific VRF with the VRF association ***inherited parent to child***."

References:

https://en.wikipedia.org/wiki/Multihoming
http://blog.ipspace.net/2016/04/host-to-network-multihoming-kludges.html
http://blog.ipspace.net/2010/09/ribs-and-fibs.html
https://cumulusnetworks.com/blog/vrf-for-linux/ <- this is a must-read
https://docs.cumulusnetworks.com/display/DOCS/Virtual+Routing+and+Forwarding+-+VRF
http://netdevconf.org/1.2/session.html?david-ahern-talk
https://www.kernel.org/doc/Documentation/networking/vrf.txt
https://github.com/Mellanox/mlxsw/wiki/Virtual-Routing-and-Forwarding-%28VRF%29
http://blog.ipspace.net/2016/02/running-bgp-on-servers.html
https://tools.ietf.org/html/rfc7938
http://www.routereflector.com/2016/11/working-with-vrf-on-linux/ (usage example on 16.04)