Then a matrix of how each requires what modifications in the network
code. Of course all players need to agree that the description is
accurate.
Is there such a document?
cheers,
jamal
Hi,
the attached document describes network isolation at layer 2 and at
layer 3. It presents the pros and cons of the different
approaches, their common points and the network code they impact.
I hope it will be helpful :)
Cheers.
-- Daniel
Isolating and virtualizing the network
--------------------------------------
Some definitions:
-----------------
isolation : This is a restrictive technique which divides the set of
available system objects into smaller subsets, each assigned to a group
of processes. This technique ensures an application will use only a
subset of the system resources and will never access other
resources.
virtualization : This technique gives an application the illusion
that it owns all the system resources instead of the subset
provided by the isolation.
container : the base element which provides the isolation and the
virtualization and inside which applications run.
system container : an operating system running inside a container.
application container : an application running inside a container.
checkpoint/restart : take a snapshot of a container at a given time
and recreate the container from this snapshot.
mobility : checkpoint/restart used to move a container from one host to
another host.
----------------------------
Currently, containers are being developed in the kernel with the
following functions :
* separate the system resources between containers in order
  to prevent an application running in a container from
  accessing resources outside the container. That
  facilitates resource management, ensures the
  application is jailed and increases security.
* virtualize the resources, which avoids resource conflicts
  between containers and allows several instances of
  the same server to run without modifying their network
  configuration.
* the combination of isolation and virtualization is
  the basis for checkpoint/restart. The checkpoint is
  easier because the resources are identified by container,
  and the restart is possible because the applications can
  be recreated with the same resource identifiers without
  conflicts. For example, an application has the pid 1000;
  it is checkpointed, and when it is restarted the same pid
  is assigned to it. It will not conflict because pids are
  isolated and virtualized.
Among all the system resources, the network is one of the biggest parts
to isolate and virtualize. Several solutions have been proposed, with
different approaches and different implementations.
Layer 2 isolation and virtualization
------------------------------------
The virtualization acts at the network device level. The routes and
the sockets are isolated. Each container has its own network device
and its own routes. The network must be configured in each container.
This approach brings a very strong isolation and a perfect
virtualization for the system containers.
- Ingress traffic
The packets arrive at the real network device, outside of the
container. Depending on the destination, the packets are forwarded to
the network device assigned to the container. From this point, the
path is the usual one: the packets go through the routes and the
sockets layers, which are isolated inside the container.
- Outgoing traffic
The packets go through the sockets, the routes, the network device
assigned to the container and finally to the real device.
Implementation:
---------------
The patchset from Andrey Savochkin, of the OpenVZ team, implements this
approach using the namespace concept. The network devices are no longer
stored in the "dev_base_list" but in a list held by the network
namespace structure. Each container has its own network namespace.
Network device accesses have been changed to use the device list
relative to the current namespace's context instead of the global
network device list. The same has been done for the routing tables:
they are all relative to the namespace and are no longer global
statics. The creation of a new network namespace implies the creation
of a new set of routing tables.
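The namespace-relative device list can be illustrated with a small
user-space sketch. The types and function names below (net_namespace,
ns_register_device, ns_find_device) are hypothetical and much simpler
than the kernel structures; they only show that a lookup walks the list
of one namespace instead of a global list, so the same device name can
exist in several namespaces.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified types; not the kernel's real structures. */
struct net_device {
    const char *name;
    struct net_device *next;
};

struct net_namespace {
    struct net_device *dev_list;   /* per-namespace list, replaces a global one */
};

/* Link a device at the head of one namespace's private list. */
static void ns_register_device(struct net_namespace *ns, struct net_device *dev)
{
    dev->next = ns->dev_list;
    ns->dev_list = dev;
}

/* Look a device up by name, walking only the given namespace's list. */
static struct net_device *ns_find_device(const struct net_namespace *ns,
                                         const char *name)
{
    struct net_device *dev;

    for (dev = ns->dev_list; dev != NULL; dev = dev->next)
        if (strcmp(dev->name, name) == 0)
            return dev;
    return NULL;
}
```

With two namespaces each registering their own "eth0", a lookup in one
namespace never returns the device of the other.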
After the creation of a container, no network device exists in it. The
device is created from outside by the container's parent. The
communication between the new container and the outside is done via a
special pair device which has one end in each namespace. The MAC
addresses must be specified, and these addresses should be managed by
the container developers in order to ensure MAC uniqueness.
After this network device creation step in each namespace, the
network configuration is done as usual, in other words, with a new
operating system initialization or with the 'ifconfig' or 'ip'
command.
 -----     ------     -------             ------     ----
| LAN |<->| eth0 |<->| veth0 |<-|ns(1)|->| eth0 |<->| IP |
 -----     ------     -------             ------     ----
(1) : ns = namespace (aka. Virtual Environment).
The advantage of this implementation is that the algorithms used by the
network stack are not touched; only the network data access is
modified. That facilitates the maintenance and the evolution of the
network code. The drawback is that, in the case of application
containers, the number of containers can be much larger (hundreds of
them), which implies a larger number of network devices, a
longer path through the virtualization layer and a higher
resource consumption.
Layer 3 isolation and virtualization
------------------------------------
The virtualization acts at the IP level. The routes can be isolated
and the sockets are isolated.
This approach does not bring isolation at the network device
layer. The isolation and the virtualization are weaker than with
layer 2, but the overhead is negligible and the resource
consumption is close to that of a non-virtualized environment.
Furthermore, the isolation at the IP level makes the administration
very easy.
- Ingress traffic
The packets arrive at the real device and go through the routes
engine. From this point, the route used is enough to determine which
container the traffic goes to and which subset of sockets is assigned
to that container.
- Outgoing traffic
The packets go through the sockets, the assigned routes and finally to
the real device.
The sockets are isolated for each container, and the current container
context is used to retrieve the IP address owned by the
container. When the source address is not specified, the owned IP
address is used to fill the source address of the packet. This is done
for raw, icmp, multicast and broadcast traffic, for tcp connections and
for udp message sends. If the bind is done on an interface instead of an
IP address, the source address should also be checked to be owned by the
container.
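The source address handling described above can be sketched as follows.
This is only an illustration in user-space C: the container structure,
the ADDR_ANY constant and the select_source() helper are hypothetical
names, and choosing the first owned address as the default is an
assumption, not something stated by the implementations discussed here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ADDR_ANY 0u   /* stands in for an unspecified source address */

/* Hypothetical container context: the IPv4 addresses it owns. */
struct container {
    const uint32_t *addrs;
    size_t naddrs;
};

/* Return true if addr is one of the addresses owned by the container. */
static bool container_owns(const struct container *c, uint32_t addr)
{
    for (size_t i = 0; i < c->naddrs; i++)
        if (c->addrs[i] == addr)
            return true;
    return false;
}

/*
 * Pick the source address of an outgoing packet.
 * Returns 0 and stores the address in *out, or -1 if the requested
 * address is not owned by the container (or the container owns none).
 */
static int select_source(const struct container *c, uint32_t requested,
                         uint32_t *out)
{
    if (requested == ADDR_ANY) {
        if (c->naddrs == 0)
            return -1;
        *out = c->addrs[0];   /* assumption: first owned address is the default */
        return 0;
    }
    if (!container_owns(c, requested))
        return -1;            /* address belongs to someone else */
    *out = requested;
    return 0;
}
```

An unspecified source is filled with an owned address, an owned address
is accepted as is, and any other address is refused.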
Implementation:
---------------
Concerning the implementation, several solutions exist. All of them
rely on the namespace concept, but instead of making all the network
resources relative to the namespace, the namespace pointer is used as
an identifier.
One of these solutions is bind filtering. This implementation is
the simplest to realize but it brings little isolation. If a mobility
solution must be implemented on top of that isolation, bind
filtering should be coupled with socket isolation. Bind
filtering consists in placing hooks at a few strategic points
in the function calls (bind, connect, send datagram, etc ...) in order
to fill the source address and prevent a bind to an IP address outside
of the container. The destination container should be determined from
the ingress traffic.
The second solution consists in relying on the route engine to ensure
the isolation. The routes are all accessible from all the namespaces
but they record which namespace they belong to. This
way, when the traffic is outgoing, only the routes belonging to
the namespace are used. When the traffic is incoming, it goes through
a route, and because this route carries the owner namespace information,
the traffic can go to the right namespace. The advantage of this approach
is an isolation at the IP layer close to what layer 2 isolation
provides, without loss of performance. The drawback is the
complexity of the code, which is strongly tied to the routing
algorithms, and that does not facilitate the maintenance.
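The route-based isolation can be pictured with the following sketch. It
is a simplified user-space model, not the routing code itself: the route
structure, route_output() and route_input_ns() are hypothetical names,
the table is a flat array, and a plain longest-prefix match stands in
for the real route resolver.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical route entry tagged with its owner namespace. */
struct route {
    uint32_t prefix;     /* network address, host byte order */
    uint32_t mask;       /* contiguous network mask, host byte order */
    const void *ns;      /* owner namespace, used only as an identifier */
};

/* Outgoing traffic: longest-prefix match restricted to routes owned by ns. */
static const struct route *route_output(const struct route *tbl, size_t n,
                                        const void *ns, uint32_t dst)
{
    const struct route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (tbl[i].ns != ns)
            continue;                       /* route of another namespace */
        if ((dst & tbl[i].mask) != tbl[i].prefix)
            continue;                       /* destination not covered */
        if (best == NULL || tbl[i].mask > best->mask)
            best = &tbl[i];
    }
    return best;
}

/*
 * Incoming traffic: match the destination against all routes; the owner
 * of the best route tells which namespace receives the packet.
 */
static const void *route_input_ns(const struct route *tbl, size_t n,
                                  uint32_t dst)
{
    const struct route *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if ((dst & tbl[i].mask) != tbl[i].prefix)
            continue;
        if (best == NULL || tbl[i].mask > best->mask)
            best = &tbl[i];
    }
    return best != NULL ? best->ns : NULL;
}
```

The table is shared by every namespace; only the owner tag decides which
entries a namespace may use on output and which namespace an incoming
packet is delivered to.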
 -----     ------     -----------
| LAN |<->| eth0 |<->| ns(1)| IP |
 -----     ------     -----------
(1) : ns = namespace (aka. Virtual Environment).
Common points between layer 2 and layer 3 implementations
---------------------------------------------------------
Because the need for socket isolation is the same for layer 2
and layer 3, the socket isolation is the same for the two
approaches.
The tuple key (source address, source port, destination address,
destination port) is extended with the network namespace. At bind
time, the port usage verification is extended with the network
namespace pointer too. A port is considered already in use only if both
the port and the network namespace match. If the port matches but the
namespace does not, the port is in use in another namespace and the
bind is allowed.
There can be several listening points on the same port with the source
address set to inaddr_any. When an incoming connection arrives, the
destination namespace is already resolved and the right connection is
found without ambiguity.
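The extended port check can be sketched in a few lines. The bound_sock
structure and the two helpers are hypothetical and ignore addresses
entirely (every socket is assumed bound to inaddr_any); the point is
only that the namespace pointer becomes part of the comparison key.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bound-socket record; ns is used only as an identifier. */
struct bound_sock {
    uint16_t port;
    const void *ns;
};

/* A port is in use for ns only if both the port and the namespace match. */
static bool port_in_use(const struct bound_sock *tbl, size_t n,
                        const void *ns, uint16_t port)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].port == port && tbl[i].ns == ns)
            return true;
    return false;
}

/* Find the listener for an incoming connection once its namespace is known. */
static const struct bound_sock *find_listener(const struct bound_sock *tbl,
                                              size_t n, const void *ns,
                                              uint16_t port)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].port == port && tbl[i].ns == ns)
            return &tbl[i];
    return NULL;
}
```

Two namespaces can both listen on port 80; a third namespace still sees
the port as free, and an incoming connection resolved to one namespace
finds exactly one listener.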
Network resources
------------------
L2 : Layer 2
L3 : Layer 3
BF : Bind Filtering
                  -------------------------------------------
                  |     L2      |     L3      |     BF      |
-------------------------------------------------------------
| Sockets         | Isolated    | Isolated    | Isolated(1) |
-------------------------------------------------------------
| Routes          | Isolated    | Isolated(2) | X           |
-------------------------------------------------------------
| Inetdev         | Virtualized | Virtualized | X           |
-------------------------------------------------------------
| Network devices | Virtualized | X           | X           |
-------------------------------------------------------------
(1) : The sockets should be isolated in the case of mobility
(2) : The routes can be isolated or not
Network code modifications
--------------------------
            -----------------------------------------------
            |        L2        |       L3       |   BF    |
-----------------------------------------------------------
|           | struct sock      |                |         |
| Sockets   | hash tables      | idem           | idem(1) |
|           | async sock event |                |         |
-----------------------------------------------------------
| Routes    | routes table     | route cache    | X       |
|           |                  | route resolver |         |
-----------------------------------------------------------
|           |                  | struct ifaddr  |         |
| Inetdev   | X                | add addr       | X       |
|           |                  | del addr       |         |
|           |                  | gifconf        |         |
-----------------------------------------------------------
|           | specific net dev |                |         |
| Netdevice | loopback         | X              | X       |
|           | dev list         |                |         |
-----------------------------------------------------------
(1) : if mobility is needed
Solution pros/cons
------------------
                 -------------------------------------------
                 |     L2      |     L3      |     BF      |
------------------------------------------------------------
| Isolation      | Excellent   | Good        | Weak        |
------------------------------------------------------------
| Virtualization | Total       | Partial     | None        |
------------------------------------------------------------
| Network setup  | Complicated | Trivial     | Simple      |
------------------------------------------------------------
| Overhead       | High        | Negligible  | Negligible  |
------------------------------------------------------------
More information:
-----------------
The container mobility paper at Ottawa Linux Symposium :
http://lxc.sourceforge.net/doc/ols2006/lxc-ols2006.pdf
The OpenVZ wiki comparing layer 2 and layer 3 :
http://wiki.openvz.org/Containers/Network_virtualization