On Thu, Jun 06, 2019 at 12:09:12PM +0200, Stefano Garzarella wrote:
> Hi all,
> this is a v2 of a proposal addressing the comments made by Dexuan, Stefan,
> and Jorgen.
>
> v1: https://www.spinics.net/lists/netdev/msg570274.html
>
> We can define two types of transport that we have to handle at the same
> time (e.g. in a nested VM we would have both types of transport running
> together):
>
> - 'host->guest' transport: it runs in the host and is used to communicate
>   with the guests of a specific hypervisor (KVM, VMware or Hyper-V). It
>   also runs in a guest that has nested guests, to communicate with them.
>
>   [Phase 2]
>   We can support multiple 'host->guest' transports running at the same
>   time, but on x86 only one hypervisor uses VMX at any given time.
>
> - 'guest->host' transport: it runs in the guest and is used to communicate
>   with the host.
>
> The main goal is to find a way to decide which transport to use in these
> cases:
>
> 1. connect() / sendto()
>
>    a. use the 'host->guest' transport if the destination is a guest
>       (dest_cid > VMADDR_CID_HOST).
>
>       [Phase 2]
>       In order to support multiple 'host->guest' transports running at
>       the same time, we should assign CIDs uniquely across all transports.
>       In this way, a packet generated by the host side will get directed
>       to the appropriate transport based on the CID.
>
>    b. use the 'guest->host' transport if the destination is the host or
>       the hypervisor
>       (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR).
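Commenting inline on the connect()/sendto() rules above: this selection
seems simple enough to live directly in af_vsock.ko. A minimal sketch of
what I imagine, untested and with invented names (vsock_transport_get(),
transport_g2h and transport_h2g are not existing code; VMADDR_CID_* come
from <uapi/linux/vm_sockets.h> and struct vsock_transport from
<net/af_vsock.h>):

/* Registered when a transport module loads: virtio/VMCI-guest/hv_sock
 * on the guest side, vhost/VMCI-host on the host side.
 */
static const struct vsock_transport *transport_g2h; /* guest->host */
static const struct vsock_transport *transport_h2g; /* host->guest */

static const struct vsock_transport *vsock_transport_get(u32 dest_cid)
{
	/* 1.b: the destination is the host or the hypervisor. */
	if (dest_cid == VMADDR_CID_HOST ||
	    dest_cid == VMADDR_CID_HYPERVISOR)
		return transport_g2h;

	/* 1.a: the destination is a guest.  If no host->guest transport
	 * is loaded, we return NULL and the caller drops the packet,
	 * since guest->guest communication is not allowed.
	 */
	if (dest_cid > VMADDR_CID_HOST)
		return transport_h2g;

	return NULL;
}

For [Phase 2] the transport_h2g pointer would presumably become a
per-CID lookup, so that multiple host->guest transports can coexist.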
> 2. listen() / recvfrom()
>
>    a. use the 'host->guest' transport if the socket is bound to
>       VMADDR_CID_HOST, or if it is bound to VMADDR_CID_ANY and there is
>       no 'guest->host' transport. We could also define a new
>       VMADDR_CID_LISTEN_FROM_GUEST in order to address this case.
>
>       [Phase 2]
>       We can support network namespaces to create independent AF_VSOCK
>       addressing domains:
>       - could be used to partition VMs between hypervisors or at a finer
>         granularity;
>       - could be used to isolate host applications from guest
>         applications using the same ports with CID_ANY;
>
>    b. use the 'guest->host' transport if the socket is bound to a local
>       CID different from VMADDR_CID_HOST (the guest CID obtained with
>       IOCTL_VM_SOCKETS_GET_LOCAL_CID), or if it is bound to
>       VMADDR_CID_ANY (to be backward compatible).
>       Also in this case, we could define a new VMADDR_CID_LISTEN_FROM_HOST.
>
>    c. shared port space between transports
>       For incoming requests or packets, we should be able to choose
>       which transport to use by looking at the requested port.
>
>       - stream sockets already support a shared port space between
>         transports (one port can be assigned to only one transport)
>
>       [Phase 2]
>       - datagram sockets will support it, but for now the VMCI transport
>         is the default transport for any host-side datagram socket (KVM
>         and Hyper-V do not yet support datagram sockets)
>
> We will make the loading of af_vsock.ko independent of the transports in
> order to allow:
> - creating an AF_VSOCK socket without any loaded transport;
> - listening on a socket (e.g. bound to VMADDR_CID_ANY) without any
>   loaded transport;
>
> Hopefully, we could move MODULE_ALIAS_NETPROTO(PF_VSOCK) from
> vmci_transport.ko to af_vsock.ko.
> [Jorgen will check if this will impact the existing VMware products]
>
> Notes:
> - For Hyper-V sockets, the host can only be Windows. No changes should
>   be required on the Windows host to support the changes in this
>   proposal.
>
> - Communication between guests is not allowed on any transport, so we
>   can drop packets sent from a guest to another guest (dest_cid >
>   VMADDR_CID_HOST) if the 'host->guest' transport is not available.
>
> - The [Phase 2] tag is used to identify things that can be done at a
>   later stage, but that should be taken into account in this design.
>
> - Namespace support will be developed in [Phase 2] or in a separate
>   project.
>
> Comments and suggestions are welcome.
> I'll be on PTO for the next two weeks, so sorry in advance if I answer
> late.
>
> If we agree on this proposal, when I get back, I'll start working on the
> code to get a first PATCH RFC.
Stefano,
I've reviewed your proposal and it looks good for solving nested
virtualization. The tricky implementation detail will be supporting
listen sockets, especially those bound to VMADDR_CID_ANY, so that they
can be reached from both transports; see the sketch below for roughly
what I have in mind.
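This is only a sketch with an invented name (vsock_find_listener(); the
existing __vsock_find_bound_socket() already does a similar walk): the
bound-socket table stays in af_vsock.ko, so a single VMADDR_CID_ANY
listener is visible to every loaded transport.

/* Called by a transport when a connection request arrives.  The table
 * is global, so a VMADDR_CID_ANY listener matches requests coming from
 * either transport.
 */
static struct sock *vsock_find_listener(struct sockaddr_vm *dst)
{
	struct vsock_sock *vsk;

	list_for_each_entry(vsk, vsock_bound_sockets(dst), bound_table) {
		u32 cid = vsk->local_addr.svm_cid;

		if (vsk->local_addr.svm_port == dst->svm_port &&
		    (cid == dst->svm_cid || cid == VMADDR_CID_ANY))
			return sk_vsock(vsk);
	}

	return NULL;
}

The interesting part will then be the handoff: the child socket created
for the request has to take its transport from the incoming request
rather than from the listener.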
Stefan