Hi all,
this is the v2 of a proposal addressing the comments made by Dexuan,
Stefan, and Jorgen.
v1: https://www.spinics.net/lists/netdev/msg570274.html

We can define two types of transport that we have to handle at the
same time (e.g. in a nested VM we would have both types of transport
running together):

- 'host->guest' transport: it runs in the host and it is used to
  communicate with the guests of a specific hypervisor (KVM, VMware,
  or Hyper-V). It also runs in a guest that has nested guests, to
  communicate with them.

  [Phase 2]
  We can support multiple 'host->guest' transports running at the
  same time, but on x86 only one hypervisor uses VMX at any given
  time.

- 'guest->host' transport: it runs in the guest and it is used to
  communicate with the host.

The main goal is to find a way to decide which transport to use in
these cases (rough C sketches of the selection logic follow the notes
at the end):

1. connect() / sendto()

   a. use the 'host->guest' transport, if the destination is a guest
      (dest_cid > VMADDR_CID_HOST).

      [Phase 2]
      In order to support multiple 'host->guest' transports running
      at the same time, we should assign CIDs uniquely across all
      transports. In this way, a packet generated by the host side
      will get directed to the appropriate transport based on the
      CID.

   b. use the 'guest->host' transport, if the destination is the host
      or the hypervisor
      (dest_cid == VMADDR_CID_HOST || dest_cid == VMADDR_CID_HYPERVISOR).

2. listen() / recvfrom()

   a. use the 'host->guest' transport, if the socket is bound to
      VMADDR_CID_HOST, or it is bound to VMADDR_CID_ANY and there is
      no 'guest->host' transport. We could also define a new
      VMADDR_CID_LISTEN_FROM_GUEST to address this case.

      [Phase 2]
      We can support network namespaces to create independent
      AF_VSOCK addressing domains:
      - they could be used to partition VMs between hypervisors or at
        a finer granularity;
      - they could be used to isolate host applications from guest
        applications using the same ports with CID_ANY.

   b. use the 'guest->host' transport, if the socket is bound to a
      local CID different from VMADDR_CID_HOST (the guest CID
      obtained with IOCTL_VM_SOCKETS_GET_LOCAL_CID; see the userspace
      example after the notes), or it is bound to VMADDR_CID_ANY (to
      be backward compatible). Also in this case, we could define a
      new VMADDR_CID_LISTEN_FROM_HOST.

   c. shared port space between transports

      For incoming requests or packets, we should be able to choose
      which transport to use by looking at the requested port:
      - stream sockets already support a shared port space between
        transports (one port can be assigned to only one transport);
      - [Phase 2] datagram sockets will support it, but for now the
        VMCI transport is the default transport for any host-side
        datagram socket (KVM and Hyper-V do not yet support datagram
        sockets).

We will make the loading of af_vsock.ko independent of the transports
to allow us to:
- create an AF_VSOCK socket without any loaded transports;
- listen on a socket (e.g. bound to VMADDR_CID_ANY) without any
  loaded transports.

Hopefully, we can move MODULE_ALIAS_NETPROTO(PF_VSOCK) from
vmci_transport.ko to af_vsock.ko. [Jorgen will check whether this
impacts the existing VMware products.]

Notes:
- For Hyper-V sockets, the host can only be Windows. No changes
  should be required on the Windows host to support this proposal.
- Communication between guests is not allowed on any transport, so we
  can drop packets sent from one guest to another
  (dest_cid > VMADDR_CID_HOST) if the 'host->guest' transport is not
  available.
- The [Phase 2] tag identifies things that can be done at a later
  stage, but that should be taken into account during this design.
- Namespace support will be developed in [Phase 2] or in a separate
  project.
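
To make case 1 concrete, here is a minimal C sketch of the proposed
connect()/sendto() selection. The transport_h2g/transport_g2h handles
and the function name are invented for illustration; the actual
registration mechanism is part of the work to be done. Constants come
from <linux/vm_sockets.h>.

struct vsock_transport;				/* opaque in this sketch */

/* Hypothetical per-direction handles, set when a transport module
 * registers with the af_vsock core; either may be NULL (e.g. no
 * 'host->guest' transport loaded in a guest without nested guests). */
static const struct vsock_transport *transport_h2g;
static const struct vsock_transport *transport_g2h;

/* Case 1: pick the transport for connect()/sendto() based on the
 * destination CID. */
static const struct vsock_transport *
vsock_connect_transport(unsigned int dest_cid)
{
	if (dest_cid > VMADDR_CID_HOST)
		return transport_h2g;	/* 1a: destination is a guest */

	/* 1b: dest_cid == VMADDR_CID_HOST ||
	 *     dest_cid == VMADDR_CID_HYPERVISOR */
	return transport_g2h;
}

Note that a NULL return in the 1a branch gives us, for free, the
guest->guest drop described in the notes above.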
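
Similarly, a sketch of case 2, using the same hypothetical handles.
A socket bound to VMADDR_CID_ANY prefers the 'guest->host' transport
for backward compatibility (2b) and falls back to 'host->guest' only
when no 'guest->host' transport is loaded (2a):

/* Case 2: pick the transport a listening socket uses, based on the
 * CID it is bound to. */
static const struct vsock_transport *
vsock_listen_transport(unsigned int local_cid)
{
	if (local_cid == VMADDR_CID_HOST)
		return transport_h2g;			/* 2a */

	if (local_cid == VMADDR_CID_ANY)
		return transport_g2h ? transport_g2h	/* 2b: backward compatible */
				     : transport_h2g;	/* 2a: no 'guest->host' */

	return transport_g2h;	/* 2b: bound to the guest's own CID */
}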
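
For reference, this is how a guest application obtains the local CID
mentioned in 2b today (this ioctl already exists, nothing here is
new):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/vm_sockets.h>

int main(void)
{
	unsigned int cid;
	int fd = open("/dev/vsock", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/vsock");
		return 1;
	}

	if (ioctl(fd, IOCTL_VM_SOCKETS_GET_LOCAL_CID, &cid) < 0) {
		perror("IOCTL_VM_SOCKETS_GET_LOCAL_CID");
		close(fd);
		return 1;
	}

	printf("local CID: %u\n", cid);
	close(fd);
	return 0;
}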

Comments and suggestions are welcome.
I'll be on PTO for the next two weeks, so apologies in advance if I'm
slow to answer. If we agree on this proposal, when I get back I'll
start working on the code to send a first RFC patch.

Cheers,
Stefano