Hi, I'd like to start a discussion about virtual PCIe link widths and speeds in QEMU to figure out how we progress past the 2.5GT/s, x1 links we advertise today. This matters for assigned devices, as the endpoint driver may not enable full utilization of the physical link if the upstream port only advertises minimal capabilities. One GPU assignment user has measured that they see only an average transfer rate of 3.2GB/s with the current code, but hacking the downstream port to advertise an 8GT/s, x16 link allows them to get 12GB/s. Obviously not all devices and drivers will have this dependency and see these kinds of improvements, or perhaps any improvement at all.
The first problem seems to be how we expose these link parameters in a way that makes sense and supports backwards compatibility and migration. I think we want the flexibility to allow the user to specify, per PCIe device, the link width and at least the maximum link speed, if not the actual discrete link speeds supported. However, while I want to provide this flexibility, I don't think it makes sense to burden the user with specifying these just to get reasonable defaults. So I would propose that we a) add link parameters to the base PCIe device class and b) set defaults based on the machine type (a sketch of both is in the P.S. below). Additionally, these machine type defaults would only apply to generic PCIe root ports and switch ports; anything based on real hardware would be fixed, e.g. ioh3420 would stay at 2.5GT/s, x1 unless overridden by the user. Existing machine types would also stay at this "legacy" rate, while pc-q35-3.2 might bring all generic devices up to the PCIe 4.0 maximums, x32 width and 16GT/s, where per-endpoint negotiation would bring us back to negotiated widths and speeds matching the endpoint. Reasonable?

Next I think we need to look at how and when we do virtual link negotiation. We're mostly discussing a virtual link, so I think negotiation is simply filling in the negotiated speed and width with the fastest and widest values common to the endpoint and upstream port. For assigned devices, this should match the endpoint's existing negotiated link parameters. However, devices can dynamically change their link speed (perhaps also width?), so I believe a current link speed of 2.5GT/s could upshift to 8GT/s without any sort of visible renegotiation. Does this mean that we should have link parameter callbacks from downstream port to endpoint? Or maybe the downstream port link status register should effectively be an alias for LNKSTA of devfn 00.0 of the downstream device when it exists. We only need to report a consistent link status value when someone looks at it, so reading directly from the endpoint probably makes more sense than any sort of interface to keep the value current (also sketched in the P.S.).

If we take the above approach with LNKSTA (probably also LNKSTA2), is any sort of "negotiation" required? We're automatically negotiated if the capabilities of the upstream port are a superset of the endpoint's capabilities. What do we do, and what do we care about, when the upstream port is a subset of the endpoint though? For example, an 8GT/s, x16 endpoint is installed into a 2.5GT/s, x1 downstream port. On real hardware we obviously negotiate the endpoint down to the downstream port's parameters. We could do that with an emulated device, but this is the scenario we have today with assigned devices and we simply leave the inconsistency. I don't think we actually want to force the physical device to negotiate down to match a virtual downstream port (and there would be lots of complications in doing so). Do we simply trigger a warning that this may result in non-optimal performance and leave the inconsistency?

This email is already too long, but I also wonder whether we should consider additional vfio-pci interfaces to trigger a link retraining or allow virtualized access to the physical upstream port config space. Clearly we need to consider multi-function devices and whether there are useful configurations that could benefit from such access.

Thanks for reading, please discuss,

Alex
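P.S. To make (a) and (b) above a bit more concrete, here's a rough, untested sketch of what the per-port properties and machine type pinning might look like. The property names, the speed/width fields on PCIESlot, the speed encoding, and the compat entries are all invented for illustration; nothing here is a concrete proposal for the actual interface:

#include "qemu/osdep.h"
#include "hw/pci/pcie_port.h"
#include "hw/qdev-properties.h"

/*
 * Assumes PCIESlot grows uint8_t speed/width fields.  Speed uses the
 * LNKCAP Supported Link Speeds encoding (1 = 2.5GT/s ... 4 = 16GT/s),
 * width is the lane count.  New machine types get the PCIe 4.0
 * maximums by default; compat entries pin old machine types to
 * today's values.
 */
static Property pcie_slot_link_props[] = {
    DEFINE_PROP_UINT8("x-max-link-speed", PCIESlot, speed, 4),  /* 16GT/s */
    DEFINE_PROP_UINT8("x-max-link-width", PCIESlot, width, 32), /* x32 */
    DEFINE_PROP_END_OF_LIST(),
};

/* Hypothetical compat entries keeping pre-3.2 machines at 2.5GT/s, x1 */
#define PCIE_LINK_COMPAT \
    { .driver = "pcie-root-port", .property = "x-max-link-speed", .value = "1" }, \
    { .driver = "pcie-root-port", .property = "x-max-link-width", .value = "1" },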
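And for the LNKSTA-as-alias idea, here's roughly what I'm picturing: read the negotiated values straight out of devfn 00.0 whenever someone reads the port's copy of LNKSTA. Again an untested sketch; pcie_downstream_lnksta() and the config-read hook are made up, the rest is the existing API as I understand it:

#include "qemu/osdep.h"
#include "qemu/range.h"
#include "hw/pci/pci_bridge.h"
#include "hw/pci/pci_bus.h"
#include "hw/pci/pcie.h"

/*
 * Negotiated speed/width to report in the downstream port's LNKSTA,
 * taken from devfn 00.0 on the secondary bus when a PCIe device is
 * present there, otherwise the port's own stored value.
 */
static uint16_t pcie_downstream_lnksta(PCIDevice *port)
{
    PCIBus *sec_bus = pci_bridge_get_sec_bus(PCI_BRIDGE(port));
    PCIDevice *ep = sec_bus->devices[PCI_DEVFN(0, 0)];
    PCIDevice *src = (ep && pci_is_express(ep) && ep->exp.exp_cap) ? ep : port;

    return pci_get_word(src->config + src->exp.exp_cap + PCI_EXP_LNKSTA) &
           (PCI_EXP_LNKSTA_CLS | PCI_EXP_LNKSTA_NLW);
}

/* Refresh LNKSTA from the endpoint whenever the port's copy is read */
static uint32_t pcie_downstream_config_read(PCIDevice *port,
                                            uint32_t address, int len)
{
    uint8_t *exp_cap = port->config + port->exp.exp_cap;

    if (ranges_overlap(address, len,
                       port->exp.exp_cap + PCI_EXP_LNKSTA, 2)) {
        uint16_t lnksta = pci_get_word(exp_cap + PCI_EXP_LNKSTA);

        lnksta &= ~(PCI_EXP_LNKSTA_CLS | PCI_EXP_LNKSTA_NLW);
        lnksta |= pcie_downstream_lnksta(port);
        pci_set_word(exp_cap + PCI_EXP_LNKSTA, lnksta);
    }

    return pci_default_read_config(port, address, len);
}

Whether we'd really want to hook the read path like this, versus syncing the port's LNKSTA whenever the endpoint's changes, is exactly the callback question above; the sketch just illustrates that read-time is probably sufficient, since nothing needs the value to be current between reads.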