On Tue, 16 Oct 2018 11:21:28 -0400 "Michael S. Tsirkin" <m...@redhat.com> wrote:
> On Mon, Oct 15, 2018 at 02:18:41PM -0600, Alex Williamson wrote:
> > Hi,
> > 
> > I'd like to start a discussion about virtual PCIe link width and
> > speeds in QEMU to figure out how we progress past the 2.5GT/s, x1
> > width links we advertise today.  This matters for assigned devices
> > as the endpoint driver may not enable full physical link utilization
> > if the upstream port only advertises minimal capabilities.  One GPU
> > assignment user has measured that they only see an average transfer
> > rate of 3.2GB/s with current code, but hacking the downstream port
> > to advertise an 8GT/s, x16 width link allows them to get 12GB/s.
> > Obviously not all devices and drivers will have this dependency and
> > see these kinds of improvements, or perhaps any improvement at all.
> > 
> > The first problem seems to be how we expose these link parameters in
> > a way that makes sense and supports backwards compatibility and
> > migration.
> 
> Isn't this just for vfio though? So why worry about migration?

Migration is coming for vfio devices, mdev devices in the near(er)
term, but I wouldn't be too terribly surprised to see device-specific
migration support either.  Regardless, we support hotplug of vfio
devices, therefore we cannot focus only on cold-plug scenarios, and
any hotplug scenario must work irrespective of whether the VM has
previously been migrated.  If we start with a x16/8GT root port with
an assigned GPU, unplug the GPU, migrate, and hot-add a GPU on the
target, it might behave differently if that root port is only exposing
x1/2.5GT capabilities.

I did consider whether devices can dynamically change their speed and
width capabilities.  For instance, the supported link speeds vector in
LNKCAP2 is indicated as hardware-init, so I think software would
reasonably expect that these values cannot change; however, the max
link speed and max link width values in LNKCAP are simply read-only.
Flirting with which registers software might consider dynamic, when
they're clearly not dynamic on real hardware, seems troublesome though.

> > I think we want the flexibility to allow the user to specify per
> > PCIe device the link width and at least the maximum link speed, if
> > not the actual discrete link speeds supported.  However, while I
> > want to provide this flexibility, I don't necessarily think it makes
> > sense to burden the user to always specify these to get reasonable
> > defaults.  So I would propose that we a) add link parameters to the
> > base PCIe device class and b) set defaults based on the machine
> > type.  Additionally, these machine type defaults would only apply to
> > generic PCIe root ports and switch ports; anything based on real
> > hardware would be fixed, ex. ioh3420 would stay at 2.5GT/s, x1
> > unless overridden by the user.  Existing machine types would also
> > stay at this "legacy" rate, while pc-q35-3.2 might bring all generic
> > devices up to PCIe 4.0 specs, x32 width and 16GT/s, where the
> > per-endpoint negotiation would bring us back to negotiated widths
> > and speeds matching the endpoint.  Reasonable?
> 
> Generally yes. Last time I looked, there's a bunch of stuff the spec
> says we need to do for the negotiation. E.g. the guest can at any
> time request width re-negotiation. Maybe most guests don't do it but
> it's still in the spec and we never know whether anyone will do it in
> the future.
> 
> VFIO is often a compromise but for virtual devices I'd prefer we are
> strictly compliant if possible.

I would also want to be as spec compliant as possible, and we'll need
to think about how to incorporate things like link change
notifications; these may require additional support from vfio if we
can capture the event on the host and plumb it through the virtual
downstream port.  In general though, I think link retraining and width
changes will be handled rather transparently if the downstream port
defers to mirroring the link status of the connected endpoint.  I'll
try to look specifically at each interaction for compliance, but if
you have any specific things you think are going to be troublesome,
please let me know.
> > Next I think we need to look at how and when we do virtual link
> > negotiation.  We're mostly discussing a virtual link, so I think
> > negotiation is simply filling in the negotiated link speed and width
> > with the highest common factor between endpoint and upstream port.
> > For assigned devices, this should match the endpoint's existing
> > negotiated link parameters, however, devices can dynamically change
> > their link speed (perhaps also width?), so I believe a current link
> > speed of 2.5GT/s could upshift to 8GT/s without any sort of visible
> > renegotiation.  Does this mean that we should have link parameter
> > callbacks from downstream port to endpoint?  Or maybe the downstream
> > port link status register should effectively be an alias for LNKSTA
> > of devfn 00.0 of the downstream device when it exists.  We only need
> > to report a consistent link status value when someone looks at it,
> > so reading directly from the endpoint probably makes more sense than
> > any sort of interface to keep the value current.
> 
> Don't we need to reflect the physical downstream link speed somehow
> though?

The negotiated physical downstream port speed and width must match the
endpoint's speed and width, so I think the only concern here is that
we might have a mismatch of capabilities, right?  I'm not sure we have
an alternative though.  If the root port capabilities need to match
the physical device, then we've essentially precluded hotplug, unless
we're going to suggest that we always hot-add a matching root port,
into which we'll then hot-add the assigned device.  Therefore I
favored the approach of simply over-spec'ing the virtual devices, and
I think there are physical precedents for this as well.  For example,
there exists a range of passive adapter and expansion devices for PCIe
which can change the width and may also restrict the speed.  A x16
endpoint may only negotiate a x1 width, even though both the endpoint
and slot are x16 capable, if one of these[1] is interposed between
them.  The link speed may be similarly restricted with one of these[2].

[1] https://www.amazon.com/gp/product/B0039XPS5W/
[2] https://www.amazon.com/Laptop-External-PCI-Graphics-Card/dp/B00Q4VMLF6

In the scheme I propose, the user would have the ability to set the
root port to speeds and widths that match the physical device, but the
default case would be to effectively over-provision the virtual device.
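For reference, the register encoding I have in mind for those defaults
would look something like the untested sketch below: existing machine
types keep the legacy 2.5GT/s, x1 values while something like
pc-q35-3.2 could advertise 16GT/s, x32 on the generic ports.  The
helper names are made up; the field layouts are from the spec.

/* Untested sketch -- encode_lnkcap()/encode_lnkcap2() are invented
 * helpers just to show the register encoding; the field layouts are
 * from the PCIe spec (LNKCAP Max Link Speed [3:0], Max Link Width
 * [9:4]; LNKCAP2 Supported Link Speeds Vector [7:1]). */
#include <stdint.h>
#include <stdio.h>

/* Speed encodings: 1 = 2.5GT/s, 2 = 5GT/s, 3 = 8GT/s, 4 = 16GT/s */
enum { SPEED_2_5GT = 1, SPEED_5GT = 2, SPEED_8GT = 3, SPEED_16GT = 4 };

static uint32_t encode_lnkcap(unsigned max_speed, unsigned max_width)
{
    return max_speed | (max_width << 4);
}

static uint32_t encode_lnkcap2(unsigned max_speed)
{
    /* Advertise every discrete speed up to and including max_speed */
    return ((1u << max_speed) - 1) << 1;
}

int main(void)
{
    /* Existing machine types / ioh3420: stay at the "legacy" 2.5GT/s, x1 */
    printf("legacy: LNKCAP %#x LNKCAP2 %#x\n",
           encode_lnkcap(SPEED_2_5GT, 1), encode_lnkcap2(SPEED_2_5GT));

    /* Hypothetical pc-q35-3.2 default for generic ports: 16GT/s, x32 */
    printf("pcie4:  LNKCAP %#x LNKCAP2 %#x\n",
           encode_lnkcap(SPEED_16GT, 32), encode_lnkcap2(SPEED_16GT));
    return 0;
}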
> > If we take the above approach with LNKSTA (probably also LNKSTA2),
> > is any sort of "negotiation" required?  We're automatically
> > negotiated if the capabilities of the upstream port are a superset
> > of the endpoint's capabilities.  What do we do and what do we care
> > about when the upstream port is a subset of the endpoint though?
> > For example, an 8GT/s, x16 endpoint is installed into a 2.5GT/s, x1
> > downstream port.  On real hardware we obviously negotiate the
> > endpoint down to the downstream port parameters.  We could do that
> > with an emulated device, but this is the scenario we have today with
> > assigned devices and we simply leave the inconsistency.  I don't
> > think we actually want to (and there would be lots of complications
> > to) force the physical device to negotiate down to match a virtual
> > downstream port.  Do we simply trigger a warning that this may
> > result in non-optimal performance and leave the inconsistency?
> 
> Also when the guest pokes at the width do we need to tweak the
> physical device/downstream port?

How does the guest poke at the width?  The guest can set a target link
speed in LNKCTL2, but the width seems to be negotiated only at the
physical level.  AIUI the standard procedure would be for a driver to
set a target link speed and then retrain the link from the downstream
port to implement that request.  The retraining may or may not achieve
the target link speed, and retraining is transparent to the data layer
of the link, therefore it seems safe to do nothing on retraining, but
we may want to consider it as a future enhancement.

Physical link retraining also gets us into scenarios where we need to
think about multifunction endpoints and the ownership of the endpoints
affected by a link retraining.  It might be like the bus reset
support, where the user needs to own all the affected devices to
initiate a link retraining.  I don't think anything here precludes
that; it'd be an additional callback from the downstream port to the
devfn 00.0 endpoint to initiate a retraining, and vfio would need to
figure out what it can do with that.

Thanks,
Alex
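P.S. For what it's worth, a rough and completely untested sketch of
the "do nothing on retraining" behavior above, just to make it
concrete.  The struct and function names are invented for
illustration; the bit positions are from the spec, and the vfio
retraining callback would be the future enhancement mentioned above.

/* Rough, untested sketch of the "do nothing on retraining" behavior.
 * The vport struct and function names are invented for illustration;
 * the bit positions are from the PCIe spec. */
#include <stdint.h>
#include <stdio.h>

#define LNKCTL_RETRAIN    (1u << 5)    /* Link Control: Retrain Link */
#define LNKSTA_TRAINING   (1u << 11)   /* Link Status: Link Training */

struct vport {
    uint16_t lnkctl;
    uint16_t lnkctl2;
    uint16_t lnksta;
    uint16_t (*read_endpoint_lnksta)(void);  /* devfn 00.0 LNKSTA */
};

static void vport_write_lnkctl(struct vport *p, uint16_t val)
{
    /* Retrain Link reads back as zero and we never report training in
     * progress, so the guest's retrain completes "instantly"; we just
     * re-sync our status from the endpoint. */
    p->lnkctl = val & ~LNKCTL_RETRAIN;
    if (val & LNKCTL_RETRAIN) {
        p->lnksta = p->read_endpoint_lnksta() & ~LNKSTA_TRAINING;
    }
}

static void vport_write_lnkctl2(struct vport *p, uint16_t val)
{
    p->lnkctl2 = val;   /* accept the target link speed, don't act on it */
}

/* Stand-in for reading the assigned endpoint: 8GT/s, x16 */
static uint16_t fake_endpoint_lnksta(void) { return 3 | (16 << 4); }

int main(void)
{
    struct vport p = { .read_endpoint_lnksta = fake_endpoint_lnksta };

    vport_write_lnkctl2(&p, 3);              /* guest targets 8GT/s */
    vport_write_lnkctl(&p, LNKCTL_RETRAIN);  /* guest retrains the link */
    printf("LNKSTA after retrain: %#x\n", p.lnksta);  /* 0x103 */
    return 0;
}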