@ahasenack I checked internally: the issue was observed because the
test was carried out on the migration source host, whereas it needs to
be performed on the destination (which is what we did to verify it).
Admittedly the test plan did not make that clear, so I am updating it to
reflect that.

** Description changed:

  [Impact]
  
  Nova suffers from a race condition when it does live migrations of vms
  with SRIOV ports whereby a pre-check of available ports and their
  capabilities can error if one or more ports becomes unavailable during
  the check. The fix backported here simply ignores libvirt errors when
  checking device capabilities, so any device that throws an error is
  skipped.
  
  [Test Plan]
  
  Since the bug is a race condition it can be hard to reproduce but a
  succession of live migrations between SRIOV capable nodes with a
  reasonably large quantity of VFs should be a reasonable test.
  
  * deploy OpenStack Yoga with SRIOV capable hardware
  * create 10 vms with e.g. 5 sriov ports
- * live migrate the vms between the hosts and check for the Traceback in 
/var/log/nova/nova-compute.log
+ * live migrate the vms between the hosts and check for the Traceback in 
/var/log/nova/nova-compute.log on the migration destination host (which must be 
running with this patch)
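
The log check in the last step can be scripted. The following is a minimal sketch, assuming the error text matches the traceback quoted further down in this report; the helper names `find_race_errors` and `scan_log` are illustrative and not part of nova.

```python
import re

# Pattern taken from the traceback quoted in this bug report.
ERROR_RE = re.compile(r"libvirt\.libvirtError: Node device not found")


def find_race_errors(lines):
    """Return the log lines that show the node-device race."""
    return [line.rstrip("\n") for line in lines if ERROR_RE.search(line)]


def scan_log(path="/var/log/nova/nova-compute.log"):
    """Scan a nova-compute log file on the migration destination host."""
    with open(path, errors="replace") as log:
        return find_race_errors(log)
```

Running `scan_log()` on the destination host after the migrations should return an empty list if the patch is effective.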
  
  [Regression Potential]
  This patch is not anticipated to introduce any regressions.
  -------------------------------------------------
  
  At the moment, the `_get_pci_passthrough_devices` function is prone to
  race conditions.
  
  This specific code calls `listCaps()`; however, it is possible that
  the device has disappeared by the time the method is called:
  
  
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L7949-L7959
  
  Which would result in the following traceback:
  
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager 
[req-51b7c1c4-2b4a-46cc-9baa-8bf61801c48d - - - - -] Error updating resources 
for node <snip>.: libvirt.libvirtError: Node device not found: no node device 
with matching name 'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager Traceback (most 
recent call last):
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/compute/manager.py", line 
9946, in _update_available_resource_for_node
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     
self.rt.update_available_resource(context, nodename,
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/compute/resource_tracker.py",
 line 879, in update_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     resources = 
self.driver.get_available_resource(nodename)
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 8937, in get_available_resource
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     
data['pci_passthrough_devices'] = self._get_pci_passthrough_devices()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 7663, in _get_pci_passthrough_devices
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     vdpa_devs = [
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/nova/virt/libvirt/driver.py", 
line 7664, in <listcomp>
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     dev for dev in 
devices.values() if "vdpa" in dev.listCaps()
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager   File 
"/var/lib/openstack/lib/python3.8/site-packages/libvirt.py", line 6276, in 
listCaps
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager     raise 
libvirtError('virNodeDeviceListCaps() failed')
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager 
libvirt.libvirtError: Node device not found: no node device with matching name 
'net_tap8b08ec90_e5_fe_16_3e_0f_0a_d4'
  2022-05-06 20:16:16.110 4053032 ERROR nova.compute.manager
  
  I think the cleaner way is to loop over all the items and skip a device
  if it raises an error that the device is not found.
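
The loop-and-skip approach suggested above might look roughly like the sketch below. It is not the actual nova patch: `NodeDeviceNotFound` and `FakeDevice` are hypothetical stand-ins so the example runs without libvirt, and `filter_devices_by_cap` is an illustrative name. With the real bindings one would presumably catch `libvirt.libvirtError` and inspect its error code instead.

```python
class NodeDeviceNotFound(Exception):
    """Hypothetical stand-in for libvirt's 'Node device not found' error."""


class FakeDevice:
    """Hypothetical stand-in for a libvirt node device object."""

    def __init__(self, name, caps=None, gone=False):
        self.name = name
        self._caps = caps or []
        self._gone = gone

    def listCaps(self):
        # Simulate the device disappearing between enumeration and this call.
        if self._gone:
            raise NodeDeviceNotFound(
                "no node device with matching name '%s'" % self.name)
        return list(self._caps)


def filter_devices_by_cap(devices, cap):
    """Return devices advertising `cap`, skipping any that have vanished."""
    matching = []
    for dev in devices.values():
        try:
            caps = dev.listCaps()
        except NodeDeviceNotFound:
            # The device was removed after it was listed; ignore it.
            continue
        if cap in caps:
            matching.append(dev)
    return matching


devices = {
    "pci_0000_01_00_0": FakeDevice("pci_0000_01_00_0", ["pci", "vdpa"]),
    "net_tap0": FakeDevice("net_tap0", gone=True),
    "pci_0000_02_00_0": FakeDevice("pci_0000_02_00_0", ["pci"]),
}
vdpa_devs = filter_devices_by_cap(devices, "vdpa")
print([dev.name for dev in vdpa_devs])
```

Unlike the list comprehension in the traceback, one vanished device no longer aborts the whole resource update; only that device is left out of the result.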

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1972028

Title:
  [SRU] _get_pci_passthrough_devices prone to race condition

To manage notifications about this bug go to:
https://bugs.launchpad.net/cloud-archive/+bug/1972028/+subscriptions

