Pre-existing issues:
- [High] Deferred GEM object freeing accesses destroyed IOMMU domain
structures, causing a Use-After-Free.
- [High] Deferred GEM object freeing calls flush_workqueue() on a destroyed
workqueue pointer.
--
commit ec7bb08ec6d7a0837cd580d24a3f76e7ed06a6c6
Author: Lizhi Hou <[email protected]>
accel/amdxdna: Fix amdxdna_client lifetime race during device removal
This patch changes the client cleanup logic during device removal.
It decouples hardware teardown from client cleanup to avoid a use-after-free
if amdxdna_drm_close() is called concurrently with amdxdna_remove().
diff --git a/drivers/accel/amdxdna/amdxdna_pci_drv.c
b/drivers/accel/amdxdna/amdxdna_pci_drv.c
index c677293c1ae7f..3016db07c7cad 100644
--- a/drivers/accel/amdxdna/amdxdna_pci_drv.c
+++ b/drivers/accel/amdxdna/amdxdna_pci_drv.c
@@ -167,18 +167,12 @@ static void amdxdna_drm_close(struct drm_device *ddev,
struct drm_file *filp)
{
struct amdxdna_client *client = filp->driver_priv;
struct amdxdna_dev *xdna = to_xdna_dev(ddev);
- int idx;
XDNA_DBG(xdna, "closing pid %d", client->pid);
- if (!drm_dev_enter(&xdna->ddev, &idx))
- return;
-
mutex_lock(&xdna->dev_lock);
amdxdna_client_cleanup(client);
[Severity: High]
Since amdxdna_client_cleanup() is now deferred until file close, could this
trigger a NULL pointer dereference during SVA teardown?
When a device is removed, the PCI subsystem unlinks it from its IOMMU group.
Later, when this file is closed, amdxdna_sva_fini() calls
iommu_sva_unbind_device().
Inside the IOMMU core, iommu_detach_device_pasid() dereferences the group
pointer:
struct iommu_group *group = dev->iommu_group;
mutex_lock(&group->mutex);
If the device was already removed, dev->iommu_group would be NULL, causing a
crash. Does the SVA unbind need to happen earlier during device removal?
mutex_unlock(&xdna->dev_lock);
-
- drm_dev_exit(idx);
}
[ ... ]
@@ -436,14 +430,8 @@ static void amdxdna_remove(struct pci_dev *pdev)
amdxdna_sysfs_fini(xdna);
mutex_lock(&xdna->dev_lock);
- client = list_first_entry_or_null(&xdna->client_list,
- struct amdxdna_client, node);
- while (client) {
- amdxdna_client_cleanup(client);
-
- client = list_first_entry_or_null(&xdna->client_list,
- struct amdxdna_client, node);
- }
+ list_for_each_entry(client, &xdna->client_list, node)
[Severity: High]
Does this list_for_each_entry() loop have a use-after-free risk if the lock
is dropped internally?
The loop calls amdxdna_hwctx_remove_all(), which invokes aie2_hwctx_fini().
Inside aie2_hwctx_fini(), xdna->dev_lock is temporarily dropped to wait for
hardware idle:
mutex_unlock(&xdna->dev_lock);
Since this patch removes the drm_dev_enter() safeguard from
amdxdna_drm_close(), a concurrent file close can acquire dev_lock while it
is dropped here. The concurrent close would remove the client from the list
via list_del() and free the client structure.
When the lock is re-acquired and list_for_each_entry() attempts to fetch the
next element, won't it dereference the freed client->node.next pointer?
+ amdxdna_hwctx_remove_all(client);
xdna->dev_info->ops->fini(xdna);
mutex_unlock(&xdna->dev_lock);
[Severity: High]
This is a pre-existing issue, but does this function leave xdna->domain
dangling, leading to a use-after-free on deferred file close?
During device removal, amdxdna_remove() calls amdxdna_iommu_fini(), which
frees the domain but leaves the pointer intact:
iommu_domain_free(xdna->domain);
If the application keeps the file descriptor open, drm_gem_object_put() will
later invoke amdxdna_gem_obj_free() during cleanup. This calls
amdxdna_dma_unmap_bo():
if (amdxdna_iova_on(xdna)) {
size = iova_align(&xdna->iovad, abo->mem.size);
iommu_unmap(xdna->domain, abo->mem.dma_addr, size);
Since xdna->domain is not NULL, amdxdna_iova_on() returns true, and
iommu_unmap() is called on the freed domain. Should the domain pointer
be set to NULL after being freed?
[Severity: High]
This is a pre-existing issue, but could deferred GEM cleanup trigger a
use-after-free on the notifier workqueue?
Earlier in amdxdna_remove(), destroy_workqueue(xdna->notifier_wq) frees the
workqueue structure.
When a process retains an open file descriptor past device removal, the DRM
core defers GEM cleanup to the eventual file close. When
drm_gem_object_put() calls amdxdna_gem_obj_free(), it unconditionally does:
flush_workqueue(xdna->notifier_wq);
Since the workqueue was already destroyed and freed, won't this result in
a use-after-free crash?