Three issues exist in the error paths of rocket_job_run():
1) dma_fence reference leak: After creating a fence, the function takes
an extra reference for job->done_fence via dma_fence_get(), but the
error paths return without releasing it. The leaked reference keeps
the fence from being freed, so fences accumulate with every failed
submission.
2) pm_runtime_get_sync() usage counter leak: pm_runtime_get_sync()
increments the runtime PM usage counter before attempting to resume
the device. If the resume fails and returns an error, the usage
counter remains incremented. The original error path does not call
pm_runtime_put_noidle() to balance it. Because the PM core does not
runtime-suspend a device whose usage counter is non-zero, even one such
failure permanently keeps the NPU from suspending.
3) Unsignaled fence returned on failure: The error paths return a valid
but unsignaled dma_fence to the DRM scheduler. Since the job was never
submitted to the hardware, nothing ever signals the fence. When the scheduler
eventually drops its reference, dma_fence_release() detects the
unsignaled fence and triggers:
WARN(1, "Fence ... released with pending signals!")
and forcibly signals it with -EDEADLK.
Fix all three issues as follows:
- Replace pm_runtime_get_sync() with pm_runtime_resume_and_get(), which
drops the usage counter again on failure, so no manual
pm_runtime_put_noidle() call is needed and the counter cannot leak.
The pm_runtime_get_sync() kernel-doc itself suggests using
pm_runtime_resume_and_get() instead, especially when the caller checks
the return value.
- Release both fence references (job->done_fence and the local fence),
clear job->done_fence, and return ERR_PTR(ret) so the DRM scheduler
cleanly aborts the job with that error without triggering the
unsignaled fence WARN.
- Add pm_runtime_put() on the iommu_attach_group() error path to release
the runtime PM reference acquired just before it.
Cc: [email protected]
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <[email protected]>
---
drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
 	dma_fence_put(job->done_fence);
 	job->done_fence = dma_fence_get(fence);
 
-	ret = pm_runtime_get_sync(core->dev);
-	if (ret < 0)
-		return fence;
+	ret = pm_runtime_resume_and_get(core->dev);
+	if (ret < 0) {
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-	if (ret < 0)
-		return fence;
+	if (ret < 0) {
+		pm_runtime_put(core->dev);
+		dma_fence_put(job->done_fence);
+		job->done_fence = NULL;
+		dma_fence_put(fence);
+		return ERR_PTR(ret);
+	}
 
 	scoped_guard(mutex, &core->job_lock) {
 		core->in_flight_job = job;
--
2.20.1