Three issues exist in the error paths of rocket_job_run():

1) dma_fence reference leak: After creating a fence and taking an extra
   reference for job->done_fence via dma_fence_get(), the error paths
   return without releasing the extra reference held by job->done_fence.
   The leaked reference prevents the fence from being freed, causing
   resource accumulation on repeated failures.

2) pm_runtime_get_sync() usage counter leak: pm_runtime_get_sync()
   increments the runtime PM usage counter before attempting to resume
   the device. If the resume fails and returns an error, the usage
   counter remains incremented. The current error path does not call
   pm_runtime_put_noidle() to balance it, so every failed resume leaves
   the counter elevated and keeps the NPU from runtime suspending.

3) Unsignaled fence returned on failure: The error paths return a valid
   but unsignaled dma_fence to the DRM scheduler. Since the job was
   never submitted to the hardware, the fence is never signaled. When
   the last reference is dropped, dma_fence_release() can find the
   fence unsignaled with callbacks still pending, in which case it
   warns:
     "Fence ... released with pending signals!"
   and forcibly signals the fence with -EDEADLK.
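
To illustrate issue 2, here is a minimal userspace sketch of the usage
counter behaviour described above. It is a mock, not kernel code: struct
mock_dev and the mock_*() helpers are hypothetical names that only model
the documented contracts of pm_runtime_get_sync(),
pm_runtime_put_noidle() and pm_runtime_resume_and_get().

```c
#include <errno.h>

/* Userspace mock, NOT the kernel implementation. */
struct mock_dev {
	int usage_count;  /* stands in for the runtime PM usage counter */
	int resume_error; /* 0 = resume succeeds, negative errno = fails */
};

/*
 * Models pm_runtime_get_sync(): the counter is incremented first and
 * stays incremented even when the resume fails.
 */
static int mock_get_sync(struct mock_dev *dev)
{
	dev->usage_count++;
	return dev->resume_error;
}

/* Models pm_runtime_put_noidle(): drop the counter, nothing else. */
static void mock_put_noidle(struct mock_dev *dev)
{
	dev->usage_count--;
}

/*
 * Models pm_runtime_resume_and_get(): on failure the counter is
 * dropped again before the error is returned to the caller.
 */
static int mock_resume_and_get(struct mock_dev *dev)
{
	int ret = mock_get_sync(dev);

	if (ret < 0) {
		mock_put_noidle(dev);
		return ret;
	}
	return 0;
}
```

With a failing resume, mock_get_sync() leaves usage_count at 1 unless
the caller balances it, while mock_resume_and_get() leaves it at 0.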

Fix all three issues as follows:

- Replace pm_runtime_get_sync() with pm_runtime_resume_and_get(), which
  automatically decrements the usage counter on failure, eliminating
  the need for a manual pm_runtime_put_noidle() call and avoiding the
  usage counter leak. The pm_runtime_get_sync() documentation itself
  recommends pm_runtime_resume_and_get() as the preferred alternative
  when the return value is checked by the caller.

- Release both fence references (job->done_fence and the local fence)
  before returning ERR_PTR(ret) so the DRM scheduler cleanly aborts
  the job without triggering the unsignaled fence WARN.

- Add pm_runtime_put() on the iommu_attach_group() error path to
  release the runtime PM reference that was successfully acquired.
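
The unwind order described in the list above can be sketched in
userspace as follows. This is only an illustration under stated
assumptions: the mock_* types and helpers are hypothetical, a plain
negative errno return stands in for ERR_PTR(ret), and the sketch uses
goto labels where the patch itself open-codes the cleanup in each
branch.

```c
#include <errno.h>
#include <stddef.h>

/* Userspace mocks, NOT the kernel types. */
struct mock_fence { int refcount; };
struct mock_job   { struct mock_fence *done_fence; };
struct mock_core  {
	int pm_usage;     /* models the runtime PM usage counter */
	int resume_error; /* 0 or negative errno for the resume step */
	int attach_error; /* 0 or negative errno for the attach step */
};

static struct mock_fence *mock_fence_get(struct mock_fence *f)
{
	f->refcount++;
	return f;
}

static void mock_fence_put(struct mock_fence *f)
{
	if (f)
		f->refcount--;
}

/* 'fence' arrives holding one reference owned by this function. */
static int mock_job_run(struct mock_core *core, struct mock_job *job,
			struct mock_fence *fence)
{
	int ret;

	job->done_fence = mock_fence_get(fence);

	/* resume_and_get semantics: counter untouched on failure */
	ret = core->resume_error;
	if (ret < 0)
		goto err_put_fences;
	core->pm_usage++;

	ret = core->attach_error;
	if (ret < 0)
		goto err_put_pm;

	return 0;

err_put_pm:
	core->pm_usage--;                /* balances the successful get */
err_put_fences:
	mock_fence_put(job->done_fence); /* reference held by the job */
	job->done_fence = NULL;
	mock_fence_put(fence);           /* local creation reference */
	return ret;
}
```

On either failure both fence references are dropped and the PM counter
returns to its starting value; on success the job and the caller each
keep one fence reference.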

Cc: [email protected]
Fixes: 0810d5ad88a1 ("accel/rocket: Add job submission IOCTL")
Signed-off-by: ZhaoJinming <[email protected]>
---
 drivers/accel/rocket/rocket_job.c | 19 ++++++++++++++-----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/accel/rocket/rocket_job.c b/drivers/accel/rocket/rocket_job.c
index ac51bff39833..e8a073e22ac2 100644
--- a/drivers/accel/rocket/rocket_job.c
+++ b/drivers/accel/rocket/rocket_job.c
@@ -310,13 +310,22 @@ static struct dma_fence *rocket_job_run(struct drm_sched_job *sched_job)
                dma_fence_put(job->done_fence);
        job->done_fence = dma_fence_get(fence);
 
-       ret = pm_runtime_get_sync(core->dev);
-       if (ret < 0)
-               return fence;
+       ret = pm_runtime_resume_and_get(core->dev);
+       if (ret < 0) {
+               dma_fence_put(job->done_fence);
+               job->done_fence = NULL;
+               dma_fence_put(fence);
+               return ERR_PTR(ret);
+       }
 
        ret = iommu_attach_group(job->domain->domain, core->iommu_group);
-       if (ret < 0)
-               return fence;
+       if (ret < 0) {
+               pm_runtime_put(core->dev);
+               dma_fence_put(job->done_fence);
+               job->done_fence = NULL;
+               dma_fence_put(fence);
+               return ERR_PTR(ret);
+       }
 
        scoped_guard(mutex, &core->job_lock) {
                core->in_flight_job = job;
-- 
2.20.1
