Thank you for your contribution! Sashiko AI review found 2 potential issue(s) 
to consider:

Pre-existing issues:
- [High] The `pm_runtime_get_sync()` error path fails to decrement the runtime 
PM usage counter, permanently leaking the reference.
- [High] Returning an unsignaled dma_fence instead of an error pointer on job 
submission failure triggers a kernel WARN.
--

commit f0df2d483139ad216e9d43923cbf9543a8f4e81c
Author: ZhaoJinming <[email protected]>

accel/rocket: Fix dma_fence reference leak on error paths in rocket_job_run()

In rocket_job_run(), after creating a fence at line 305 and taking an
extra reference for job->done_fence at line 311 via dma_fence_get(),
two error paths fail to release that extra reference before returning.

> diff --git a/drivers/accel/rocket/rocket_job.c 
> b/drivers/accel/rocket/rocket_job.c
> index 2f1861f960ccb..fc10703534e37 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -311,12 +311,19 @@ static struct dma_fence *rocket_job_run(struct 
> drm_sched_job *sched_job)
>       job->done_fence = dma_fence_get(fence);
>  
>       ret = pm_runtime_get_sync(core->dev);
> -     if (ret < 0)
> +     if (ret < 0) {
> +             dma_fence_put(job->done_fence);
> +             job->done_fence = NULL;

[Severity: High]
This is a pre-existing issue, but does this error path permanently leak the
runtime PM usage counter? 

pm_runtime_get_sync() increments the device's usage counter before attempting
to resume the device. If the resume fails, it returns a negative error code
but the usage counter remains incremented. 

Without explicitly dropping this usage counter (e.g., using
pm_runtime_put_noidle()), could repeated failures prevent the NPU from ever
suspending again?

>               return fence;

[Severity: High]
This is also a pre-existing issue, but will returning the valid, unsignaled
fence instead of ERR_PTR(ret) trigger a kernel WARN?

Since the hardware wasn't submitted, the fence refcount immediately hits 0
after the DRM scheduler drops its initial reference. dma_fence_release()
detects pending callbacks on an unsignaled fence and triggers a loud kernel
WARN(1, "Fence ... released with pending signals!\n"), forcibly signaling it
with -EDEADLK.

Should this path destroy the fence and return ERR_PTR(ret) so the scheduler
cleanly aborts the job?

> +     }
>  
>       ret = iommu_attach_group(job->domain->domain, core->iommu_group);
> -     if (ret < 0)
> +     if (ret < 0) {
> +             pm_runtime_put(core->dev);
> +             dma_fence_put(job->done_fence);
> +             job->done_fence = NULL;
>               return fence;

[Severity: High]
This is a pre-existing issue, but similar to the above, does returning the
unsignaled fence here instead of ERR_PTR(ret) cause the same DRM scheduler
WARN when the fence is released?

> +     }

-- 
Sashiko AI review ยท 
https://sashiko.dev/#/patchset/[email protected]?part=1

Reply via email to