Thank you for your contribution! Sashiko AI review found 4 potential issue(s) 
to consider:

New issues:
- [High] Mismatched and strict return value checks for 
`pm_runtime_get_if_active()` break IRQ handling when Runtime PM is disabled, 
causing job hangs.

Pre-existing issues:
- [High] The runtime suspend callback powers down hardware without calling 
`synchronize_irq()`, violating shared IRQ rules.
- [High] Unhandled hardware DMA errors result in an unacknowledged interrupt, 
causing an immediate IRQ storm.
- [High] Calling `iommu_detach_group()` and `iommu_attach_group()` with a NULL 
group pointer causes a kernel panic if the device is not bound to an IOMMU.
--

commit 1a0ceb9d138ab6f44b71efac6f2c376438bd8d22
Author: ZhaoJinming <[email protected]>

accel/rocket: Fix iommu_group leak and unsafe IRQ register access

This commit fixes a reference leak of the iommu_group in the IRQ handler
by reusing the cached core->iommu_group. It also fixes unsafe hardware register
accesses in the shared IRQ handler by adding runtime PM checks to verify the
device is active before reading registers.

> diff --git a/drivers/accel/rocket/rocket_job.c 
> b/drivers/accel/rocket/rocket_job.c
> index cf9f61682dd8b..81b222198dac9 100644
> --- a/drivers/accel/rocket/rocket_job.c
> +++ b/drivers/accel/rocket/rocket_job.c
> @@ -349,7 +349,7 @@ static void rocket_job_handle_irq(struct rocket_core 
> *core)
>                               return;
>                       }
>  
> -                     iommu_detach_group(NULL, iommu_group_get(core->dev));
> +                     iommu_detach_group(NULL, core->iommu_group);

[Severity: High]
This is a pre-existing issue, but calling iommu_detach_group() and
iommu_attach_group() with a NULL group pointer causes a kernel panic if
the device is not bound to an IOMMU.

If the device is not behind an IOMMU, iommu_group_get() returns NULL
during initialization.

If CONFIG_IOMMU_API is enabled, both functions unconditionally
dereference the group pointer to acquire its lock, which triggers a NULL
pointer dereference crash.

Does this path also need to ensure core->iommu_group is not NULL before
detaching, and similarly during job submission?

>                       dma_fence_signal(core->in_flight_job->done_fence);
>                       pm_runtime_put_autosuspend(core->dev);
>                       core->in_flight_job = NULL;

[ ... ]

> @@ -420,7 +420,10 @@ static irqreturn_t rocket_job_irq_handler_thread(int 
> irq, void *data)
>  {
>       struct rocket_core *core = data;
>  
> -     rocket_job_handle_irq(core);
> +     if (pm_runtime_get_if_active(core->dev) == 1) {
> +             rocket_job_handle_irq(core);
> +             pm_runtime_put(core->dev);
> +     }

[Severity: High]
When CONFIG_PM is disabled or runtime PM is globally disabled for the
device, pm_runtime_get_if_active() returns -EINVAL.

Because -EINVAL != 1, the condition evaluates to false, silently
skipping rocket_job_handle_irq() entirely.

Since the threaded handler skips execution, the completed job is never
processed, the IOMMU group is not detached, the DMA fence is never
signaled, and waiting user processes hang permanently.

Could this strict check cause a regression in IRQ handling when Runtime
PM is disabled?

>  
>       return IRQ_HANDLED;
>  }
> @@ -428,16 +431,28 @@ static irqreturn_t rocket_job_irq_handler_thread(int 
> irq, void *data)
>  static irqreturn_t rocket_job_irq_handler(int irq, void *data)
>  {
>       struct rocket_core *core = data;
> -     u32 raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
> +     u32 raw_status;
> +     int ret;
> +
> +     ret = pm_runtime_get_if_active(core->dev);
> +     if (ret == 0)
> +             return IRQ_NONE;
> +
> +     raw_status = rocket_pc_readl(core, INTERRUPT_RAW_STATUS);
>  
>       WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR);
>       WARN_ON(raw_status & PC_INTERRUPT_RAW_STATUS_DMA_WRITE_ERROR);
>  
>       if (!(raw_status & PC_INTERRUPT_RAW_STATUS_DPU_0 ||
> -           raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1))
> +           raw_status & PC_INTERRUPT_RAW_STATUS_DPU_1)) {
> +             if (ret > 0)
> +                     pm_runtime_put(core->dev);
>               return IRQ_NONE;
> +     }

[Severity: High]
This is a pre-existing issue, but unhandled hardware DMA errors result
in an unacknowledged interrupt, causing an immediate IRQ storm.

If the hardware triggers an interrupt with
PC_INTERRUPT_RAW_STATUS_DMA_READ_ERROR set, but the DPU completion bits
are not set, the condition !(raw_status & DPU) evaluates as true.

It executes the WARN_ON() and immediately returns IRQ_NONE without
writing to INTERRUPT_CLEAR or INTERRUPT_MASK.

Because the shared IRQ line is level-triggered and the hardware is still
asserting the unacknowledged error, the interrupt controller will
immediately re-invoke the handler as soon as it returns.

Will this create an infinite IRQ storm that permanently hogs the CPU
core until the kernel's spurious interrupt detector forcibly disables
the shared IRQ line?

>  
>       rocket_pc_writel(core, INTERRUPT_MASK, 0x0);
> +     if (ret > 0)
> +             pm_runtime_put(core->dev);
>  
>       return IRQ_WAKE_THREAD;
>  }

[Severity: High]
This is a pre-existing issue, but the runtime suspend callback powers
down hardware without calling synchronize_irq(), violating shared IRQ
rules.

The global subsystem guide explicitly states: "When devices use shared
IRQs (IRQF_SHARED), the runtime suspend callback must call
synchronize_irq() before powering down hardware to ensure no IRQ handler
is executing mid-flight."

The device requests its interrupt with IRQF_SHARED, but its runtime
suspend callback disables the device's clocks without waiting for
executing IRQ handlers to finish:

drivers/accel/rocket/rocket_drv.c:rocket_device_runtime_suspend() {
    ...
    clk_bulk_disable_unprepare(ARRAY_SIZE(rdev->cores[core].clks),
                               rdev->cores[core].clks);
    ...
}

Might this lead to mid-flight handlers reading powered-down hardware
registers, causing bus faults?

-- 
Sashiko AI review · 
https://sashiko.dev/#/patchset/[email protected]?part=2

Reply via email to