On Tue, Oct 21, 2025 at 4:43 PM Matthew Brost <[email protected]> wrote:
>
> On Sat, Oct 18, 2025 at 12:42:30AM -0700, Matthew Brost wrote:
> > On Fri, Oct 17, 2025 at 11:43:51PM -0700, Matthew Brost wrote:
> > > On Fri, Oct 17, 2025 at 10:37:46AM -0500, Rob Herring wrote:
> > > > On Thu, Oct 16, 2025 at 11:25:34PM -0700, Matthew Brost wrote:
> > > > > On Thu, Oct 16, 2025 at 04:06:05PM -0500, Rob Herring (Arm) wrote:
> > > > > > Add a driver for Arm Ethos-U65/U85 NPUs. The Ethos-U NPU has a
> > > > > > relatively simple interface with a single command stream to
> > > > > > describe buffers, operation settings, and network operations.
> > > > > > It supports up to 8 memory regions (though there are no h/w
> > > > > > bounds on a region). The Ethos NPUs are designed to use an
> > > > > > SRAM for scratch memory. Region 2 is reserved for SRAM (like
> > > > > > the downstream driver stack and compiler). Userspace doesn't
> > > > > > need access to the SRAM.
> > > >
> > > > Thanks for the review.
> > > >
> > > > [...]
> > > >
> > > > > > +static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
> > > > > > +{
> > > > > > +	struct ethosu_job *job = to_ethosu_job(sched_job);
> > > > > > +	struct ethosu_device *dev = job->dev;
> > > > > > +	struct dma_fence *fence = NULL;
> > > > > > +	int ret;
> > > > > > +
> > > > > > +	if (unlikely(job->base.s_fence->finished.error))
> > > > > > +		return NULL;
> > > > > > +
> > > > > > +	fence = ethosu_fence_create(dev);
> > > > >
> > > > > Another reclaim issue: ethosu_fence_create allocates memory using
> > > > > GFP_KERNEL. Since we're already in the DMA fence signaling path
> > > > > (reclaim), this can lead to a deadlock.
> > > > >
> > > > > Without too much thought, you likely want to move this allocation
> > > > > to ethosu_job_do_push, but before taking dev->sched_lock or
> > > > > calling drm_sched_job_arm.
> > > > >
> > > > > We really should fix the DRM scheduler work queue to be tainted
> > > > > with reclaim. If I recall correctly, we'd need to update the work
> > > > > queue layer. Let me look into that—I've seen this type of bug
> > > > > several times, and lockdep should be able to catch it.
> > > >
> > > > Likely the rocket driver suffers from the same issues...
> > >
> > > I am not surprised by this statement.
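On the fence allocation: agreed. The fix is roughly the following
(illustrative sketch, not the actual v6 code; anything not in the
quoted snippet, including the push helper's internals, is made up):

static int ethosu_job_do_push(struct ethosu_job *job)
{
	struct ethosu_device *dev = job->dev;
	int ret;

	/*
	 * Allocate the done fence in process context, before any
	 * scheduler lock is taken and before drm_sched_job_arm() puts
	 * the job on the DMA fence signaling path. GFP_KERNEL is safe
	 * here.
	 */
	job->done_fence = ethosu_fence_create(dev);
	if (IS_ERR(job->done_fence)) {
		ret = PTR_ERR(job->done_fence);
		job->done_fence = NULL;
		return ret;
	}

	mutex_lock(&dev->sched_lock);
	drm_sched_job_arm(&job->base);
	drm_sched_entity_push_job(&job->base);
	mutex_unlock(&dev->sched_lock);

	return 0;
}

static struct dma_fence *ethosu_job_run(struct drm_sched_job *sched_job)
{
	struct ethosu_job *job = to_ethosu_job(sched_job);

	if (unlikely(job->base.s_fence->finished.error))
		return NULL;

	/*
	 * No allocation here; just hand the scheduler a reference to
	 * the fence created at push time.
	 */
	return dma_fence_get(job->done_fence);
}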
> > > > > > +	if (IS_ERR(fence))
> > > > > > +		return fence;
> > > > > > +
> > > > > > +	if (job->done_fence)
> > > > > > +		dma_fence_put(job->done_fence);
> > > > > > +	job->done_fence = dma_fence_get(fence);
> > > > > > +
> > > > > > +	ret = pm_runtime_get_sync(dev->base.dev);
> > > > >
> > > > > I haven't looked at your PM design, but this generally looks
> > > > > quite dangerous with respect to reclaim. For example, if your
> > > > > PM resume paths allocate memory or take locks that allocate
> > > > > memory underneath, you're likely to run into issues.
> > > > >
> > > > > A better approach would be to attach a PM reference to your job
> > > > > upon creation and release it upon job destruction. That would
> > > > > be safer and save you headaches in the long run.
> > > >
> > > > Our PM is nothing more than clock enable/disable and register
> > > > init.
> > > >
> > > > If the runtime PM API doesn't work and needs special driver
> > > > wrappers, then I'm inclined to just not use it and manage clocks
> > > > directly (as that's all it is doing).
> > >
> > > Yes, then you're probably fine. More complex drivers can do all
> > > sorts of things during a PM wake, which is why PM wakes should
> > > generally be the outermost layer. I still suggest, to future-proof
> > > your code, that you move the PM reference to an outer layer.
> >
> > Also, taking a PM reference in a function call — as opposed to tying
> > it to an object's lifetime — is risky. It can quickly lead to
> > imbalances in PM references if things go sideways or function calls
> > become unbalanced. Depending on how your driver uses the DRM
> > scheduler, this seems like a real possibility.
> >
> > Matt
> >
> > > > > This is what we do in Xe [1] [2].
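Tying the PM reference to the job's lifetime is easy enough even when
PM is only clocks. An illustrative sketch (the create/free helpers
here are made-up names, not the driver's actual ones):

static struct ethosu_job *ethosu_job_create(struct ethosu_device *dev)
{
	struct ethosu_job *job;
	int ret;

	job = kzalloc(sizeof(*job), GFP_KERNEL);
	if (!job)
		return ERR_PTR(-ENOMEM);

	/*
	 * Wake the device in process context, where a resume path that
	 * allocates memory or takes locks cannot deadlock against fence
	 * signaling/reclaim. On failure the usage counter is dropped by
	 * the call itself, so no put is needed here.
	 */
	ret = pm_runtime_resume_and_get(dev->base.dev);
	if (ret) {
		kfree(job);
		return ERR_PTR(ret);
	}

	job->dev = dev;
	return job;
}

static void ethosu_job_free(struct ethosu_job *job)
{
	/*
	 * One put per job, when the job goes away, so the get/put pair
	 * cannot drift out of balance across scheduler callbacks.
	 */
	pm_runtime_put_autosuspend(job->dev->base.dev);
	kfree(job);
}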
> > > > > Also, in general, this driver has been reviewed (RB'd), but
> > > > > it's not great that I spotted numerous issues within just five
> > > > > minutes. I suggest taking a step back and thoroughly evaluating
> > > > > everything this driver is doing.
> > > >
> > > > Well, if it is hard to get simple drivers right, then it's a
> > > > problem with the subsystem APIs IMO.
> > >
> > > Yes, agreed. We should have assertions and lockdep annotations in
> > > place to catch driver-side misuses. This is the second driver I've
> > > randomly looked at over the past year that has broken DMA fencing
> > > and reclaim rules. I'll take an action item to fix this in the DRM
> > > scheduler, but I'm afraid I'll likely break multiple drivers in
> > > the process, as lockdep will complain about existing misuses.
>
> I've posted a series [1] for the DRM scheduler which will complain
> about the things I've pointed out here.

Thanks. I ran v6 with them and no lockdep splats.
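For anyone else chasing similar splats: my loose understanding is that
the checks are conceptually the existing dma-fence signalling
annotations, which prime lockdep so that, e.g., a GFP_KERNEL
allocation on the run_job path is reported as a lock inversion instead
of surfacing as a rare runtime deadlock. Illustrative only, not the
actual series:

static struct dma_fence *annotated_run_job(struct drm_gpu_scheduler *sched,
					   struct drm_sched_job *sched_job)
{
	struct dma_fence *fence;
	bool cookie;

	/*
	 * Everything between begin/end is treated as a fence signaling
	 * critical section: lockdep flags allocations that can recurse
	 * into reclaim, and locks that are also held around fence
	 * waits.
	 */
	cookie = dma_fence_begin_signalling();
	fence = sched->ops->run_job(sched_job);
	dma_fence_end_signalling(cookie);

	return fence;
}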
Rob