On 11.07.25 15:13, Philipp Stanner wrote:
> On Thu, 2025-07-10 at 08:33 +0000, cao, lin wrote:
>>
>> [AMD Official Use Only - AMD Internal Distribution Only]
>>
>> Hi Christian,
>>
>> Thanks for your suggestion, I modified the patch as:
>
> Looks promising. You'll send a v2 I guess :)
Well, I was just about to reply that a proper v2 should be sent out and
not just the change fragment :)

So Lin, please send a properly formatted v2 patch.

Regards,
Christian.

>
> P.
>
>>
>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>> index e671aa241720..66f2a43c58fd 100644
>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>> @@ -177,6 +177,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>  	struct drm_sched_job *job = container_of(wrk, typeof(*job), work);
>>
>>  	drm_sched_fence_scheduled(job->s_fence, NULL);
>> +	drm_sched_wakeup(job->sched);
>>  	drm_sched_fence_finished(job->s_fence, -ESRCH);
>>  	WARN_ON(job->s_fence->parent);
>>  	job->sched->ops->free_job(job);
>> --
>>
>> Thanks,
>> Lin
>>
>> From: Koenig, Christian <[email protected]>
>> Sent: Thursday, July 10, 2025 15:52
>> To: cao, lin <[email protected]>; [email protected] <[email protected]>; [email protected] <[email protected]>
>> Cc: Yin, ZhenGuo (Chris) <[email protected]>; Deng, Emily <[email protected]>; Matthew Brost <[email protected]>; Danilo Krummrich <[email protected]>; Philipp Stanner <[email protected]>
>> Subject: Re: [PATCH] drm/scheduler: Fix sched hang when killing app with dependent jobs
>>
>> First of all you need to CC the scheduler maintainers; try to use the get_maintainer.pl script. Adding them on CC.
>>
>> On 10.07.25 08:36, Lin.Cao wrote:
>>> When application A submits jobs (a1, a2, a3) and application B submits
>>> job b1 with a dependency on a2's scheduler fence, killing application A
>>> before run_job(a1) causes drm_sched_entity_kill_jobs_work() to force
>>> signal all of A's jobs sequentially. However, because that path never
>>> queues the scheduler's work_run_job or work_free_job work items, the
>>> scheduler goes to sleep and application B hangs.
>>
>> Ah! Because of the optimization for dependencies on the same
>> scheduler in drm_sched_entity_add_dependency_cb().
>>
>> Yeah, that suddenly starts to make sense.
>>
>>> Add drm_sched_wakeup() to drm_sched_entity_kill_jobs_work() to prevent
>>> the scheduler from sleeping and application B from hanging.
>>>
>>> Signed-off-by: Lin.Cao <[email protected]>
>>> ---
>>>  drivers/gpu/drm/scheduler/sched_entity.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
>>> index e671aa241720..a22b0f65558a 100644
>>> --- a/drivers/gpu/drm/scheduler/sched_entity.c
>>> +++ b/drivers/gpu/drm/scheduler/sched_entity.c
>>> @@ -180,6 +180,7 @@ static void drm_sched_entity_kill_jobs_work(struct work_struct *wrk)
>>>  	drm_sched_fence_finished(job->s_fence, -ESRCH);
>>>  	WARN_ON(job->s_fence->parent);
>>>  	job->sched->ops->free_job(job);
>>> +	drm_sched_wakeup(job->sched);
>>
>> That should probably go right after drm_sched_fence_scheduled().
>>
>> Alternatively we could also drop the optimization in
>> drm_sched_entity_add_dependency_cb(); scheduling the work item again
>> has only minimal overhead.
>>
>> Apart from that this looks good to me.
>>
>> Regards,
>> Christian.
>>
>>>  }
>>>
>>>  /* Signal the scheduler finished fence when the entity in question is killed. */
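
For context, the same-scheduler optimization being discussed lives in
drm_sched_entity_add_dependency_cb(). A rough sketch of that fast path
(paraphrased from the upstream scheduler; exact field names, helpers and
surrounding locking may differ between kernel versions):

	struct drm_sched_fence *s_fence = to_drm_sched_fence(fence);

	if (!fence->error && s_fence && s_fence->sched == sched &&
	    !test_bit(DRM_SCHED_FENCE_DONT_PIPELINE, &fence->flags)) {
		/*
		 * The dependency comes from the same scheduler, so wait
		 * only for it to be *scheduled* rather than finished; the
		 * scheduler itself keeps the two jobs ordered.
		 */
		fence = dma_fence_get(&s_fence->scheduled);
		dma_fence_put(entity->dependency);
		entity->dependency = fence;
	}

Because of this fast path, b1 waits only on a2's scheduled fence. When
drm_sched_entity_kill_jobs_work() force-signals that fence, b1 becomes
runnable, but nothing requeues the scheduler's run-job work item, so the
scheduler stays asleep; that is the gap the added drm_sched_wakeup() call
closes.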
