Re: [PATCH v4] drm: Don't free jobs in wait_event_interruptible()

Grodzovsky, Andrey Thu, 26 Sep 2019 10:13:21 -0700

On 9/26/19 11:59 AM, Steven Price wrote:
> On 26/09/2019 16:48, Grodzovsky, Andrey wrote:
>> On 9/26/19 11:23 AM, Steven Price wrote:
>>> On 26/09/2019 16:14, Grodzovsky, Andrey wrote:
>>>> On 9/26/19 10:16 AM, Steven Price wrote:
>>>>> drm_sched_cleanup_jobs() attempts to free finished jobs, however because
>>>>> it is called as the condition of wait_event_interruptible() it must not
>>>>> sleep. Unfortuantly some free callbacks (notibly for Panfrost) do sleep.
>>>>>
>>>>> Instead let's rename drm_sched_cleanup_jobs() to
>>>>> drm_sched_get_cleanup_job() and simply return a job for processing if
>>>>> there is one. The caller can then call the free_job() callback outside
>>>>> the wait_event_interruptible() where sleeping is possible before
>>>>> re-checking and returning to sleep if necessary.
>>>>>
>>>>> Signed-off-by: Steven Price <[email protected]>
>>>>> ---
>>>>> Changes from v3:
>>>>>     * drm_sched_main() re-arms the timeout for the next job after calling
>>>>>       free_job()
>>>>>
>>>>>     drivers/gpu/drm/scheduler/sched_main.c | 45 +++++++++++++++-----------
>>>>>     1 file changed, 26 insertions(+), 19 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/scheduler/sched_main.c 
>>>>> b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> index 9a0ee74d82dc..148468447ba9 100644
>>>>> --- a/drivers/gpu/drm/scheduler/sched_main.c
>>>>> +++ b/drivers/gpu/drm/scheduler/sched_main.c
>>>>> @@ -622,43 +622,41 @@ static void drm_sched_process_job(struct dma_fence 
>>>>> *f, struct dma_fence_cb *cb)
>>>>>     }
>>>>>     
>>>>>     /**
>>>>> - * drm_sched_cleanup_jobs - destroy finished jobs
>>>>> + * drm_sched_get_cleanup_job - fetch the next finished job to be 
>>>>> destroyed
>>>>>      *
>>>>>      * @sched: scheduler instance
>>>>>      *
>>>>> - * Remove all finished jobs from the mirror list and destroy them.
>>>>> + * Returns the next finished job from the mirror list (if there is one)
>>>>> + * ready for it to be destroyed.
>>>>>      */
>>>>> -static void drm_sched_cleanup_jobs(struct drm_gpu_scheduler *sched)
>>>>> +static struct drm_sched_job *
>>>>> +drm_sched_get_cleanup_job(struct drm_gpu_scheduler *sched)
>>>>>     {
>>>>> + struct drm_sched_job *job = NULL;
>>>>>           unsigned long flags;
>>>>>     
>>>>>           /* Don't destroy jobs while the timeout worker is running */
>>>>>           if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
>>>>>               !cancel_delayed_work(&sched->work_tdr))
>>>>> -         return;
>>>>> -
>>>>> +         return NULL;
>>>>>     
>>>>> - while (!list_empty(&sched->ring_mirror_list)) {
>>>>> -         struct drm_sched_job *job;
>>>>> + spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>>     
>>>>> -         job = list_first_entry(&sched->ring_mirror_list,
>>>>> + job = list_first_entry_or_null(&sched->ring_mirror_list,
>>>>>                                          struct drm_sched_job, node);
>>>>> -         if (!dma_fence_is_signaled(&job->s_fence->finished))
>>>>> -                 break;
>>>>>     
>>>>> -         spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>> + if (job && dma_fence_is_signaled(&job->s_fence->finished)) {
>>>>>                   /* remove job from ring_mirror_list */
>>>>>                   list_del_init(&job->node);
>>>>> -         spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>>>> -
>>>>> -         sched->ops->free_job(job);
>>>>> + } else {
>>>>> +         job = NULL;
>>>>> +         /* queue timeout for next job */
>>>>> +         drm_sched_start_timeout(sched);
>>>>>           }
>>>>>     
>>>>> - /* queue timeout for next job */
>>>>> - spin_lock_irqsave(&sched->job_list_lock, flags);
>>>>> - drm_sched_start_timeout(sched);
>>>>>           spin_unlock_irqrestore(&sched->job_list_lock, flags);
>>>>>     
>>>>> + return job;
>>>>>     }
>>>>>     
>>>>>     /**
>>>>> @@ -698,12 +696,21 @@ static int drm_sched_main(void *param)
>>>>>                   struct drm_sched_fence *s_fence;
>>>>>                   struct drm_sched_job *sched_job;
>>>>>                   struct dma_fence *fence;
>>>>> +         struct drm_sched_job *cleanup_job = NULL;
>>>>>     
>>>>>                   wait_event_interruptible(sched->wake_up_worker,
>>>>> -                                  (drm_sched_cleanup_jobs(sched),
>>>>> +                                  (cleanup_job = 
>>>>> drm_sched_get_cleanup_job(sched)) ||
>>>>>                                            (!drm_sched_blocked(sched) &&
>>>>>                                             (entity = 
>>>>> drm_sched_select_entity(sched))) ||
>>>>> -                                  kthread_should_stop()));
>>>>> +                                  kthread_should_stop());
>>>>> +
>>>>> +         while (cleanup_job) {
>>>>> +                 sched->ops->free_job(cleanup_job);
>>>>> +                 /* queue timeout for next job */
>>>>> +                 drm_sched_start_timeout(sched);
>>>>> +
>>>>> +                 cleanup_job = drm_sched_get_cleanup_job(sched);
>>>>> +         }
>>>> Why drm_sched_start_timeout is called both here and inside
>>>> drm_sched_get_cleanup_job ? And also why call it multiple times in the
>>>> loop instead of only once after the loop is done  ?
>>> Christian pointed out to be that the first thing
>>> drm_sched_get_cleanup_job does is call cancel_delayed_work(), and if
>>> that returns false then it bails out with a NULL return. So to actually
>>> get another job (if one exists) the timeout has to be restarted.
>>
>> For this case where timeout work already in progress note that
>> drm_sched_job_timedout restarts the timeout in it's end so it should be
>> ok to restart the timeout unconditionally inside
>> drm_sched_get_cleanup_job as it was done before.
> I may have misinterpreted Christian[1], but I interpreted the below as
> meaning that he'd prefer the caller (drm_sched_main()) to handle the
> drm_sched_start_timeout() in this case:
>
> On 26/09/2019 14:50, Koenig, Christian wrote:
>>> Alternatively the caller could manually re-arm the timeout after
>>> handling the job free.
>> I don't see anything that could go wrong immediately, but that is
>> probably the cleaner approach.
> [1]
> https://lore.kernel.org/lkml/[email protected]/
>
> I wasn't entirely sure if it was safe to leave the timeout running while
> the job free was going on which would be the case if
> drm_sched_get_cleanup_job() re-armed unconditionally.
>
>>> It's also necessary to restart the timeout in the case where the return
>>> is NULL which is handled in the function itself.
>>>
>>> TBH I'm not sure whether this while loop is worth it - it may be better
>>> to replace it with simply:
>>>
>>>     if (cleanup_job) {
>>>             sched->ops->free_job(cleanup_job);
>>>             /* queue timeout for next job */
>>>             drm_sched_start_timeout(sched);
>>>     }
>>>
>>> The outer loop would then handle the next call to
>>> drm_sched_get_cleanup_job() as necessary.
>>
>> What outer loop ?
> There's a loop round the code in drm_sched_main():
>
>       while (!kthread_should_stop()) {
>               [...]
>               wait_event_interruptible(...)
>
>               while (cleanup_job) {...}
>
>               if (!entity)
>                       continue;
>
>               [...]
>       }
>
> So after handling the cleanup_job case, the outer loop will loop back
> round to the wait_event_interruptible() call. One other concern is
> whether it's possible for the thread to be signalled to stop before the
> clean up has been completed. The while loop I added ensures that all
> jobs are actually cleaned up before kthread_should_stop() is checked.
>
> Steve



Yes, seems to me only the inner while loop will guarantee proper free of 
all signaled jobs in mirorr list if the thread is stopped.

Andrey

_______________________________________________
dri-devel mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v4] drm: Don't free jobs in wait_event_interruptible()

Reply via email to