Hi Andrzej and Chris,

pushed to drm-intel-gt-next

Thanks!
Andi

On Wed, Sep 21, 2022 at 03:52:58PM +0200, Andrzej Hajda wrote:
> From: Chris Wilson <[email protected]>
> 
> When we submit a new pair of contexts to ELSP for execution, we start a
> timer by which point we expect the HW to have switched execution to the
> pending contexts. If the promotion to the new pair of contexts has not
> occurred, we declare the executing context to have hung and force the
> preemption to take place by resetting the engine and resubmitting the
> new contexts.
> 
> This can lead to an unfair situation where almost all of the preemption
> timeout is consumed by the first context which just switches into the
> second context immediately prior to the timer firing and triggering the
> preemption reset (assuming that the timer interrupts before we process
> the CS events for the context switch). The second context hasn't yet had
> a chance to yield to the incoming ELSP (and send the ACK for the
> promotion) and so ends up being blamed for the reset.
> 
> If we see that a context switch has occurred since setting the
> preemption timeout, but have not yet received the ACK for the ELSP
> promotion, rearm the preemption timer and check again. This is
> especially significant if the first context was not schedulable and so
> we used the shortest timer possible, greatly increasing the chance of
> accidentally blaming the second innocent context.
> 
> Fixes: 3a7a92aba8fb ("drm/i915/execlists: Force preemption")
> Fixes: d12acee84ffb ("drm/i915/execlists: Cancel banned contexts on schedule-out")
> Reported-by: Tvrtko Ursulin <[email protected]>
> Signed-off-by: Chris Wilson <[email protected]>
> Cc: Tvrtko Ursulin <[email protected]>
> Cc: Andi Shyti <[email protected]>
> Reviewed-by: Andrzej Hajda <[email protected]>
> Tested-by: Andrzej Hajda <[email protected]>
> Cc: <[email protected]> # v5.5+
> ---
> Hi,
> 
> This patch is upstreamed from an internal branch, so I have removed
> Andi's R-b. Andi, let me know if your R-b still applies.
> 
> Regards
> Andrzej
> ---
>  drivers/gpu/drm/i915/gt/intel_engine_types.h  | 15 +++++++++++++
>  .../drm/i915/gt/intel_execlists_submission.c  | 21 ++++++++++++++++++-
>  2 files changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> index 633a7e5dba3b4b..6b5d4ea22b673b 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
> @@ -165,6 +165,21 @@ struct intel_engine_execlists {
>        */
>       struct timer_list preempt;
>  
> +     /**
> +      * @preempt_target: active request at the time of the preemption request
> +      *
> +      * We force a preemption to occur if the pending contexts have not
> +      * been promoted to active upon receipt of the CS ack event within
> +      * the timeout. This timeout may be chosen based on the target,
> +      * using a very short timeout if the context is no longer schedulable.
> +      * That short timeout may not be applicable to other contexts, so
> +      * if a context switch should happen before the preemption
> +      * timeout, we may shoot early at an innocent context. To prevent this,
> +      * we record which context was active at the time of the preemption
> +      * request and only reset that context upon the timeout.
> +      */
> +     const struct i915_request *preempt_target;
> +
>       /**
>        * @ccid: identifier for contexts submitted to this engine
>        */
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 4b909cb88cdfb7..c718e6dc40b515 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -1241,6 +1241,9 @@ static unsigned long active_preempt_timeout(struct intel_engine_cs *engine,
>       if (!rq)
>               return 0;
>  
> +     /* Only allow ourselves to force reset the currently active context */
> +     engine->execlists.preempt_target = rq;
> +
>       /* Force a fast reset for terminated contexts (ignoring sysfs!) */
>       if (unlikely(intel_context_is_banned(rq->context) || bad_request(rq)))
>               return INTEL_CONTEXT_BANNED_PREEMPT_TIMEOUT_MS;
> @@ -2427,8 +2430,24 @@ static void execlists_submission_tasklet(struct tasklet_struct *t)
>       GEM_BUG_ON(inactive - post > ARRAY_SIZE(post));
>  
>       if (unlikely(preempt_timeout(engine))) {
> +             const struct i915_request *rq = *engine->execlists.active;
> +
> +             /*
> +              * If after the preempt-timeout expired, we are still on the
> +              * same active request/context as before we initiated the
> +              * preemption, reset the engine.
> +              *
> +              * However, if we have processed a CS event to switch contexts,
> +              * but not yet processed the CS event for the pending
> +              * preemption, reset the timer allowing the new context to
> +              * gracefully exit.
> +              */
>               cancel_timer(&engine->execlists.preempt);
> -             engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +             if (rq == engine->execlists.preempt_target)
> +                     engine->execlists.error_interrupt |= ERROR_PREEMPT;
> +             else
> +                     set_timer_ms(&engine->execlists.preempt,
> +                                  active_preempt_timeout(engine, rq));
>       }
>  
>       if (unlikely(READ_ONCE(engine->execlists.error_interrupt))) {
> -- 
> 2.34.1
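
For anyone skimming the archive without the sources handy, below is a condensed,
standalone sketch of the decision the two hunks above combine to make: remember
which request the preemption timer was armed against, and on expiry only reset
the engine if we are still executing that request, otherwise rearm the timer for
the newly active context. Every name here (the fake_* types and helpers, the
example timeout values) is a simplified stand-in for illustration, not the real
i915 structures or helpers.

/*
 * Standalone illustration only; fake_* names and the timeout values are
 * made up, they merely mirror the shape of the logic in the patch above.
 */
#include <stdbool.h>
#include <stdio.h>

struct fake_request {
        bool banned;    /* stands in for intel_context_is_banned()/bad_request() */
};

struct fake_execlists {
        const struct fake_request *active;          /* currently executing request */
        const struct fake_request *preempt_target;  /* recorded when the timer was armed */
        bool error_preempt;                         /* stands in for ERROR_PREEMPT */
        unsigned long preempt_timer_ms;             /* stands in for the preempt timer */
};

/* Shaped like active_preempt_timeout(): very short timeout for banned contexts. */
static unsigned long fake_preempt_timeout(struct fake_execlists *el,
                                          const struct fake_request *rq)
{
        if (!rq)
                return 0;

        el->preempt_target = rq;        /* remember whom we are allowed to blame */

        return rq->banned ? 1 : 640;    /* example values in milliseconds */
}

/* Shaped like the tasklet hunk: reset only if still on the recorded target. */
static void fake_preempt_timer_expired(struct fake_execlists *el)
{
        const struct fake_request *rq = el->active;

        el->preempt_timer_ms = 0;       /* cancel_timer() */
        if (rq == el->preempt_target)
                el->error_preempt = true;       /* force the engine reset */
        else
                el->preempt_timer_ms = fake_preempt_timeout(el, rq); /* rearm */
}

int main(void)
{
        struct fake_request first = { .banned = true };
        struct fake_request second = { .banned = false };
        struct fake_execlists el = { .active = &first };

        /* Arm the (very short) timeout against the banned first context. */
        el.preempt_timer_ms = fake_preempt_timeout(&el, el.active);

        /* The HW switches to the innocent second context just before the timer fires. */
        el.active = &second;

        fake_preempt_timer_expired(&el);
        printf("reset the engine: %s, rearmed for %lu ms\n",
               el.error_preempt ? "yes" : "no", el.preempt_timer_ms);
        return 0;
}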
