On 5/15/2025 11:37 AM, Jesus Narvaez wrote:
There is a rare race condition when preparing for a reset where
guc_lrc_desc_unpin() could be in the process of deregistering a context
while a different thread is scrubbing outstanding contexts and it alters
the context state and does a wakeref put. Then, if there is a failure
with deregister_context(), a second wakeref put could occur. As a result
the wakeref count could drop below 0 and fail an INTEL_WAKEREF_BUG_ON()
check.

Therefore if there is a failure with deregister_context(), undo the
context state changes and do a wakeref put only if the context was set
to be destroyed earlier.

Fixes: 2f2cc53b5fe7 ("drm/i915/guc: Close deregister-context race against 
CT-loss")
Signed-off-by: Jesus Narvaez <[email protected]>
Cc: Daniele Ceraolo Spurio <[email protected]>
Cc: Alan Previn <[email protected]>
Cc: Anshuman Gupta <[email protected]>
Cc: Mousumi Jana <[email protected]>
Cc: Rodrigo Vivi <[email protected]>
Cc: Matt Roper <[email protected]>
---
  .../gpu/drm/i915/gt/uc/intel_guc_submission.c    | 16 +++++++++++-----
  1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index f8cb7c630d5b..f066c2dd7114 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -3441,20 +3441,26 @@ static inline int guc_lrc_desc_unpin(struct 
intel_context *ce)
/*
         * GuC is active, lets destroy this context, but at this point we can 
still be racing
-        * with suspend, so we undo everything if the H2G fails in 
deregister_context so
-        * that GuC reset will find this context during clean up.
+        * with suspend, so we undo everything if the H2G fails in 
deregister_context
+        * and if the context was scheduled to be destroyed so that GuC reset 
will
+        * find this context during clean up.

Just adding "and if the context was scheduled to be destroyed" to the comment is not very clear, because looking at this function it is hard to consider that the state might change due to the reset code kicking in. This needs to be expanded a bit to explain the race.

Daniele

         */
        ret = deregister_context(ce, ce->guc_id.id);
        if (ret) {
+               bool pending_destroyed;
                spin_lock_irqsave(&ce->guc_state.lock, flags);
-               set_context_registered(ce);
-               clr_context_destroyed(ce);
+               pending_destroyed = context_destroyed(ce);
+               if (pending_destroyed) {
+                       set_context_registered(ce);
+                       clr_context_destroyed(ce);
+               }
                spin_unlock_irqrestore(&ce->guc_state.lock, flags);
                /*
                 * As gt-pm is awake at function entry, intel_wakeref_put_async 
merely decrements
                 * the wakeref immediately but per function spec usage call 
this after unlock.
                 */
-               intel_wakeref_put_async(&gt->wakeref);
+               if (pending_destroyed)
+                       intel_wakeref_put_async(&gt->wakeref);
        }
return ret;

Reply via email to