After chatting offline with Matt, it became apparent that
I somehow missed the fact that the ctb processing handler
is already running in a work queue. That said, Matt is correct:
I don't need to create a separate work queue to extract that
capture log into the interim-store. Calling the extraction
directly from the handler also eliminates the race condition,
since the context-reset notification comes in later via the
same queue.
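To convince myself of the ordering, here is a toy model (purely
illustrative, not the actual i915/ctb code): when a single worker drains
one FIFO, the capture-extraction handler always completes before the
context-reset handler runs, so no extra queue or lock is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one worker drains one FIFO, so handlers serialize.
 * EVENT_CAPTURE is fully handled before EVENT_RESET. All names
 * here are illustrative; the real handler lives in the ctb code. */
enum event { EVENT_CAPTURE, EVENT_RESET };

static bool capture_extracted;
static bool reset_saw_capture;

static void handle(enum event e)
{
	if (e == EVENT_CAPTURE)
		capture_extracted = true;              /* pull log into interim-store */
	else
		reset_saw_capture = capture_extracted; /* coredump reads interim-store */
}

/* Single worker draining the queue in order: no race by construction. */
static void drain(const enum event *q, int n)
{
	for (int i = 0; i < n; i++)
		handle(q[i]);
}
```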

Thanks again Matt.

....alan


On Tue, 2021-12-07 at 21:15 -0800, Alan Previn Teres Alexis wrote:
> Thanks, Matt, for reviewing. Responses to your questions below;
> I will fix the rest in the next rev.
>  
> > > @@ -4013,10 +4016,11 @@ int intel_guc_error_capture_process_msg(struct intel_guc *guc,
> > >           return -EPROTO;
> > >   }
> > >  
> > > - status = msg[0];
> > > - drm_info(&guc_to_gt(guc)->i915->drm, "Got error capture: status = %d", status);
> > > + status = msg[0] & INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_MASK;
> > > + if (status == INTEL_GUC_STATE_CAPTURE_EVENT_STATUS_NOSPACE)
> > > +         drm_warn(&guc_to_gt(guc)->i915->drm, "G2H-Error capture no space\n");
> > >  
> > > - /* Add extraction of error capture dump */
> > > + intel_guc_capture_store_snapshot(guc);
> > 
> > This is done in different worker, right? How does this not race with an
> > engine reset notification that does an error capture (e.g. the error
> > capture is done before we read out the info from the GuC)?
> > 
> I believe the GuC error-state-capture notification event comes right before
> context resets, not engine resets. Also, the i915_gpu_coredump subsystem
> gathers HW state in response to a context hang, and does so prior to the
> HW reset, so the engine-reset notification doesn't play a role here.
> However, since the context-reset notification is expected to come right
> after the error-state-capture notification, your argument is valid in this
> case: a race condition can exist when the context-reset event leads to
> calling the i915_gpu_coredump subsystem (which in turn gets a pointer to
> the intel_guc_capture module). Even there, though, no actual parsing is
> done yet; I currently wait for the actual debugfs call before parsing any
> of the data. As a fix, how about I add a flush_work at the point where the
> parsing actually happens, even later?
> 
> 
> > As far as I can tell 'intel_guc_capture_store_snapshot' doesn't allocate
> > memory so I don't think we need a worker here.
> > 
> Yes, I am not doing any allocation, but the worker function does take the
> capture_store's mutex, to ensure we don't trample the circular-buffer
> pointers of the interim-store (the one between the guc-log-buffer and the
> error-capture reporting). I am not sure spin_lock_irqsave would be safe
> here: in the event of back-to-back error captures, we wouldn't want to
> spin for long if the reporting side happens to be actively parsing the
> bytestream output of the same interim buffer.
> 
> After thinking a bit more, a lockless solution is possible: I could
> double-buffer the interim-store circular buffer, so that when the event
> comes in, I flip to the next "free" interim-store (one that isn't filled
> with pending logs waiting to be read, or currently being read). If no
> interim-store is available (i.e. we've already had two error-state
> captures that have yet to be read/flushed), then we just drop the new
> capture on the floor. (This last part would be in line with the current
> execlist model, unless something changed there in the last 2 weeks.)
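Sketching the double-buffer idea above, with purely illustrative names
(not the real i915 structures): two fixed slots, the event path claims a
free slot or drops the capture, and the reader releases a slot once it
has parsed it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch of a double-buffered interim-store: two slots;
 * the capture event flips to a free slot, and if both slots still hold
 * unread captures the new one is dropped (matching the drop-on-the-floor
 * behaviour described above). Illustrative names only. */
enum slot_state { SLOT_FREE, SLOT_PENDING };

struct interim_store {
	enum slot_state state[2];
	int write_idx;          /* slot the next capture will fill */
};

/* Capture-event path: claim a free slot or drop the capture. */
static bool store_capture(struct interim_store *s)
{
	int idx = s->write_idx;

	if (s->state[idx] != SLOT_FREE)
		return false;           /* both slots pending: drop */

	s->state[idx] = SLOT_PENDING;   /* capture lands here */
	s->write_idx = !idx;            /* next capture uses the other slot */
	return true;
}

/* Reader path (e.g. debugfs): release a slot once parsed. */
static void release_slot(struct interim_store *s, int idx)
{
	s->state[idx] = SLOT_FREE;
}
```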
> 
> However, the simplest solution with this series as it stands today would
> be to flush_work much later, at the time debugfs is called to get the
> error-capture report (since that is the last chance to resolve the
> possible race condition). I could call flush_work earlier, at the
> context-reset notification, but that too would be an irq-handler path.
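The flush-before-parse shape, as a self-contained toy (stand-ins for
flush_work and the extraction worker; not the real i915 API): the read
path makes sure any pending extraction has run before it walks the store.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of "flush the extraction work before parsing": the debugfs
 * read path ensures any queued extraction has completed before it walks
 * the interim-store. All names here are illustrative. */
static bool extraction_queued;
static bool store_populated;

static void extraction_work(void)
{
	store_populated = true;   /* copy capture log into interim-store */
	extraction_queued = false;
}

/* Stand-in for flush_work(): run the work now if still pending. */
static void flush_extraction(void)
{
	if (extraction_queued)
		extraction_work();
}

/* debugfs read path: flush first, then parse a consistent store. */
static bool read_error_capture(void)
{
	flush_extraction();
	return store_populated;
}
```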
> 
> > Matt
> > 
> > >  
> > >   return 0;
> > >  }
> > > -- 
> > > 2.25.1
> > > 