Revisiting below hunk of patch-7 comment, as per offline discussion with Matt,
there is little benefit to even making that guc-id lookup because:

1. the delay between the context reset notification (when the vmas are copied
   and when we verify we had received a guc err capture dump) may be 
subjectively
   large enough and not tethered that the guc-id may have already been 
re-assigned.

2. I was really looking for some kind of unique context handle to print out 
that could
   be correlated (by user inspecting the dump) back to a unique app or process 
or
   context-id but cant find such a param in struct intel_context.

As part of further reviewing the end to end flows and possible error scenarios, 
there
also may potentially be a mismatch between "which context was reset by guc at 
time-n"
vs "which context's vma buffers is being printed out at time-n+x" if
we are experiencing back-to-back resets and the user dumped the debugfs x-time 
later.

(Recap: First, guc notifies capture event, second, guc notifies context reset 
during
which we trigger i915_gpu_coredump. In this second step, the vma's are dumped 
and we
verify that the guc capture happened but don't parse the guc-err-capture-logs 
yet.
Third step is when user triggers the debugfs to dump which is when we parse the 
error
capture logs.)

As a fix, what we can do in the guc_error_capture report out is to ensure that
we dont re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps since the i915_gpu_coredump would have only captured 
the vma's
for the very first context that was reset. And with guc-submission, that would 
always
correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the 
guc-error-capture
logs are large enough to hold data for multiple dumps).

The changes (removal of below-hunk and adding of only-print-the-first-vma") is 
trivial
but i felt it warranted a good explanation. Apologies for the inbox noise.

...alan

On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
> Thanks again for the detailed review here.
> Will fix all the rest on next rev.
> One special response for this one:
> 
> 
> On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
> > On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
> > > +                 if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
> > > +                         GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
> > > +                         eng_inst = 
> > > FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
> > > +                         eng = guc_lookup_engine(guc, engineclass, 
> > > eng_inst);
> > > +                         if (eng) {
> > > +                                 GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, 
> > > eng);
> > > +                         } else {
> > > +                                 PRINT(&i915->drm, ebuf, "    
> > > i915-Eng-Lookup Fail!\n");
> > > +                         }
> > > +                         ce = guc_context_lookup(guc, data.guc_ctx_id);
> > 
> > You are going to need to reference count the 'ce' here. See
> > intel_guc_context_reset_process_msg for an example. 
> > 
> 
> Oh crap - i missed this one - which you had explicitly mentioned offline when 
> i was doing the
> development. Sorry about that i just totally missed it from my todo-notes.
> 
> ...alan

Reply via email to