Hi,

Somehow I stumbled on this while browsing through the mailing list.

On 23/12/2021 18:54, Teres Alexis, Alan Previn wrote:
Revisiting the below hunk from the patch-7 comment: as per an offline discussion with Matt,
there is little benefit to even making that guc-id lookup, because:

1. the delay between the context-reset notification (when the vmas are copied)
   and when we verify we have received a guc-err-capture dump may be large
   enough, and is not bounded, so the guc-id may already have been re-assigned.

2. I was really looking for some kind of unique context handle to print out
   that could be correlated (by a user inspecting the dump) back to a unique
   app, process, or context-id, but I can't find such a param in
   struct intel_context.

As part of further reviewing the end-to-end flows and possible error scenarios,
there may also be a mismatch between "which context was reset by guc at time n"
vs "which context's vma buffers are being printed out at time n+x" if we
experience back-to-back resets and the user dumps the debugfs x time later.

What does this all actually mean? It sounds rather alarming: will it simply not
be possible to know which context, belonging to which process, was reset? And
because the guc_id is potentially re-assigned, might even the captured VMAs not
be the correct ones?

Regards,

Tvrtko


(Recap: First, guc notifies the capture event. Second, guc notifies the context
reset, during which we trigger i915_gpu_coredump; in this second step the vmas
are dumped and we verify that the guc capture happened, but we don't parse the
guc-err-capture logs yet. The third step is when the user triggers the debugfs
dump, which is when we parse the error-capture logs.)
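(Aside: the three steps in the recap could be modeled as a tiny state machine.
The sketch below is purely illustrative; the enum and function names are made
up and are not the real i915 code.)

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states for the three-step capture flow recapped above. */
enum capture_state {
	CAP_IDLE,          /* nothing pending */
	CAP_GUC_NOTIFIED,  /* step 1: guc capture-event notification */
	CAP_COREDUMPED,    /* step 2: context reset, vmas dumped */
	CAP_PARSED,        /* step 3: debugfs read, logs parsed */
};

/* Step 1: guc notifies the capture event. */
static bool on_guc_capture_event(enum capture_state *s)
{
	if (*s != CAP_IDLE)
		return false;
	*s = CAP_GUC_NOTIFIED;
	return true;
}

/* Step 2: guc notifies the context reset; the vmas are dumped and we
 * only verify that the guc capture happened. */
static bool on_context_reset(enum capture_state *s)
{
	if (*s != CAP_GUC_NOTIFIED)
		return false;
	*s = CAP_COREDUMPED;
	return true;
}

/* Step 3: the user reads the debugfs node; only now are the
 * guc-err-capture logs actually parsed. */
static bool on_debugfs_dump(enum capture_state *s)
{
	if (*s != CAP_COREDUMPED)
		return false;
	*s = CAP_PARSED;
	return true;
}
```

The ordering constraint is the point: parsing (step 3) cannot legally happen
before the coredump (step 2), which is exactly the window in which a guc-id
could be re-assigned.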

As a fix, what we can do in the guc_error_capture report-out is to ensure that
we don't re-print the previously dumped vmas if we end up finding multiple
guc-error-capture dumps, since i915_gpu_coredump would have captured the vmas
only for the very first context that was reset. With guc-submission, that would
always correlate to the "next-yet-to-be-parsed" guc-err-capture dump (since the
guc-error-capture logs are large enough to hold data for multiple dumps).
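(The "only-print-the-first-vma" rule could be sketched roughly as below; the
struct and function names are invented for illustration and differ from the
real report-out code.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one parsed guc-err-capture dump. */
struct fake_guc_dump {
	int guc_ctx_id;
};

/* Walk every dump found in the capture logs, but associate the vmas
 * with the first one only: i915_gpu_coredump holds vmas only for the
 * first context that was reset, which with guc-submission corresponds
 * to the oldest (next-yet-to-be-parsed) dump. Returns the number of
 * dumps reported. */
static size_t report_dumps(const struct fake_guc_dump *dumps, size_t n,
			   bool *vmas_printed)
{
	size_t i;

	*vmas_printed = false;
	for (i = 0; i < n; i++) {
		/* register state would be printed for every dump here */
		if (i == 0)
			*vmas_printed = true; /* vmas belong to dump 0 only */
	}
	return i;
}
```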

The changes (removal of the below hunk and addition of
"only-print-the-first-vma") are trivial, but I felt they warranted a good
explanation. Apologies for the inbox noise.

...alan

On Tue, 2021-12-07 at 22:32 -0800, Alan Previn Teres Alexis wrote:
Thanks again for the detailed review here.
Will fix all the rest on next rev.
One special response for this one:


On Tue, 2021-12-07 at 16:22 -0800, Matthew Brost wrote:
On Mon, Nov 22, 2021 at 03:04:02PM -0800, Alan Previn wrote:
+                       if (datatype == GUC_CAPTURE_LIST_TYPE_ENGINE_INSTANCE) {
+                               GCAP_PRINT_GUC_INST_INFO(i915, ebuf, data);
+                               eng_inst = FIELD_GET(GUC_CAPTURE_DATAHDR_SRC_INSTANCE, data.info);
+                               eng = guc_lookup_engine(guc, engineclass, eng_inst);
+                               if (eng) {
+                                       GCAP_PRINT_INTEL_ENG_INFO(i915, ebuf, eng);
+                               } else {
+                                       PRINT(&i915->drm, ebuf, "    i915-Eng-Lookup Fail!\n");
+                               }
+                               ce = guc_context_lookup(guc, data.guc_ctx_id);

You are going to need to reference count the 'ce' here. See
intel_guc_context_reset_process_msg for an example.
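(For context, the discipline being asked for is roughly the get/put pattern
below: the lookup takes a reference before returning so the context cannot be
freed while the capture code is still using it. This is a standalone sketch
with made-up types; the real driver uses the kref-based
intel_context_get()/intel_context_put() helpers.)

```c
#include <assert.h>
#include <stddef.h>

/* Toy context object with a plain reference count (the real
 * intel_context uses a kref). */
struct fake_ctx {
	int refcount;
	int guc_id;
};

/* A lookup that takes a reference before returning, so the caller
 * holds the object alive until it calls the matching put. */
static struct fake_ctx *fake_ctx_lookup_get(struct fake_ctx *table,
					    size_t n, int guc_id)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (table[i].guc_id == guc_id) {
			table[i].refcount++; /* get */
			return &table[i];
		}
	}
	return NULL;
}

static void fake_ctx_put(struct fake_ctx *ce)
{
	if (ce)
		ce->refcount--; /* put; actual freeing elided here */
}
```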


Oh crap - I missed this one - which you had explicitly mentioned offline when I
was doing the development. Sorry about that, I just totally missed it in my
todo-notes.

...alan
