Quoting Tvrtko Ursulin (2018-05-16 10:49:34)
>
> On 16/05/2018 10:25, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2018-05-16 10:15:34)
> >>
> >> On 16/05/2018 07:47, Chris Wilson wrote:
> >>> We cannot call kthread_park() from softirq context, so let's avoid it
> >>> entirely during the reset. We wanted to suspend the signaler so that it
> >>> would not mark a request as complete at the same time as we marked it as
> >>> being in error. Instead of parking the signaling, stop the engine from
> >>> advancing so that the GPU doesn't emit the breadcrumb for our chosen
> >>> "guilty" request.
> >>>
> >>> Signed-off-by: Chris Wilson <[email protected]>
> >>> Cc: Mika Kuoppala <[email protected]>
> >>> Cc: Michał Winiarski <[email protected]>
> >>> CC: Michel Thierry <[email protected]>
> >>> Cc: Jeff McGee <[email protected]>
> >>> Cc: Tvrtko Ursulin <[email protected]>
> >>> ---
> >
> >>> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>> b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>> index af5a178366ed..bb88a92fcc1e 100644
> >>> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> >>> @@ -531,8 +531,26 @@ static int init_ring_common(struct intel_engine_cs
> >>> *engine)
> >>> return ret;
> >>> }
> >>>
> >>> +static void set_stop_engine(struct intel_engine_cs *engine)
> >>> +{
> >>> + struct drm_i915_private *dev_priv = engine->i915;
> >>> + const u32 base = engine->mmio_base;
> >>> + const i915_reg_t mode = RING_MI_MODE(base);
> >>> +
> >>> + I915_WRITE_FW(mode, _MASKED_BIT_ENABLE(STOP_RING));
> >>> + if (__intel_wait_for_register_fw(dev_priv,
> >>> + mode, MODE_IDLE, MODE_IDLE,
> >>> + 1000, 0,
> >>> + NULL))
> >>> + DRM_DEBUG_DRIVER("%s: timed out on STOP_RING\n",
> >>> + engine->name);
> >>> +}
> >>
> >> Looks to be exactly the same implementation as in intel_lrc.c apart from
> >> debug vs trace. Move to intel_engine_cs.c?
> >
> > Not unless you also suggest a name I like ;)
> >
> > Mika once had plans for engine->stop() so I didn't think too hard about
> > this bit, expecting it to be superseded.
>
> Vfunc, or intel_engine_stop, anything but two of the same.
>
> Timed out message should also possibly even be upgraded to notice level.
It doesn't make any difference unless an error occurs later. The
STOP_RING will time out, we know it does. MODE_IDLE will only be set on
the next arbitration point (aiui) :(
Such cases are usually only resolved by wedging as the reset itself also
fail.
> >>> static struct i915_request *reset_prepare(struct intel_engine_cs
> >>> *engine)
> >>> {
> >>> + if (INTEL_GEN(engine->i915) >= 3)
> >>> + set_stop_engine(engine);
> >>> +
> >>> if (engine->irq_seqno_barrier)
> >>> engine->irq_seqno_barrier(engine);
> >>>
> >>>
> >>
> >> Most important question - is stopping the engine expect to mostly work
> >> with a palette of different hang types?
> >
> > The good news is that if the engine is hung, it doesn't matter! So what
> > it reliably stops is Command Streamer execution along the ring, and we
> > only need to rely on it stopping before the MI_STORE_DWORD of the
> > breadcrumb to be sure that we won't see the breadcrumb advance as we
> > process reset.
> >
> > Now what's stopping the breadcrumb still being in flight (from the GPU)
> > as we sample it (from the CPU)?
> > Not much.
>
> If it reliably stops the command streamer then that sounds good enough
> to me.
>
> The in-flight write should be unlikely, since we detected a hang that
> means any activity happened long time ago.
Rule of thumb is to always use "hang" ;)
- we may have falsely declared a hang and the gpu was just slow
- another engine is hung, this one is chugging along oblivious
However, the uncached read is just what I would use to allow the write
from the GPU to be visible to the CPU, so maybe just add another?
-Chris
_______________________________________________
Intel-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/intel-gfx