On Tue, Jun 23, 2015 at 03:06:49PM +0100, Tomas Elf wrote:
> On 23/06/2015 12:38, Daniel Vetter wrote:
> >On Tue, Jun 23, 2015 at 11:47:16AM +0100, Tomas Elf wrote:
> >>On 23/06/2015 11:05, Daniel Vetter wrote:
> >>>Your patches don't apply cleanly any more and I can't find a suitable baseline where they would. But I'd like to see it all in context to check a few things.
> >>>
> >>>Can you pls push a git branch with these somewhere?
> >>
> >>Here's the baseline for my local tree:
> >>cd07637 - drm-intel-nightly: 2015y-04m-13d-09h-46m-59s UTC integration manifest <[email protected]>
> >
> >I don't have that baseline around here (any more at least). Happens regularly with rebasing trees.
> >
> >>I haven't updated it in a while obviously since I thought that could wait until we'd worked our way through the RFC series and I could get to work on the first real patch series.
> >>
> >>Is it possible for you to set up a local tree of your own with my baseline and my RFC patches on top, or would you prefer it if I push my branch to drm-private?
> >
> >So yeah I need your branch ;-)
> 
> I pushed my branch, 20150608_TDR_upstream_adaptation_single-thread_hangchecking_RFC_delivery_sendmail_1, to drm-private:
> 
> https://git-amr-2.devtools.intel.com/gerrit/gitweb?p=otc_gen_graphics-drm-private.git;a=shortlog;h=refs/heads/20150608_TDR_upstream_adaptation_single-thread_hangchecking_RFC_delivery_sendmail_1
For public stuff something public is preferred, since then I don't have to needle it through the firewall, which tends to be a pain and all that. Personally I just have a priv remote pointing to something public, and whenever I need to upload something for someone I do a

$ git push priv +HEAD:for-$person

The point of all this is that I can do a

$ git fetch $url $branch
$ git checkout FETCH_HEAD

and look at the tree with the full power of local editors. I'll try and see whether I can coax gerrit into cooperation ...
-Daniel

> 
> Will that work or do you need something else?
> 
> Thanks,
> Tomas
> 
> >-Daniel
> >>
> >>Thanks,
> >>Tomas
> >>
> >>>Thanks, Daniel
> >>>
> >>>On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> >>>>This patch series introduces the following features:
> >>>>
> >>>>* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> >>>>
> >>>>TDR is an umbrella term for anything that goes into detecting and recovering from GPU hangs and is a term more widely used outside of the upstream driver. This feature introduces an extensible framework that currently supports gen8 but that can be easily extended to support gen7 as well (which is already the case in GMIN, but unfortunately in a not quite upstreamable form). The code contained in this submission represents the essentials of what is currently in GMIN merged with what is currently in upstream (as of the time when this work commenced a few months back).
> >>>>
> >>>>This feature adds a new hang recovery path alongside the legacy GPU reset path, which takes care of engine recovery only. Aside from adding support for per-engine recovery, this feature also introduces rules for how to promote a potential per-engine reset to a legacy, full GPU reset.
> >>>>
> >>>>The hang checker now integrates with the error handler in a slightly different way in that it allows hang recovery on multiple engines at the same time, by passing an engine flag mask to the error handler in which the flags of all hung engines are set. This allows us to schedule hang recovery once for all currently hung engines instead of one hang recovery per detected engine hang. Previously, when only full GPU reset was supported, this made no difference: whether one or four engines were hung at any given point, it would all amount to the same thing - the GPU getting reset. As it stands now, the behaviour differs depending on which engine is hung, since each engine is reset separately from all the other engines, so we have to think about this in terms of scheduling cost and recovery latency. (see open question below)
> >>>>
> >>>>OPEN QUESTIONS:
> >>>>
> >>>>   1. Do we want to investigate the possibility of per-engine hang detection? In the current upstream driver there is only one work queue that handles the hang checker, and everything from initial hang detection to final hang recovery runs in this thread. This makes sense if you're only supporting one form of hang recovery - using full GPU reset and nothing tied to any particular engine. However, as part of this patch series we're changing that by introducing per-engine hang recovery.
> >>>>   It could make sense to introduce multiple work queues - one per engine - to run multiple hang checking threads in parallel.
> >>>>
> >>>>   This would potentially allow savings in terms of recovery latency, since the hang checker would not have to scan all engines every time it is scheduled, and the error handler would not have to scan all engines every time it is scheduled either. Instead, we could implement one work queue per engine that invokes a hang checker which checks only _that_ particular engine, after which the error handler is invoked for _that_ particular engine. If one engine has hung, the latency for getting to the hang recovery path for that particular engine would be (time for hang checking one engine) + (time for error handling one engine), rather than the time it takes to do hang checking for all engines plus the time it takes to do error handling for all engines that have been detected as hung (which in the worst case would be all engines). There would potentially be as many hang checking and error handling threads going on concurrently as there are engines in the hardware, but they would all run in parallel without any significant locking. The first point at which any thread needs exclusive access to the driver is the actual hang recovery, but the time it takes to get there would theoretically be lower, and the time it actually takes to do per-engine hang recovery is quite a lot lower than the time it takes to reliably detect a hang.
> >>>>
> >>>>   How much we would save by such a change still needs to be analysed and compared against the current single-thread model, but it makes sense from a theoretical design point of view.
> >>>>
> >>>>   2. How does per-engine reset integrate with the public reset stats IOCTL? These stats are used for the GL robustness interface, and those tests are currently failing when running per-engine hang recovery, since we treat per-engine recovery differently from full GPU recovery, which is something userland knows nothing about. When userland expects to hang the hardware it expects the reset stats interface to reflect this, and that behaviour has changed as part of this code submission. There's more than one way to solve this. Here are two options:
> >>>>
> >>>>      1. Expose per-engine reset statistics and mark contexts as guilty the same way for per-engine resets as for full GPU resets.
> >>>>
> >>>>         That would make this change to the hang recovery mechanism transparent to userland, but it would change the semantics, since an active context in the reset stats no longer implies that the GPU was fully reset.
> >>>>
> >>>>      2. Add a new set of statistics for per-engine reset (one group of statistics for each engine) to reflect the extended capabilities that per-engine hang recovery offers.
> >>>>
> >>>>         Would that be breaking the ABI?
> >>>>
> >>>>         ... Or is there some other way of doing this?
> >>>>
> >>>>* Feature 2: Watchdog Timeout (a.k.a. "media engine reset") for gen8.
> >>>>
> >>>>This feature allows userland applications to control whether or not individual batch buffers should have a first-level, fine-grained, hardware-based hang detection mechanism on top of the ordinary, software-based periodic hang checker that is already in the driver.
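As an illustration of the per-engine hang checking idea raised in open question 1 above (and of the engine flag mask described under Feature 1), here is a minimal sketch of one hang-check work item per engine. Every name in it is hypothetical; it is not code from the patch series and only serves to make the scheduling-cost discussion more concrete.

/* Illustrative sketch only -- hypothetical names, not part of these patches. */
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

#define HANGCHECK_PERIOD_MS 1500    /* roughly the period of the current hang checker */

struct hypothetical_engine {
    unsigned int id;                        /* bit position in the engine flag mask */
    struct delayed_work hangcheck_work;     /* one hang-check work item per engine */
};

/* Assumed helpers: per-engine hang detection and an error handler entry
 * point that takes a mask of hung engines, as described under Feature 1. */
static bool engine_seems_hung(struct hypothetical_engine *engine);
static void schedule_hang_recovery(unsigned int hung_engine_mask);

static void engine_hangcheck_fn(struct work_struct *work)
{
    struct hypothetical_engine *engine =
        container_of(to_delayed_work(work),
                     struct hypothetical_engine, hangcheck_work);

    /* Only this engine is inspected, so the check stays cheap and can run
     * concurrently with the other engines' checks. */
    if (engine_seems_hung(engine))
        schedule_hang_recovery(1u << engine->id);

    /* Re-arm hang checking for this engine only. */
    schedule_delayed_work(&engine->hangcheck_work,
                          msecs_to_jiffies(HANGCHECK_PERIOD_MS));
}

The trade-off raised in the question shows up directly in this shape: each engine pays only for its own check and recovery scheduling, but the number of concurrently running work items grows with the number of engines, and exclusive access to the driver is still needed once actual recovery starts.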
> >>>>The advantage over relying solely on the current software-based hang checker is that the watchdog timeout mechanism is about 1000x quicker and more precise. Since it is not a full driver-level hang detection mechanism but only targets one individual batch buffer at a time, it can afford to be that quick without risking an increase in false positive hang detection.
> >>>>
> >>>>This feature includes the following changes:
> >>>>
> >>>>a) Watchdog timeout interrupt service routine for handling watchdog interrupts and connecting these to per-engine hang recovery.
> >>>>
> >>>>b) Injection of watchdog timer enablement/cancellation instructions before/after the batch buffer start instruction in the ring buffer, so that watchdog timeout is tied to the submission of an individual batch buffer.
> >>>>
> >>>>c) Extension of the DRM batch buffer interface, exposing the watchdog timeout feature to userland. We've got two open source groups in VPG currently in the process of integrating support for this feature, which should make it possible, in principle, to upstream this extension.
> >>>>
> >>>>There is currently full watchdog timeout support for gen7 in GMIN and it is quite similar to the gen8 implementation, so there is nothing obvious that prevents us from upstreaming that code along with the gen8 code. However, watchdog timeout is fully dependent on the per-engine hang recovery path, and that is not part of this code submission for gen7. Therefore watchdog timeout support for gen7 has been excluded until per-engine hang recovery support for gen7 has landed upstream.
> >>>>
> >>>>As part of this submission we've had to reinstate the work queue that was previously in place between the error handler and the hang recovery path. The reason for this is that the per-engine recovery path is called directly from the interrupt handler in the case of watchdog timeout. In that situation there is no way of grabbing the struct_mutex, which is a requirement for the hang recovery path. Therefore, by reinstating the work queue we provide a unified execution context for the hang recovery code, allowing it to grab whatever locks it needs without sacrificing interrupt latency too much or sleeping indefinitely in hard interrupt context.
> >>>>
> >>>>* Feature 3: Context Submission Status Consistency checking
> >>>>
> >>>>Something that becomes apparent when you run long-duration operations tests with concurrent rendering processes and intermittently injected hangs is that the GPU sometimes seems to forget to send context completion interrupts to the driver. What this means is that the driver sometimes gets stuck on a context that never seems to finish, all the while the hardware has completed it and is waiting for more work.
> >>>>
> >>>>The problem with this is that the per-engine hang recovery path relies on context resubmission to kick off the hardware again following an engine reset. This can only be done safely if the hardware and driver share the same opinion about the current state.
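As an aside on the reinstated hang recovery work queue mentioned under Feature 2 above: the underlying pattern is deferral from hard interrupt context, where struct_mutex cannot be taken, to process context, where it can. The sketch below illustrates only that general pattern; all names are hypothetical, and in the real driver the watchdog interrupt is handled as part of the existing interrupt handling rather than by a stand-alone handler.

/* Illustrative sketch only -- hypothetical names, not the driver code. */
#include <linux/interrupt.h>
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/bitops.h>

#define HYPOTHETICAL_NUM_ENGINES 4

struct hypothetical_recovery {
    struct work_struct work;        /* the reinstated hang recovery work */
    unsigned long pending_engines;  /* one bit per engine awaiting recovery */
    struct mutex *struct_mutex;     /* the lock the recovery path needs */
};

/* Process context: sleeping and taking struct_mutex are allowed here. */
static void recovery_work_fn(struct work_struct *work)
{
    struct hypothetical_recovery *r =
        container_of(work, struct hypothetical_recovery, work);
    unsigned int id;

    mutex_lock(r->struct_mutex);
    for (id = 0; id < HYPOTHETICAL_NUM_ENGINES; id++) {
        if (!test_and_clear_bit(id, &r->pending_engines))
            continue;
        /* ... per-engine reset and context resubmission for engine 'id' ... */
    }
    mutex_unlock(r->struct_mutex);
}

/* Hard interrupt context: must not sleep, so only record which engine's
 * watchdog fired and kick the work item. */
static irqreturn_t watchdog_irq_handler(int irq, void *arg)
{
    struct hypothetical_recovery *r = arg;
    unsigned int hung_engine_id = 0;    /* e.g. decoded from the interrupt status */

    set_bit(hung_engine_id, &r->pending_engines);
    schedule_work(&r->work);

    return IRQ_HANDLED;
}

The interrupt handler is reduced to recording which engine fired and queuing the work item, while the work function is free to sleep and take whatever locks the recovery path requires, which is exactly the trade-off the cover letter describes.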
> >>>>Therefore we've extended the periodic hang checker to check for context submission state inconsistencies, in addition to the hang checking it already does.
> >>>>
> >>>>If such a state is detected, it is assumed (based on experience) that a context completion interrupt has been lost somehow. If this state persists for some time, an attempt to correct it is made by faking the presumably lost context completion interrupt: the execlist interrupt handler, which is normally called from the main interrupt handler in response to a received context event interrupt, is called manually. Just because an interrupt goes missing does not mean that the context status buffer (CSB) does not get appropriately updated by the hardware, which means that we can expect to find all the recent changes to the context states for each engine captured there. If there are outstanding context status changes in store there, the faked context event interrupt will allow the interrupt handler to act on them. In the case of lost context completion interrupts this will prompt the driver to remove the already completed context from the execlist queue and move on to the next pending piece of work, thereby eliminating the inconsistency.
> >>>>
> >>>>* Feature 4: Debugfs extensions for per-engine hang recovery and TDR/watchdog trace points.
> >>>>
> >>>>
> >>>>Tomas Elf (11):
> >>>>  drm/i915: Early exit from semaphore_waits_for for execlist mode.
> >>>>  drm/i915: Introduce uevent for full GPU reset.
> >>>>  drm/i915: Add reset stats entry point for per-engine reset.
> >>>>  drm/i915: Adding TDR / per-engine reset support for gen8.
> >>>>  drm/i915: Extending i915_gem_check_wedge to check engine reset in progress
> >>>>  drm/i915: Disable warnings for TDR interruptions in the display driver.
> >>>>  drm/i915: Reinstate hang recovery work queue.
> >>>>  drm/i915: Watchdog timeout support for gen8.
> >>>>  drm/i915: Fake lost context interrupts through forced CSB check.
> >>>>  drm/i915: Debugfs interface for per-engine hang recovery.
> >>>>  drm/i915: TDR/watchdog trace points.
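To make the consistency checking described under Feature 3 (and the "Fake lost context interrupts through forced CSB check" patch in the list above) a little more concrete, here is a rough sketch of the idea. Every name in it is hypothetical; it is not the code from the series.

/* Illustrative sketch only -- hypothetical names, not code from these patches. */
#define CONSISTENCY_THRESHOLD 3    /* arbitrary: how many periodic checks to tolerate */

struct hypothetical_engine {
    unsigned int inconsistent_checks;    /* consecutive inconsistent observations */
    /* ... */
};

/* Assumed helpers: the driver-side and hardware-side views of the engine,
 * plus a way to run the execlist interrupt handler as if a context event
 * interrupt had arrived. */
static bool driver_thinks_engine_is_busy(struct hypothetical_engine *engine);
static bool engine_hw_is_idle(struct hypothetical_engine *engine);
static void fake_context_event_interrupt(struct hypothetical_engine *engine);

/* Called from the periodic hang checker alongside the normal hang check. */
static void check_submission_consistency(struct hypothetical_engine *engine)
{
    if (!driver_thinks_engine_is_busy(engine) || !engine_hw_is_idle(engine)) {
        engine->inconsistent_checks = 0;    /* both views agree */
        return;
    }

    /* The driver believes a context is still executing but the hardware is
     * idle: suspect a lost context completion interrupt, but only act once
     * the state has persisted for a while. */
    if (++engine->inconsistent_checks < CONSISTENCY_THRESHOLD)
        return;

    /* The CSB is expected to be up to date even though the interrupt was
     * lost, so processing it now lets the driver retire the completed
     * context and move on to the next one in the execlist queue. */
    fake_context_event_interrupt(engine);
    engine->inconsistent_checks = 0;
}

The key assumption, stated in the cover letter, is that the CSB still reflects the completed context even when the corresponding interrupt never arrived, so running the normal execlist interrupt-handling path is enough to resynchronise driver and hardware.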
> >>>>
> >>>> drivers/gpu/drm/i915/i915_debugfs.c | 146 +++++-
> >>>> drivers/gpu/drm/i915/i915_dma.c | 79 +++
> >>>> drivers/gpu/drm/i915/i915_drv.c | 201 ++++++++
> >>>> drivers/gpu/drm/i915/i915_drv.h | 91 +++-
> >>>> drivers/gpu/drm/i915/i915_gem.c | 93 +++-
> >>>> drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
> >>>> drivers/gpu/drm/i915/i915_irq.c | 378 ++++++++++++--
> >>>> drivers/gpu/drm/i915/i915_params.c | 10 +
> >>>> drivers/gpu/drm/i915/i915_reg.h | 13 +
> >>>> drivers/gpu/drm/i915/i915_trace.h | 298 +++++++++++
> >>>> drivers/gpu/drm/i915/intel_display.c | 16 +-
> >>>> drivers/gpu/drm/i915/intel_lrc.c | 858 ++++++++++++++++++++++++++++++-
> >>>> drivers/gpu/drm/i915/intel_lrc.h | 16 +-
> >>>> drivers/gpu/drm/i915/intel_lrc_tdr.h | 40 ++
> >>>> drivers/gpu/drm/i915/intel_ringbuffer.c | 87 +++-
> >>>> drivers/gpu/drm/i915/intel_ringbuffer.h | 109 ++++
> >>>> drivers/gpu/drm/i915/intel_uncore.c | 241 ++++++++-
> >>>> include/uapi/drm/i915_drm.h | 5 +-
> >>>> 18 files changed, 2589 insertions(+), 94 deletions(-)
> >>>> create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
> >>>>
> >>>>--
> >>>>1.7.9.5

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
