On Tue, Jun 23, 2015 at 03:06:49PM +0100, Tomas Elf wrote:
> On 23/06/2015 12:38, Daniel Vetter wrote:
> >On Tue, Jun 23, 2015 at 11:47:16AM +0100, Tomas Elf wrote:
> >>On 23/06/2015 11:05, Daniel Vetter wrote:
> >>>Your patches don't apply cleanly any more and I can't find a suitable baseline where they would. But I'd like to see it all in context to check a few things.
> >>>
> >>>Can you pls push a git branch with these somewhere?
> >>
> >>Here's the baseline for my local tree:
> >>cd07637 - drm-intel-nightly: 2015y-04m-13d-09h-46m-59s UTC integration manifest <[email protected]>
> >
> >I don't have that baseline around here (any more at least). Happens regularly with rebasing trees.
> >
> >>I haven't updated it in a while obviously since I thought that could wait until we'd worked our way through the RFC series and I could get to work on the first real patch series.
> >>
> >>Is it possible for you to set up a local tree of your own with my baseline and my RFC patches on top, or would you prefer it if I push my branch to drm-private?
> >
> >So yeah I need your branch ;-)
> 
> I pushed my branch, 20150608_TDR_upstream_adaptation_single-thread_hangchecking_RFC_delivery_sendmail_1, to drm-private:
> 
> https://git-amr-2.devtools.intel.com/gerrit/gitweb?p=otc_gen_graphics-drm-private.git;a=shortlog;h=refs/heads/20150608_TDR_upstream_adaptation_single-thread_hangchecking_RFC_delivery_sendmail_1
For public stuff something public is preferred, since then I don't have to needle it through the firewall, which tends to be a pain and all that. Personally I just have a priv remote pointing to something public, and whenever I need to upload something for someone I do a

$ git push priv +HEAD:for-$person

The point of all this is that I can do a

$ git fetch $url $branch
$ git checkout FETCH_HEAD

and look at the tree with the full power of local editors. I'll try and see whether I can coax gerrit into cooperation ...
-Daniel

> 
> Will that work or do you need something else?
> 
> Thanks,
> Tomas
> 
> >-Daniel
> >>
> >>Thanks,
> >>Tomas
> >>
> >>>Thanks, Daniel
> >>>
> >>>On Mon, Jun 08, 2015 at 06:03:18PM +0100, Tomas Elf wrote:
> >>>>This patch series introduces the following features:
> >>>>
> >>>>* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.
> >>>>
> >>>>TDR is an umbrella term for anything that goes into detecting and recovering from GPU hangs and is a term more widely used outside of the upstream driver. This feature introduces an extensible framework that currently supports gen8 but that can be easily extended to support gen7 as well (which is already the case in GMIN, but unfortunately in a not quite upstreamable form). The code contained in this submission represents the essentials of what is currently in GMIN merged with what is currently in upstream (as of the time when this work commenced a few months back).
> >>>>
> >>>>This feature adds a new hang recovery path alongside the legacy GPU reset path, which takes care of engine recovery only. Aside from adding support for per-engine recovery, this feature also introduces rules for how to promote a potential per-engine reset to a legacy, full GPU reset.
> >>>>
> >>>>The hang checker now integrates with the error handler in a slightly different way in that it allows hang recovery on multiple engines at the same time, by passing an engine flag mask to the error handler in which the flags of all hung engines are set. This allows us to schedule hang recovery once for all currently hung engines instead of one hang recovery per detected engine hang. Previously, when only full GPU reset was supported, this made no difference: whether one or four engines were hung at any given point, it would all amount to the same thing - the GPU getting reset. As it stands now, the behaviour differs depending on which engine is hung, since each engine is reset separately from all the other engines, so we have to think about this in terms of scheduling cost and recovery latency. (see open question below)
> >>>>
> >>>>OPEN QUESTIONS:
> >>>>
> >>>>   1. Do we want to investigate the possibility of per-engine hang detection? In the current upstream driver there is only one work queue that handles the hang checker, and everything from initial hang detection to final hang recovery runs in this thread. This makes sense if you're only supporting one form of hang recovery - using full GPU reset and nothing tied to any particular engine. However, as part of this patch series we're changing that by introducing per-engine hang recovery.
> >>>>   It could make sense to introduce multiple work queues - one per engine - to run multiple hang checking threads in parallel.
> >>>>
> >>>>   This would potentially allow savings in terms of recovery latency, since the hang checker would not have to scan all engines every time it is scheduled, and the error handler would not have to scan all engines every time it is scheduled either. Instead, we could implement one work queue per engine that invokes a hang checker which checks only _that_ particular engine, after which the error handler is invoked for _that_ particular engine. If one engine has hung, the latency for getting to the hang recovery path for that particular engine would be (time for hang checking one engine) + (time for error handling one engine), rather than the time it takes to do hang checking for all engines plus the time it takes to do error handling for all engines that have been detected as hung (which in the worst case would be all engines). There would potentially be as many hang checking and error handling threads going on concurrently as there are engines in the hardware, but they would all run in parallel without any significant locking. The first point at which any thread needs exclusive access to the driver is the actual hang recovery, but the time it takes to get there would theoretically be lower, and the time it actually takes to do per-engine hang recovery is quite a lot lower than the time it takes to reliably detect a hang.
> >>>>
> >>>>   How much we would save by such a change still needs to be analysed and compared against the current single-thread model, but it makes sense from a theoretical design point of view.
> >>>>
> >>>>   2. How does per-engine reset integrate with the public reset stats IOCTL? These stats are used for the GL robustness interface, and those tests are currently failing when running per-engine hang recovery, since we treat per-engine recovery differently from full GPU recovery, which is something userland knows nothing about. When userland expects to hang the hardware it expects the reset stats interface to reflect this, and that behaviour has changed as part of this code submission. There's more than one way to solve this. Here are two options:
> >>>>
> >>>>      1. Expose per-engine reset statistics and mark contexts as guilty the same way for per-engine resets as for full GPU resets.
> >>>>
> >>>>         That would make this change to the hang recovery mechanism transparent to userland, but it would change the semantics, since an active context in the reset stats no longer implies that the GPU was fully reset.
> >>>>
> >>>>      2. Add a new set of statistics for per-engine reset (one group of statistics for each engine) to reflect the extended capabilities that per-engine hang recovery offers.
> >>>>
> >>>>         Would that be breaking the ABI?
> >>>>
> >>>>         ... Or is there some other way of doing this?
> >>>>
> >>>>* Feature 2: Watchdog Timeout (a.k.a. "media engine reset") for gen8.
> >>>>
> >>>>This feature allows userland applications to control whether or not individual batch buffers should have a first-level, fine-grained, hardware-based hang detection mechanism on top of the ordinary, software-based periodic hang checker that is already in the driver.
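As an illustration of the per-engine hang checking idea raised in open question 1 above (and of the engine flag mask described under Feature 1), here is a minimal sketch of one hang-check work item per engine. Every name in it is hypothetical; it is not code from the patch series and only serves to make the scheduling-cost discussion more concrete.

/* Illustrative sketch only -- hypothetical names, not part of these patches. */
#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>

#define HANGCHECK_PERIOD_MS 1500    /* roughly the period of the current hang checker */

struct hypothetical_engine {
    unsigned int id;                        /* bit position in the engine flag mask */
    struct delayed_work hangcheck_work;     /* one hang-check work item per engine */
};

/* Assumed helpers: per-engine hang detection and an error handler entry
 * point that takes a mask of hung engines, as described under Feature 1. */
static bool engine_seems_hung(struct hypothetical_engine *engine);
static void schedule_hang_recovery(unsigned int hung_engine_mask);

static void engine_hangcheck_fn(struct work_struct *work)
{
    struct hypothetical_engine *engine =
        container_of(to_delayed_work(work),
                     struct hypothetical_engine, hangcheck_work);

    /* Only this engine is inspected, so the check stays cheap and can run
     * concurrently with the other engines' checks. */
    if (engine_seems_hung(engine))
        schedule_hang_recovery(1u << engine->id);

    /* Re-arm hang checking for this engine only. */
    schedule_delayed_work(&engine->hangcheck_work,
                          msecs_to_jiffies(HANGCHECK_PERIOD_MS));
}

The trade-off raised in the question shows up directly in this shape: each engine pays only for its own check and recovery scheduling, but the number of concurrently running work items grows with the number of engines, and exclusive access to the driver is still needed once actual recovery starts.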
> >>>>The advantage over relying solely on the current software-based hang checker is that the watchdog timeout mechanism is about 1000x quicker and more precise. Since it is not a full driver-level hang detection mechanism but only targets one individual batch buffer at a time, it can afford to be that quick without risking an increase in false positive hang detection.
> >>>>
> >>>>This feature includes the following changes:
> >>>>
> >>>>a) Watchdog timeout interrupt service routine for handling watchdog interrupts and connecting these to per-engine hang recovery.
> >>>>
> >>>>b) Injection of watchdog timer enablement/cancellation instructions before/after the batch buffer start instruction in the ring buffer, so that watchdog timeout is tied to the submission of an individual batch buffer.
> >>>>
> >>>>c) Extension of the DRM batch buffer interface, exposing the watchdog timeout feature to userland. We've got two open source groups in VPG currently in the process of integrating support for this feature, which should make it possible, in principle, to upstream this extension.
> >>>>
> >>>>There is currently full watchdog timeout support for gen7 in GMIN and it is quite similar to the gen8 implementation, so there is nothing obvious that prevents us from upstreaming that code along with the gen8 code. However, watchdog timeout is fully dependent on the per-engine hang recovery path, and that is not part of this code submission for gen7. Therefore watchdog timeout support for gen7 has been excluded until per-engine hang recovery support for gen7 has landed upstream.
> >>>>
> >>>>As part of this submission we've had to reinstate the work queue that was previously in place between the error handler and the hang recovery path. The reason for this is that the per-engine recovery path is called directly from the interrupt handler in the case of watchdog timeout. In that situation there is no way of grabbing the struct_mutex, which is a requirement for the hang recovery path. Therefore, by reinstating the work queue we provide a unified execution context for the hang recovery code, allowing it to grab whatever locks it needs without sacrificing interrupt latency too much or sleeping indefinitely in hard interrupt context.
> >>>>
> >>>>* Feature 3: Context Submission Status Consistency checking
> >>>>
> >>>>Something that becomes apparent when you run long-duration operations tests with concurrent rendering processes and intermittently injected hangs is that the GPU sometimes seems to forget to send context completion interrupts to the driver. What this means is that the driver sometimes gets stuck on a context that never seems to finish, all the while the hardware has completed it and is waiting for more work.
> >>>>
> >>>>The problem with this is that the per-engine hang recovery path relies on context resubmission to kick off the hardware again following an engine reset. This can only be done safely if the hardware and driver share the same opinion about the current state.
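As an aside on the reinstated hang recovery work queue mentioned under Feature 2 above: the underlying pattern is deferral from hard interrupt context, where struct_mutex cannot be taken, to process context, where it can. The sketch below illustrates only that general pattern; all names are hypothetical, and in the real driver the watchdog interrupt is handled as part of the existing interrupt handling rather than by a stand-alone handler.

/* Illustrative sketch only -- hypothetical names, not the driver code. */
#include <linux/interrupt.h>
#include <linux/workqueue.h>
#include <linux/mutex.h>
#include <linux/bitops.h>

#define HYPOTHETICAL_NUM_ENGINES 4

struct hypothetical_recovery {
    struct work_struct work;        /* the reinstated hang recovery work */
    unsigned long pending_engines;  /* one bit per engine awaiting recovery */
    struct mutex *struct_mutex;     /* the lock the recovery path needs */
};

/* Process context: sleeping and taking struct_mutex are allowed here. */
static void recovery_work_fn(struct work_struct *work)
{
    struct hypothetical_recovery *r =
        container_of(work, struct hypothetical_recovery, work);
    unsigned int id;

    mutex_lock(r->struct_mutex);
    for (id = 0; id < HYPOTHETICAL_NUM_ENGINES; id++) {
        if (!test_and_clear_bit(id, &r->pending_engines))
            continue;
        /* ... per-engine reset and context resubmission for engine 'id' ... */
    }
    mutex_unlock(r->struct_mutex);
}

/* Hard interrupt context: must not sleep, so only record which engine's
 * watchdog fired and kick the work item. */
static irqreturn_t watchdog_irq_handler(int irq, void *arg)
{
    struct hypothetical_recovery *r = arg;
    unsigned int hung_engine_id = 0;    /* e.g. decoded from the interrupt status */

    set_bit(hung_engine_id, &r->pending_engines);
    schedule_work(&r->work);

    return IRQ_HANDLED;
}

The interrupt handler is reduced to recording which engine fired and queuing the work item, while the work function is free to sleep and take whatever locks the recovery path requires, which is exactly the trade-off the cover letter describes.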
> >>>>Therefore we've extended the periodic hang checker to check for context submission state inconsistencies, in addition to the hang checking it already does.
> >>>>
> >>>>If such a state is detected, it is assumed (based on experience) that a context completion interrupt has been lost somehow. If this state persists for some time, an attempt to correct it is made by faking the presumably lost context completion interrupt: the execlist interrupt handler, which is normally called from the main interrupt handler in response to a received context event interrupt, is called manually. Just because an interrupt goes missing does not mean that the context status buffer (CSB) does not get appropriately updated by the hardware, which means that we can expect to find all the recent changes to the context states for each engine captured there. If there are outstanding context status changes in store there, the faked context event interrupt will allow the interrupt handler to act on them. In the case of lost context completion interrupts this will prompt the driver to remove the already completed context from the execlist queue and move on to the next pending piece of work, thereby eliminating the inconsistency.
> >>>>
> >>>>* Feature 4: Debugfs extensions for per-engine hang recovery and TDR/watchdog trace points.
> >>>>
> >>>>
> >>>>Tomas Elf (11):
> >>>>  drm/i915: Early exit from semaphore_waits_for for execlist mode.
> >>>>  drm/i915: Introduce uevent for full GPU reset.
> >>>>  drm/i915: Add reset stats entry point for per-engine reset.
> >>>>  drm/i915: Adding TDR / per-engine reset support for gen8.
> >>>>  drm/i915: Extending i915_gem_check_wedge to check engine reset in progress
> >>>>  drm/i915: Disable warnings for TDR interruptions in the display driver.
> >>>>  drm/i915: Reinstate hang recovery work queue.
> >>>>  drm/i915: Watchdog timeout support for gen8.
> >>>>  drm/i915: Fake lost context interrupts through forced CSB check.
> >>>>  drm/i915: Debugfs interface for per-engine hang recovery.
> >>>>  drm/i915: TDR/watchdog trace points.
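To make the consistency checking described under Feature 3 (and the "Fake lost context interrupts through forced CSB check" patch in the list above) a little more concrete, here is a rough sketch of the idea. Every name in it is hypothetical; it is not the code from the series.

/* Illustrative sketch only -- hypothetical names, not code from these patches. */
#define CONSISTENCY_THRESHOLD 3    /* arbitrary: how many periodic checks to tolerate */

struct hypothetical_engine {
    unsigned int inconsistent_checks;    /* consecutive inconsistent observations */
    /* ... */
};

/* Assumed helpers: the driver-side and hardware-side views of the engine,
 * plus a way to run the execlist interrupt handler as if a context event
 * interrupt had arrived. */
static bool driver_thinks_engine_is_busy(struct hypothetical_engine *engine);
static bool engine_hw_is_idle(struct hypothetical_engine *engine);
static void fake_context_event_interrupt(struct hypothetical_engine *engine);

/* Called from the periodic hang checker alongside the normal hang check. */
static void check_submission_consistency(struct hypothetical_engine *engine)
{
    if (!driver_thinks_engine_is_busy(engine) || !engine_hw_is_idle(engine)) {
        engine->inconsistent_checks = 0;    /* both views agree */
        return;
    }

    /* The driver believes a context is still executing but the hardware is
     * idle: suspect a lost context completion interrupt, but only act once
     * the state has persisted for a while. */
    if (++engine->inconsistent_checks < CONSISTENCY_THRESHOLD)
        return;

    /* The CSB is expected to be up to date even though the interrupt was
     * lost, so processing it now lets the driver retire the completed
     * context and move on to the next one in the execlist queue. */
    fake_context_event_interrupt(engine);
    engine->inconsistent_checks = 0;
}

The key assumption, stated in the cover letter, is that the CSB still reflects the completed context even when the corresponding interrupt never arrived, so running the normal execlist interrupt-handling path is enough to resynchronise driver and hardware.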
> >>>>
> >>>> drivers/gpu/drm/i915/i915_debugfs.c | 146 +++++-
> >>>> drivers/gpu/drm/i915/i915_dma.c | 79 +++
> >>>> drivers/gpu/drm/i915/i915_drv.c | 201 ++++++++
> >>>> drivers/gpu/drm/i915/i915_drv.h | 91 +++-
> >>>> drivers/gpu/drm/i915/i915_gem.c | 93 +++-
> >>>> drivers/gpu/drm/i915/i915_gpu_error.c | 2 +-
> >>>> drivers/gpu/drm/i915/i915_irq.c | 378 ++++++++++++--
> >>>> drivers/gpu/drm/i915/i915_params.c | 10 +
> >>>> drivers/gpu/drm/i915/i915_reg.h | 13 +
> >>>> drivers/gpu/drm/i915/i915_trace.h | 298 +++++++++++
> >>>> drivers/gpu/drm/i915/intel_display.c | 16 +-
> >>>> drivers/gpu/drm/i915/intel_lrc.c | 858 ++++++++++++++++++++++++++++++-
> >>>> drivers/gpu/drm/i915/intel_lrc.h | 16 +-
> >>>> drivers/gpu/drm/i915/intel_lrc_tdr.h | 40 ++
> >>>> drivers/gpu/drm/i915/intel_ringbuffer.c | 87 +++-
> >>>> drivers/gpu/drm/i915/intel_ringbuffer.h | 109 ++++
> >>>> drivers/gpu/drm/i915/intel_uncore.c | 241 ++++++++-
> >>>> include/uapi/drm/i915_drm.h | 5 +-
> >>>> 18 files changed, 2589 insertions(+), 94 deletions(-)
> >>>> create mode 100644 drivers/gpu/drm/i915/intel_lrc_tdr.h
> >>>>
> >>>>--
> >>>>1.7.9.5

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
_______________________________________________
Intel-gfx mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/intel-gfx
