On Tue, Nov 15, 2016 at 04:36:33PM +0200, Mika Kuoppala wrote:
> As hangcheck score was removed, the active decay of score
> was removed also. This removed feature for hangcheck to detect
> if the gpu client was accidentally or maliciously causing intermittent
> hangs. Reinstate the scoring as a per context property, so that if
> one context starts to act unfavourably, ban it.
>
> v2: ban_period_secs as a gate to score check (Chris)
>
> Cc: Chris Wilson <[email protected]>
> Signed-off-by: Mika Kuoppala <[email protected]>
> - elapsed = get_seconds() - ctx->hang_stats.guilty_ts;
> - if (ctx->hang_stats.ban_period_seconds &&
> - elapsed <= ctx->hang_stats.ban_period_seconds) {
> + if (!hs->ban_period_seconds)
> + return false;
> +
> + elapsed = get_seconds() - hs->guilty_ts;
> + if (elapsed <= hs->ban_period_seconds) {
> DRM_DEBUG("context hanging too fast, banning!\n");
> return true;
> }
>
> + if (hs->ban_score >= 40) {
> + DRM_DEBUG("context hanging too often, banning!\n");
> + return true;
> + }
> +
> return false;
> }
> + hs->ban_score += 10;
This pair (the ban threshold of 40 and the per-hang increment of 10)
should be tunables (i.e. macros somewhere sensible).
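Something along these lines, as a rough standalone sketch rather than
a concrete proposal. The macro names and the cut-down struct are
hypothetical; only the values 10 and 40 and the order of the checks
come from the patch, and the current time is passed in here instead of
calling get_seconds() so the snippet builds outside the kernel:

```c
#include <stdbool.h>

/* Hypothetical macro names; the values are the ones used in the patch. */
#define CONTEXT_SCORE_GUILTY		10
#define CONTEXT_SCORE_BAN_THRESHOLD	40

/* Cut-down stand-in for the per-context hang statistics (assumption). */
struct hang_stats {
	unsigned int ban_score;
	unsigned long guilty_ts;
	unsigned long ban_period_seconds;
};

/* Mirrors the quoted check: ban_period_seconds gates both tests. */
static bool context_is_banned(const struct hang_stats *hs, unsigned long now)
{
	if (!hs->ban_period_seconds)
		return false;

	/* Hanging again too soon after the previous hang. */
	if (now - hs->guilty_ts <= hs->ban_period_seconds)
		return true;

	/* Hanging too often overall. */
	return hs->ban_score >= CONTEXT_SCORE_BAN_THRESHOLD;
}

/* Each hang attributed to the context bumps its score. */
static void context_mark_guilty(struct hang_stats *hs)
{
	hs->ban_score += CONTEXT_SCORE_GUILTY;
}
```

With these values, four hangs with no retirement in between reach the
threshold.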
> diff --git a/drivers/gpu/drm/i915/i915_gem_request.c
> b/drivers/gpu/drm/i915/i915_gem_request.c
> index b9b5253..095c809 100644
> --- a/drivers/gpu/drm/i915/i915_gem_request.c
> +++ b/drivers/gpu/drm/i915/i915_gem_request.c
> @@ -204,6 +204,10 @@ static void i915_gem_request_retire(struct
> drm_i915_gem_request *request)
>
> trace_i915_gem_request_retire(request);
>
> + /* Retirement decays the ban score as it is a sign of ctx progress */
> + if (request->ctx->hang_stats.ban_score > 0)
> + request->ctx->hang_stats.ban_score--;
Please put this along with the other request->ctx updates (i.e. after
request->previous_context and before the context_put).
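To spell out the decay on its own, here is a minimal standalone sketch.
The helper name and the cut-down struct are hypothetical; only the
guarded decrement is taken from the hunk above:

```c
/* Cut-down stand-in for the per-context hang statistics (assumption). */
struct hang_stats {
	unsigned int ban_score;
};

/*
 * Called once per retired request: forward progress forgives past
 * hangs one point at a time, and the guard keeps the unsigned score
 * from wrapping below zero.
 */
static void hang_stats_retire(struct hang_stats *hs)
{
	if (hs->ban_score > 0)
		hs->ban_score--;
}
```

Since each hang adds 10 and each retirement removes 1, a context has to
retire ten requests to work off a single hang.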
Otherwise lgtm.
-Chris
--
Chris Wilson, Intel Open Source Technology Centre