Re: TDR and VRAM lost handling in KMD (v2)

Nicolai Hähnle Thu, 12 Oct 2017 05:18:04 -0700

On 12.10.2017 13:49, Christian König wrote:

Am 12.10.2017 um 13:37 schrieb Liu, Monk:
Hi team
Very good, many policy and implement are agreed, looks we only havesome arguments in amdgpu_ctx_query(), well I also confused with thecurrent implement of it, ☹First, I want to know if you guys agree that we*don't updatectx->reset_counter in amdgpu_ctx_query() *? because I want to make thequery result always consistent upon a given context,
That sounds like a good idea to me, but I'm not sure if it won't breakuserspace (I don't think so). Nicolai or Marek need to comment.


That should be fine for Mesa.

Second, I want to say that for KERNEL, it shouldn't use the term fromMESA or OGL or VULKAN, e.g. kernel shouldn't use AMDGPU_INNOCENT_RESETto map to GL_INNOCENT_RESET_ARB, etc...Because that way kernel will be very limited to certain UMD, so Isuggest we totally re-name the context status, and each UMD has itsown way to map the kernel context's result to gl-context/vk-context/etc…
Yes, completely agree.
Kernel should only provide below **FLAG bits** on a given context:

  * Define AMDGPU_CTX_STATE_GUILTY 0x1             //as long as TDR
    detects a job hang, KMD set the context behind this context as
    "guilty"
  * Define AMDGPU_CTX_STATE_VRAMLOST         0x2      //as long as
    there is a VRAM lost event hit after this context created, we mark
    this context "VRAM_LOST", so UMD can say that all BO under this
    context may lost their content, since kernel have no relationship
    between context and BO so this is UMD's call to judge if a BO
    considered "VRAM lost" or not.
  * Define AMDGPU_CTX_STATE_RESET   0x3     //as long as there is a
    gpu reset occurred after context creation, this flag shall be set
That sounds sane, but unfortunately might not be possible with theexisting IOCTL. Keep in mind that we need to keep backward compatibilityhere.

Agreed. FWIW, when Mesa sees an unknown (new) value being returned fromamdgpu_ctx_query, it treats it the same as AMDGPU_CTX_NO_RESET.


Cheers,
Nicolai

Sample code:
Int amdgpu_ctx_query(struct amdgpu_ctx_query_parm * out, …..) {
        if (ctx- >vram_lost_counter != adev->vram_lost_counter)
                out- >status |= AMDGPU_CTX_STATE_VRAM_LOST;
if (ctx- >reset_counter != adev→reset_counter){
out- >status |= AMDGPU_CTX_STATE_RESET;
if (ctx- >guilty == TRUE)
                        out- >status |= AMDGPU_CTX_STATE_GUILTY;
}
return 0;
}

For UMD if it found "out.status == 0" means there is no gpu reset evenoccurred, the context is totally regular,


  * *A****new IOCTL added for context:*

Void amdgpu_ctx_reinit(){
        Ctx→vram_lost_counter = adev→vram_lost_counter;
        Ctx→reset_counter = adev→reset_counter;
}


Mhm, is there any advantage to just creating a new context?

Regards,
Christian.

*if**UMD decide *not* to release the "guilty" context but continueusing it **after**UMD acknowledged GPU hang **on certain job/context,I suggest **UMD call "amdgpu_ctx_reinit()":*That way after you re-init() this context, you can get updated resultfrom "amdgpu_ctx_query", which will probably give you "out.status ==0" as long as no gpu reset/vram lost hit after re-init().

BR Monk
-----Original Message-----
From: Koenig, Christian
Sent: 2017年10月12日 18:13

To: Haehnle, Nicolai <[email protected]>; Michel Dänzer<[email protected]>; Liu, Monk <[email protected]>; Olsak, Marek<[email protected]>; Deucher, Alexander <[email protected]>;Zhou, David(ChunMing) <[email protected]>; Mao, David<[email protected]>Cc: Ramirez, Alejandro <[email protected]>;[email protected]; Filipas, Mario <[email protected]>;Ding, Pixel <[email protected]>; Li, Bingley <[email protected]>;Jiang, Jerry (SW) <[email protected]>

Subject: Re: TDR and VRAM lost handling in KMD (v2)
Am 12.10.2017 um 11:44 schrieb Nicolai Hähnle:
> On 12.10.2017 11:35, Michel Dänzer wrote:
>> On 12/10/17 11:23 AM, Christian König wrote:
>>> Am 12.10.2017 um 11:10 schrieb Nicolai Hähnle:
>>>> On 12.10.2017 10:49, Christian König wrote:
>>>>>> However, !guilty && ctx->reset_counter != adev->reset_counter
>>>>>> does not imply that the context was lost.
>>>>>>
>>>>>> The way I understand it, we should return
>>>>>> AMDGPU_CTX_INNOCENT_RESET if !guilty && ctx->vram_lost_counter != 
adev->vram_lost_counter.
>>>>>>
>>>>>> As far as I understand it, the case of !guilty &&
>>>>>> ctx->reset_counter != adev->reset_counter &&
>>>>>> ctx->vram_lost_counter ==
>>>>>> adev->vram_lost_counter should return AMDGPU_CTX_NO_RESET,
>>>>>> adev->because a
>>>>>> GPU reset occurred, but it didn't affect our context.
>>>>> I disagree on that.
>>>>>
>>>>> AMDGPU_CTX_INNOCENT_RESET just means what it does currently, there
>>>>> was a reset but we haven't been causing it.
>>>>>
>>>>> That the OpenGL extension is specified otherwise is unfortunate,
>>>>> but I think we shouldn't use that for the kernel interface here.
>>>> Two counterpoints:
>>>>
>>>> 1. Why should any application care that there was a reset while it
>>>> was idle? The OpenGL behavior is what makes sense.
>>>
>>> The application is certainly not interest if a reset happened or
>>> not, but I though that the driver stack might be.
>>>
>>>>
>>>> 2. AMDGPU_CTX_INNOCENT_RESET doesn't actually mean anything today
>>>> because we never return it :)
>>>>
>>>
>>> Good point.
>>>
>>>> amdgpu_ctx_query only ever returns AMDGPU_CTX_UNKNOWN_RESET, which
>>>> is in line with the OpenGL spec: we're conservatively returning
>>>> that a reset happened because we don't know whether the context was
>>>> affected, and we return UNKNOWN because we also don't know whether
>>>> the context was guilty or not.
>>>>
>>>> Returning AMDGPU_CTX_NO_RESET in the case of !guilty &&
>>>> ctx->vram_lost_counter == adev->vram_lost_counter is simply a
>>>> refinement and improvement of the current, overly conservative
>>>> behavior.
>>>
>>> Ok let's reenumerate what I think the different return values should
>>> mean:
>>>
>>> * AMDGPU_CTX_GUILTY_RESET
>>>
>>> guilty is set to true for this context.
>>>
>>> * AMDGPU_CTX_INNOCENT_RESET
>>>
>>> guilty is not set and vram_lost_counter has changed.
>>>
>>> * AMDGPU_CTX_UNKNOWN_RESET
>>>
>>> guilty is not set and vram_lost_counter has not changed, but
>>> gpu_reset_counter has changed.
>>
>> I don't understand the distinction you're proposing between
>> AMDGPU_CTX_INNOCENT_RESET and AMDGPU_CTX_UNKNOWN_RESET. I think both

>> cases you're describing should return either>> AMDGPU_CTX_INNOCENT_RESET, if the value of guilty is reliable, or

>> AMDGPU_CTX_UNKNOWN_RESET if it's not.
>

> I think it'd make more sense if it was called> "AMDGPU_CTX_UNAFFECTED_RESET".

>
> So:
> - AMDGPU_CTX_GUILTY_RESET --> the context was affected by a reset, and
> we know that it's the context's fault
> - AMDGPU_CTX_INNOCENT_RESET --> the context was affected by a reset,
> and we know that it *wasn't* the context's fault (no context job
> active)
> - AMDGPU_CTX_UNKNOWN_RESET --> the context was affected by a reset,
> and we don't know who's responsible (this could be returned in the
> unlikely case where context A's gfx job has not yet finished, but
> context B's gfx job has already started; it could be the fault of A,
> it could be the fault of B -- which somehow manages to hang a part of
> the hardware that then prevents A's job from finishing -- or it could
> be both; but it's a bit academic)
> - AMDGPU_CTX_UNAFFECTED_RESET --> there was a reset, but this context
> wasn't affected
>
> This last value would currently just be discarded by Mesa (because we
> should only disturb applications when we have to), but perhaps
> somebody else could find it useful?
Yes, that's what I had in mind as well.

Cause otherwise we would return AMDGPU_CTX_NO_RESET while thereactually was a reset and that certainly doesn't sound correct to me.

Regards,
Christian.
>
> Cheers,
> Nicolai


_______________________________________________
amd-gfx mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

Re: TDR and VRAM lost handling in KMD (v2)

Reply via email to