Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

Lazar, Lijo Wed, 15 Jan 2025 00:30:07 -0800

On 1/14/2025 10:51 PM, Kim, Jonathan wrote:
> [Public]
> 
>> -----Original Message-----
>> From: Lazar, Lijo <[email protected]>
>> Sent: Friday, January 10, 2025 10:37 PM
>> To: Kim, Jonathan <[email protected]>; [email protected]
>> Cc: Kasiviswanathan, Harish <[email protected]>
>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue 
>> reset
>>
>>
>>
>> On 1/11/2025 2:53 AM, Kim, Jonathan wrote:
>>> [Public]
>>>
>>>> -----Original Message-----
>>>> From: Lazar, Lijo <[email protected]>
>>>> Sent: Friday, January 10, 2025 11:29 AM
>>>> To: Kim, Jonathan <[email protected]>; [email protected]
>>>> Cc: Kasiviswanathan, Harish <[email protected]>
>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue 
>>>> reset
>>>>
>>>>
>>>>
>>>> On 1/10/2025 9:43 PM, Kim, Jonathan wrote:
>>>>> [Public]
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Lazar, Lijo <[email protected]>
>>>>>> Sent: Thursday, January 9, 2025 10:39 PM
>>>>>> To: Kim, Jonathan <[email protected]>; [email protected]
>>>>>> Cc: Kasiviswanathan, Harish <[email protected]>
>>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue
>> reset
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 1/9/2025 8:27 PM, Kim, Jonathan wrote:
>>>>>>> [Public]
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Lazar, Lijo <[email protected]>
>>>>>>>> Sent: Thursday, January 9, 2025 1:14 AM
>>>>>>>> To: Kim, Jonathan <[email protected]>; amd-
>> [email protected]
>>>>>>>> Cc: Kasiviswanathan, Harish <[email protected]>
>>>>>>>> Subject: Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per 
>>>>>>>> queue
>>>> reset
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/9/2025 1:31 AM, Jonathan Kim wrote:
>>>>>>>>> Per queue reset should be bypassed when gpu recovery is disabled
>>>>>>>>> with module parameter.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Jonathan Kim <[email protected]>
>>>>>>>>> ---
>>>>>>>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c | 6 ++++++
>>>>>>>>>  1 file changed, 6 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>> index cc66ebb7bae1..441568163e20 100644
>>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gfx_v9.c
>>>>>>>>> @@ -1131,6 +1131,9 @@ uint64_t kgd_gfx_v9_hqd_get_pq_addr(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>>     uint32_t low, high;
>>>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>>>
>>>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>>>> +           return 0;
>>>>>>>>> +
>>>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>>>
>>>>>>>>> @@ -1179,6 +1182,9 @@ uint64_t kgd_gfx_v9_hqd_reset(struct
>>>>>> amdgpu_device
>>>>>>>> *adev,
>>>>>>>>>     uint32_t low, high, pipe_reset_data = 0;
>>>>>>>>>     uint64_t queue_addr = 0;
>>>>>>>>>
>>>>>>>>> +   if (!amdgpu_gpu_recovery)
>>>>>>>>> +           return 0;
>>>>>>>>> +
>>>>>>>>
>>>>>>>> I think the right place for this check is not inside callback, should 
>>>>>>>> be
>>>>>>>> from the place where the callback gets called.
>>>>>>>
>>>>>>> I don't think it really matters.  We're going to have different reset 
>>>>>>> types in the
>>>> future
>>>>>> that my come from different callers.
>>>>>>> It's probably easier to remember to put the bypass where the reset is 
>>>>>>> actually
>>>>>> happening.
>>>>>>>
>>>>>>
>>>>>> If I want to define something like amdgpu_gpu_recovery=2 (don't do queue
>>>>>> reset but perform other resets), then it matters.
>>>>>
>>>>> I don't get why that matters.
>>>>> This callback alone, for example, calls 2 types of resets within itself 
>>>>> (queue then
>>>> pipe).
>>>>> If you wanted partial resets between queue and pipe in this case, you'd 
>>>>> have to
>>>> branch test within the callback itself.
>>>>> GPU reset bypass checks are invisible to the KFD section of the code as 
>>>>> well.
>>>>>
>>>>>>
>>>>>> Since this is a callback, keeping it at the wrapper place may be more
>>>>>> maintainable rather than keeping the check for gfx10/11/12 etc.
>>>>>
>>>>> Maybe not.  MES is preemption checks are not like MEC preemption checks at
>> all.
>>>>> And we probably don't want to jam other future IP resets into a single 
>>>>> caller.
>>>>> If you look at the kfd2kgd callbacks, most are pretty much copy and paste 
>>>>> from
>> one
>>>> generation to another.
>>>>> I don't see how putting the test in the callback makes it less 
>>>>> maintainable.
>>>>>
>>>>
>>>> My thought process was this could be put in - reset_queues_on_hws_hang
>>>> and anything similar handles MES based queue resets. What you are saying
>>>> there won't be anything like reset_queue_by_mes() for MES based resets.
>>>> Is that correct?
>>>
>>> No the opposite.  But now we'd have to remember to put it in 2 places where
>> there's still no visible test for gpu reset bypass in the same file.
>>> SDMA resets are also being implemented and will probably have to be called 
>>> in
>> different places in the KFD as well.
>>> We can look at consolidating this later as more devices and IPs get done if 
>>> it
>> makes sense to abstract this stuff.
>>> My point is, the callback does the reset and returns a reset status.
>>> Bypassing by fail return seems easier to remember and leverage.
>>
>> Ok, we have SDMA queue reset called from job timeouts. If it's getting
>> called from KFD too, will look at consolidating that one.
>>
>> BTW, if this is returning a fake success, won't it result in a print
>> like queue reset succeeded which gives the false impression that queue
>> reset happened? Or, should it return a different error code like
>> 'ECANCELED' since operation is intentionally skipped through module param
> 
> Well ... the call is supposed to return an address of which queue got reset 
> where a null return indicates "no queue got reset".

We should also somehow indicate to the user that the queue indeed ran
into a reset situation. Not sure if that is taken care if address is
returned as NULL.

> I'm thinking to make this simpler, maybe we change reset_queues_on_hws into a 
> wrapper that takes in a queue type input (compute, sdma etc) and branches to 
> the right reset call based on input type.
> That way we only need 1 place to do the gpu_recovery enablement check in the 
> KFD, and the KFD has the flexibility to call this wrapper where ever it wants 
> to without having to worry about the module parameter status in the future.

Yes, this was the intention of original comment - to add at caller place
rather than inside callback. It provides a one-place (or fewer places)
control of the usage.

While at it, suggest to use amdgpu_device_should_recover_gpu(). This
will give an info message if recovery is disabled, and we could control
different meanings of this param (Ex: gpu_recovery = 2, avoid sdma queue
reset alone) through the same function.

Thanks,
Lijo

> 
> Jon
> 
>>
>> Thanks,
>> Lijo
>>
>>> That being said, putting the test in hqd_get_pq_addr was probably overkill, 
>>> but I
>> don't think anyone really cares to use it with gpu recovery off on its own 
>> at the
>> moment.
>>>
>>> Jon
>>>
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>> Jon
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>> Jon
>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Lijo
>>>>>>>>
>>>>>>>>>     kgd_gfx_v9_acquire_queue(adev, pipe_id, queue_id, inst);
>>>>>>>>>     amdgpu_gfx_rlc_enter_safe_mode(adev, inst);
>>>>>>>>>
>>>>>>>
>>>>>
>>>
>
Re: [PATCH] drm/amdgpu: fix gpu recovery disable with per queue reset

Reply via email to