On 08.08.25 16:13, Felix Kuehling wrote:
> 
> On 2025-08-08 5:21, Christian König wrote:
>>
>> On 08.08.25 05:14, Heng Zhou wrote:
>>> If a amdgpu_bo_va is fpriv->prt_va, the bo of this one is always NULL.
>>> So, such kind of amdgpu_bo_va should be updated separately before
>>> amdgpu_vm_handle_moved.
>>>
>>> Signed-off-by: Heng Zhou <[email protected]>
>>> ---
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++++
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c           | 8 ++++++++
>>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h           | 1 +
>>>  3 files changed, 15 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> index 37d8a7034a7e..e795b2970620 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>>> @@ -2970,6 +2970,12 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void 
>>> *info, struct dma_fence __rcu *
>>>             struct amdgpu_device *adev = amdgpu_ttm_adev(
>>>                     peer_vm->root.bo->tbo.bdev);
>>>  
>>> +           ret = amdgpu_vm_handle_prt_moved(adev,peer_vm);
>>> +           if (ret) {
>>> +                   pr_debug("Memory eviction: handle PRT moved failed. Try 
>>> again\n");
>>> +                   goto validate_map_fail;
>>> +           }
>>> +
>>>             ret = amdgpu_vm_handle_moved(adev, peer_vm, &exec.ticket);
>>>             if (ret) {
>>>                     pr_debug("Memory eviction: handle moved failed. Try 
>>> again\n");
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> index 283dd44f04b0..2c2a93f22ba0 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>>> @@ -1569,6 +1569,14 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>>>  
>>>  }
>>>  
>>> +int amdgpu_vm_handle_prt_moved(struct amdgpu_device *adev, struct 
>>> amdgpu_vm *vm)
>>> +{
>>> +   struct amdgpu_fpriv *fpriv;
>>> +
>>> +   fpriv = container_of(vm, struct amdgpu_fpriv, vm);
>> That's an absolute no-go from inside the VM code.
>>
>> You could maybe do that in the KFD code, but Felix and Philip need to decide 
>> that.
> 
> We're calling amdgpu_vm_handle_moved in 
> amdgpu_amdkfd_gpuvm_restore_process_bos to update mappings of BOs that are 
> not managed by KFD. I don't think we care about PRT BOs, though. They are not 
> useful for compute work.
> 
> What is the problem you're trying to fix here? Is amdgpu_vm_handle_moved 
> crashing because of the PRT BO? In that case I'd try to make 
> amdgpu_vm_handle_moved more robust to handle that situation gracefully (i.e. 
> ignoring PRT BOs).

The issue is that amdgpu_vm_handle_moved() only takes care of moved BOs, but 
not PRT mappings since they can never "move" because they don't have an 
associated BO.

But after a GPU reset the all mappings are marked invalid to re-create the page 
tables, so the PRT mappings ended up on the "moved" list as well and that is 
not handled by amdgpu_vm_handle_moved() so you end up in a crash.

But the real question is how we end up in this situation in the first place? A 
GPU reset should kill the KFD process, so there should never be a restore 
attempted.

Regards,
Christian.

> 
> Regards,
>   Felix
> 
> 
>>
>> Regards,
>> Christian.
>>
>>> +   return amdgpu_vm_bo_update(adev, fpriv->prt_va, false);
>>> +}
>>> +
>>>  /**
>>>   * amdgpu_vm_handle_moved - handle moved BOs in the PT
>>>   *
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> index f9549f6b3d1f..853a66f9fd2e 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>>> @@ -504,6 +504,7 @@ int amdgpu_vm_update_pdes(struct amdgpu_device *adev,
>>>  int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>>>                       struct amdgpu_vm *vm,
>>>                       struct dma_fence **fence);
>>> +int amdgpu_vm_handle_prt_moved(struct amdgpu_device *adev, struct 
>>> amdgpu_vm *vm);
>>>  int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
>>>                        struct amdgpu_vm *vm,
>>>                        struct ww_acquire_ctx *ticket);

Reply via email to