Re: [PATCH] drm/amdgpu: fix nullptr error of amdgpu_vm_handle_moved

Felix Kuehling Fri, 08 Aug 2025 07:13:31 -0700


On 2025-08-08 5:21, Christian König wrote:
>
> On 08.08.25 05:14, Heng Zhou wrote:
>> If a amdgpu_bo_va is fpriv->prt_va, the bo of this one is always NULL.
>> So, such kind of amdgpu_bo_va should be updated separately before
>> amdgpu_vm_handle_moved.
>>
>> Signed-off-by: Heng Zhou <[email protected]>
>> ---
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c           | 8 ++++++++
>>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h           | 1 +
>>  3 files changed, 15 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> index 37d8a7034a7e..e795b2970620 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
>> @@ -2970,6 +2970,12 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void 
>> *info, struct dma_fence __rcu *
>>              struct amdgpu_device *adev = amdgpu_ttm_adev(
>>                      peer_vm->root.bo->tbo.bdev);
>>  
>> +            ret = amdgpu_vm_handle_prt_moved(adev,peer_vm);
>> +            if (ret) {
>> +                    pr_debug("Memory eviction: handle PRT moved failed. Try 
>> again\n");
>> +                    goto validate_map_fail;
>> +            }
>> +
>>              ret = amdgpu_vm_handle_moved(adev, peer_vm, &exec.ticket);
>>              if (ret) {
>>                      pr_debug("Memory eviction: handle moved failed. Try 
>> again\n");
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> index 283dd44f04b0..2c2a93f22ba0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
>> @@ -1569,6 +1569,14 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>>  
>>  }
>>  
>> +int amdgpu_vm_handle_prt_moved(struct amdgpu_device *adev, struct amdgpu_vm 
>> *vm)
>> +{
>> +    struct amdgpu_fpriv *fpriv;
>> +
>> +    fpriv = container_of(vm, struct amdgpu_fpriv, vm);
> That's an absolute no-go from inside the VM code.
>
> You could maybe do that in the KFD code, but Felix and Philip need to decide 
> that.


We're calling amdgpu_vm_handle_moved in amdgpu_amdkfd_gpuvm_restore_process_bos 
to update mappings of BOs that are not managed by KFD. I don't think we care 
about PRT BOs, though. They are not useful for compute work.

What is the problem you're trying to fix here? Is amdgpu_vm_handle_moved 
crashing because of the PRT BO? In that case I'd try to make 
amdgpu_vm_handle_moved more robust to handle that situation gracefully (i.e. 
ignoring PRT BOs).

Regards,
  Felix


>
> Regards,
> Christian.
>
>> +    return amdgpu_vm_bo_update(adev, fpriv->prt_va, false);
>> +}
>> +
>>  /**
>>   * amdgpu_vm_handle_moved - handle moved BOs in the PT
>>   *
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index f9549f6b3d1f..853a66f9fd2e 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -504,6 +504,7 @@ int amdgpu_vm_update_pdes(struct amdgpu_device *adev,
>>  int amdgpu_vm_clear_freed(struct amdgpu_device *adev,
>>                        struct amdgpu_vm *vm,
>>                        struct dma_fence **fence);
>> +int amdgpu_vm_handle_prt_moved(struct amdgpu_device *adev, struct amdgpu_vm 
>> *vm);
>>  int amdgpu_vm_handle_moved(struct amdgpu_device *adev,
>>                         struct amdgpu_vm *vm,
>>                         struct ww_acquire_ctx *ticket);

Re: [PATCH] drm/amdgpu: fix nullptr error of amdgpu_vm_handle_moved

Reply via email to