On 2025-08-08 5:21, Christian König wrote: > > On 08.08.25 05:14, Heng Zhou wrote: >> If a amdgpu_bo_va is fpriv->prt_va, the bo of this one is always NULL. >> So, such kind of amdgpu_bo_va should be updated separately before >> amdgpu_vm_handle_moved. >> >> Signed-off-by: Heng Zhou <[email protected]> >> --- >> drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c | 6 ++++++ >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 8 ++++++++ >> drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 1 + >> 3 files changed, 15 insertions(+) >> >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> index 37d8a7034a7e..e795b2970620 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c >> @@ -2970,6 +2970,12 @@ int amdgpu_amdkfd_gpuvm_restore_process_bos(void >> *info, struct dma_fence __rcu * >> struct amdgpu_device *adev = amdgpu_ttm_adev( >> peer_vm->root.bo->tbo.bdev); >> >> + ret = amdgpu_vm_handle_prt_moved(adev,peer_vm); >> + if (ret) { >> + pr_debug("Memory eviction: handle PRT moved failed. Try >> again\n"); >> + goto validate_map_fail; >> + } >> + >> ret = amdgpu_vm_handle_moved(adev, peer_vm, &exec.ticket); >> if (ret) { >> pr_debug("Memory eviction: handle moved failed. Try >> again\n"); >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> index 283dd44f04b0..2c2a93f22ba0 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c >> @@ -1569,6 +1569,14 @@ int amdgpu_vm_clear_freed(struct amdgpu_device *adev, >> >> } >> >> +int amdgpu_vm_handle_prt_moved(struct amdgpu_device *adev, struct amdgpu_vm >> *vm) >> +{ >> + struct amdgpu_fpriv *fpriv; >> + >> + fpriv = container_of(vm, struct amdgpu_fpriv, vm); > That's an absolute no-go from inside the VM code. > > You could maybe do that in the KFD code, but Felix and Philip need to decide > that.
We're calling amdgpu_vm_handle_moved in amdgpu_amdkfd_gpuvm_restore_process_bos to update mappings of BOs that are not managed by KFD. I don't think we care about PRT BOs, though. They are not useful for compute work. What is the problem you're trying to fix here? Is amdgpu_vm_handle_moved crashing because of the PRT BO? In that case I'd try to make amdgpu_vm_handle_moved more robust to handle that situation gracefully (i.e. ignoring PRT BOs). Regards, Felix > > Regards, > Christian. > >> + return amdgpu_vm_bo_update(adev, fpriv->prt_va, false); >> +} >> + >> /** >> * amdgpu_vm_handle_moved - handle moved BOs in the PT >> * >> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >> index f9549f6b3d1f..853a66f9fd2e 100644 >> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h >> @@ -504,6 +504,7 @@ int amdgpu_vm_update_pdes(struct amdgpu_device *adev, >> int amdgpu_vm_clear_freed(struct amdgpu_device *adev, >> struct amdgpu_vm *vm, >> struct dma_fence **fence); >> +int amdgpu_vm_handle_prt_moved(struct amdgpu_device *adev, struct amdgpu_vm >> *vm); >> int amdgpu_vm_handle_moved(struct amdgpu_device *adev, >> struct amdgpu_vm *vm, >> struct ww_acquire_ctx *ticket);
