Am 23.07.24 um 11:04 schrieb Yin, ZhenGuo (Chris):
[AMD Official Use Only - AMD Internal Distribution Only]

Hi Christian

I prepared this patch because we met a page fault after gpu reset in SRIOV 
triggered by quark.
After investigation, I found that the page table did not get restored after gpu 
reset.
I just tried to use vm_update_mode=0 to disable VM update by CPU, and this 
issue still exists.

CPU based page table updates are not supported under SRIOV.


-[Christian] When VRAM is lost any submission using the VM entity will fail and 
so result in a new page table generation.
I believe that you are referring this piece of code in function 
amdgpu_job_run():
         /* Skip job if VRAM is lost and never resubmit gangs */
         if (job->generation != amdgpu_vm_generation(adev, job->vm) ||
             (job->job_run_counter && job->gang_submit))
                 dma_fence_set_error(finished, -ECANCELED);

I agree that the submission from the old ctx will be set as ECANCELED and be 
skipped.
But if the application then creates a new ctx and submit a new job, both 
new_ctx->generation and new_job->generation will be initiated as the updated 
one.
         ctx->generation = amdgpu_vm_generation(mgr->adev, &fpriv->vm);
         (*job)->generation = amdgpu_vm_generation(adev, vm);
And the job will be executed, hence there will be a page fault due to VRAM lost.

Please correct me if I have some misunderstanding.
I still can not see why any submission using the VM entity will fail due to 
VRAM lost.

Ah! My assumption was that you will also always have a VM page table job which would be canceled.

But that isn't the case, instead you have only a context which is re-created.

In that case the approach here is valid, but let me comment on the patch itself.

Regards,
Christian.


Best,
Zhenguo
Cloud-GPU Core team, SRDC

-----Original Message-----
From: Christian König <[email protected]>
Sent: Tuesday, July 23, 2024 3:13 PM
To: Yin, ZhenGuo (Chris) <[email protected]>; [email protected]
Cc: Koenig, Christian <[email protected]>
Subject: Re: [PATCH v2] drm/amdgpu: reset vm state machine after gpu reset(vram 
lost)

Am 23.07.24 um 05:05 schrieb ZhenGuo Yin:
[Why]
Page table of compute VM in the VRAM will lost after gpu reset.
VRAM won't be restored since compute VM has no shadows.

[How]
Use higher 32-bit of vm->generation to record a vram_lost_counter.
Reset the VM state machine when vm->genertaion is not equal to
re-generation token.

v2: Check vm->generation instead of calling drm_sched_entity_error in
amdgpu_vm_validate.
I've just double checked the logic again and as far as I can see this patch 
here is still completely superfluous.

When VRAM is lost any submission using the VM entity will fail and so result in 
a new page table generation.

What isn't handled are CPU based page table updates, but for those we could 
potentially change the condition inside the CPU page tables code.

Regards,
Christian.

Signed-off-by: ZhenGuo Yin <[email protected]>
---
   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c | 11 +++++++----
   1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
index 3abfa66d72a2..9e2f84c166e7 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.c
@@ -434,7 +434,7 @@ uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, 
struct amdgpu_vm *vm)
       if (!vm)
               return result;

-     result += vm->generation;
+     result += (vm->generation & 0xFFFFFFFF);
       /* Add one if the page tables will be re-generated on next CS */
       if (drm_sched_entity_error(&vm->delayed))
               ++result;
@@ -467,9 +467,12 @@ int amdgpu_vm_validate(struct amdgpu_device *adev, struct 
amdgpu_vm *vm,
       struct amdgpu_bo *shadow;
       struct amdgpu_bo *bo;
       int r;
+     uint32_t vram_lost_counter = atomic_read(&adev->vram_lost_counter);

-     if (drm_sched_entity_error(&vm->delayed)) {
-             ++vm->generation;
+     if (vm->generation != amdgpu_vm_generation(adev, vm)) {
+             if (drm_sched_entity_error(&vm->delayed))
+                     ++vm->generation;
+             vm->generation = (u64)vram_lost_counter << 32 | (vm->generation &
+0xFFFFFFFF);
               amdgpu_vm_bo_reset_state_machine(vm);
               amdgpu_vm_fini_entities(vm);
               r = amdgpu_vm_init_entities(adev, vm); @@ -2439,7 +2442,7 @@ int
amdgpu_vm_init(struct amdgpu_device *adev, struct amdgpu_vm *vm,
       vm->last_update = dma_fence_get_stub();
       vm->last_unlocked = dma_fence_get_stub();
       vm->last_tlb_flush = dma_fence_get_stub();
-     vm->generation = 0;
+     vm->generation = (u64)atomic_read(&adev->vram_lost_counter) << 32;

       mutex_init(&vm->eviction_lock);
       vm->evicting = false;

Reply via email to