[AMD Official Use Only - AMD Internal Distribution Only]
> Can you include that in the patch description?
Sure. This will be the final commit message.
drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job
The field job->vm is used in function amdgpu_job_run to get the page
table re-generation counter and decide whether the job should be skipped.
Specifically, function amdgpu_vm_generation checks if the VM is valid for
this job to use.
For instance, if a gfx job depends on a cancelled sdma job from entity
vm->delayed,
then the gfx job should be skipped.
Signed-off-by: YuanShang <[email protected]>
Reviewed-by: Alex Deucher <[email protected]>
Thanks
River
-----Original Message-----
From: Alex Deucher <[email protected]>
Sent: Tuesday, July 29, 2025 9:08 PM
To: YuanShang Mao (River) <[email protected]>
Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
<[email protected]>; [email protected]; cao, lin
<[email protected]>; Zhang, Tiantian (Celine) <[email protected]>
Subject: Re: [PATCH] drm/amdgpu: keep job->vm in amdgpu_job_prepare_job
On Mon, Jul 28, 2025 at 10:41 PM YuanShang Mao (River) <[email protected]>
wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Alexander
>
> > I didn't think we actually resubmitting jobs anymore. How are we ending up
> > trying to resubmit in the first place?
>
> It is not about resubmitting. amdgpu_vm_generation will check if the VM is
> valid for this job to use. For example, if a gfx job depends on an sdma job,
> which is already cancelled, then the gfx job should be skipped.
Ah, that makes sense. Can you include that in the patch description?
With that included, the patch is
Reviewed-by: Alex Deucher <[email protected]>
> Perhaps the dependencies between tasks should be resolved by the drm instead
> of amdgpu.
If we can do that or check the dependencies via the job itself that would be
better in the long term.
Alex
>
> uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct
> amdgpu_vm *vm) {
> uint64_t result = (u64)atomic_read(&adev->vram_lost_counter)
> << 32;
>
> if (!vm)
> return result;
>
> result += lower_32_bits(vm->generation);
> /* Add one if the page tables will be re-generated on next CS */
> if (drm_sched_entity_error(&vm->delayed))
> ++result;
>
> return result;
> }
>
> Thanks
> River
>
>
> -----Original Message-----
> From: Alex Deucher <[email protected]>
> Sent: Tuesday, July 29, 2025 2:10 AM
> To: YuanShang Mao (River) <[email protected]>
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>; [email protected]; cao, lin
> <[email protected]>; Zhang, Tiantian (Celine) <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: keep job->vm in
> amdgpu_job_prepare_job
>
> On Mon, Jul 28, 2025 at 5:01 AM YuanShang Mao (River) <[email protected]>
> wrote:
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi Alexander
> >
> > Since Christian is on vacation. Could you help review the below
> > patch?
> > If set job->vm to null in amdgpu_job_prepare_job, the job which
> > should be skipped in amdgpu_job_run will be submitted unexpectedly.
>
> I think we can just remove the amdgpu_vm_generation() check in
> amdgpu_job_run(). I didn't think we actually resubmitting jobs anymore. How
> are we ending up trying to resubmit in the first place?
>
> Alex
>
> >
> > Thanks
> > River
> >
> >
> > -----Original Message-----
> > From: YuanShang Mao (River) <[email protected]>
> > Sent: Wednesday, July 23, 2025 5:06 PM
> > To: [email protected]
> > Cc: Koenig, Christian <[email protected]>; YuanShang Mao
> > (River) <[email protected]>
> > Subject: [PATCH] drm/amdgpu: keep job->vm in amdgpu_job_prepare_job
> >
> > job->vm is used in function amdgpu_job_run to get the page
> > table re-generation counter and decide whether the job should be skipped.
> >
> > Signed-off-by: YuanShang <[email protected]>
> > ---
> > drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 7 -------
> > 1 file changed, 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 45febdc2f349..18998343815e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -360,13 +360,6 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
> > dev_err(ring->adev->dev, "Error getting VM ID
> > (%d)\n", r);
> > goto error;
> > }
> > - /*
> > - * The VM structure might be released after the VMID is
> > - * assigned, we had multiple problems with people trying to
> > use
> > - * the VM pointer so better set it to NULL.
> > - */
> > - if (!fence)
> > - job->vm = NULL;
> > return fence;
> > }
> >
> > --
> > 2.25.1
> >