vm in amdgpu_job_prepare_job

YuanShang Mao (River) Tue, 29 Jul 2025 20:39:18 -0700

[AMD Official Use Only - AMD Internal Distribution Only]

> Can you include that in the patch description?
Sure.  This will be the final commit message.


    drm/amdgpu: Retain job->vm in amdgpu_job_prepare_job

    The field job->vm is used in function amdgpu_job_run to get the page
    table re-generation counter and decide whether the job should be skipped.

    Specifically, function amdgpu_vm_generation checks if the VM is valid for 
this job to use.
    For instance, if a gfx job depends on a cancelled sdma job from entity 
vm->delayed,
    then the gfx job should be skipped.

    Signed-off-by: YuanShang <[email protected]>
    Reviewed-by: Alex Deucher <[email protected]>

Thanks
River
-----Original Message-----
From: Alex Deucher <[email protected]>
Sent: Tuesday, July 29, 2025 9:08 PM
To: YuanShang Mao (River) <[email protected]>
Cc: Deucher, Alexander <[email protected]>; Koenig, Christian 
<[email protected]>; [email protected]; cao, lin 
<[email protected]>; Zhang, Tiantian (Celine) <[email protected]>
Subject: Re: [PATCH] drm/amdgpu: keep job->vm in amdgpu_job_prepare_job

On Mon, Jul 28, 2025 at 10:41 PM YuanShang Mao (River) <[email protected]> 
wrote:
>
> [AMD Official Use Only - AMD Internal Distribution Only]
>
> Hi Alexander
>
> > I didn't think we actually resubmitting jobs anymore.  How are we ending up 
> > trying to resubmit in the first place?
>
> It is not about resubmitting. amdgpu_vm_generation will check if the VM is 
> valid for this job to use. For example, if a gfx job depends on an sdma job, 
> which is already cancelled, then the gfx job should be skipped.

Ah, that makes sense.  Can you include that in the patch description?
With that included, the patch is
Reviewed-by: Alex Deucher <[email protected]>

> Perhaps the dependencies between tasks should be resolved by the drm instead 
> of amdgpu.

If we can do that or check the dependencies via the job itself that would be 
better in the long term.

Alex

>
> uint64_t amdgpu_vm_generation(struct amdgpu_device *adev, struct
> amdgpu_vm *vm) {
>         uint64_t result = (u64)atomic_read(&adev->vram_lost_counter)
> << 32;
>
>         if (!vm)
>                 return result;
>
>         result += lower_32_bits(vm->generation);
>         /* Add one if the page tables will be re-generated on next CS */
>         if (drm_sched_entity_error(&vm->delayed))
>                 ++result;
>
>         return result;
> }
>
> Thanks
> River
>
>
> -----Original Message-----
> From: Alex Deucher <[email protected]>
> Sent: Tuesday, July 29, 2025 2:10 AM
> To: YuanShang Mao (River) <[email protected]>
> Cc: Deucher, Alexander <[email protected]>; Koenig, Christian
> <[email protected]>; [email protected]; cao, lin
> <[email protected]>; Zhang, Tiantian (Celine) <[email protected]>
> Subject: Re: [PATCH] drm/amdgpu: keep job->vm in
> amdgpu_job_prepare_job
>
> On Mon, Jul 28, 2025 at 5:01 AM YuanShang Mao (River) <[email protected]> 
> wrote:
> >
> > [AMD Official Use Only - AMD Internal Distribution Only]
> >
> > Hi Alexander
> >
> >         Since Christian is on vacation. Could you help review the below 
> > patch?
> >         If set job->vm to null in amdgpu_job_prepare_job, the job which 
> > should be skipped in amdgpu_job_run will be submitted unexpectedly.
>
> I think we can just remove the amdgpu_vm_generation() check in 
> amdgpu_job_run().  I didn't think we actually resubmitting jobs anymore.  How 
> are we ending up trying to resubmit in the first place?
>
> Alex
>
> >
> > Thanks
> > River
> >
> >
> > -----Original Message-----
> > From: YuanShang Mao (River) <[email protected]>
> > Sent: Wednesday, July 23, 2025 5:06 PM
> > To: [email protected]
> > Cc: Koenig, Christian <[email protected]>; YuanShang Mao
> > (River) <[email protected]>
> > Subject: [PATCH] drm/amdgpu: keep job->vm in amdgpu_job_prepare_job
> >
> > job->vm is used in function amdgpu_job_run to get the page
> > table re-generation counter and decide whether the job should be skipped.
> >
> > Signed-off-by: YuanShang <[email protected]>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c | 7 -------
> >  1 file changed, 7 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > index 45febdc2f349..18998343815e 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> > @@ -360,13 +360,6 @@ amdgpu_job_prepare_job(struct drm_sched_job *sched_job,
> >                         dev_err(ring->adev->dev, "Error getting VM ID 
> > (%d)\n", r);
> >                         goto error;
> >                 }
> > -               /*
> > -                * The VM structure might be released after the VMID is
> > -                * assigned, we had multiple problems with people trying to 
> > use
> > -                * the VM pointer so better set it to NULL.
> > -                */
> > -               if (!fence)
> > -                       job->vm = NULL;
> >                 return fence;
> >         }
> >
> > --
> > 2.25.1
> >

RE: [PATCH] drm/amdgpu: keep job->vm in amdgpu_job_prepare_job

Reply via email to