On 2017-10-20 02:24 PM, Christian König wrote:
Am 20.10.2017 um 18:51 schrieb Andrey Grodzovsky:


On 2017-10-20 12:26 PM, Andres Rodriguez wrote:


On 2017-10-20 12:19 PM, Andrey Grodzovsky wrote:


On 2017-10-20 11:59 AM, Andres Rodriguez wrote:


On 2017-10-20 09:32 AM, Andrey Grodzovsky wrote:
Bug: amdgpu_job_free_cb was accessing s_job->s_entity when the allocated
amdgpu_ctx (and the entity inside it) were already deallocated from
amdgpu_cs_parser_fini.

Fix: Save the job's priority at creation time instead of accessing it
through s_entity later on.
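
In other words (a simplified sketch; identifiers as in the patch below):

    /* before: the free_job callback recomputed the priority through
     * s_job->s_entity, which amdgpu_cs_parser_fini() may already have
     * freed together with the ctx -> use-after-free */
    amdgpu_ring_priority_put(job->ring, amd_sched_get_job_priority(s_job));

    /* after: the priority is snapshotted in amd_sched_job_init(), so the
     * free path never dereferences the entity */
    amdgpu_ring_priority_put(job->ring, s_job->s_priority);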

I'm not sure if this is the correct approach for a fix.

Keeping s_entity as a dangling pointer could result in similar bugs being reintroduced. For example, there would still be a race condition between amdgpu_cs_parser_fini() and amdgpu_job_dependency().

The .dependency hook is called in only one place, amd_sched_entity_pop_job(). amdgpu_cs_parser_fini() will wait (from amd_sched_entity_fini()) for the wake_up(&sched->job_scheduled) from amd_sched_main(), so I don't see a race here.
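
Roughly, the ordering looks like this (a sketch from memory of
gpu_scheduler.c, details elided):

    /* scheduler thread */
    static int amd_sched_main(void *param)
    {
            ...
            /* amd_sched_entity_pop_job() is the only caller of
             * sched->ops->dependency() */
            sched_job = amd_sched_entity_pop_job(entity);
            ...
            wake_up(&sched->job_scheduled);
    }

    /* teardown path reached from amdgpu_cs_parser_fini() */
    void amd_sched_entity_fini(struct amd_gpu_scheduler *sched,
                               struct amd_sched_entity *entity)
    {
            /* blocks until the scheduler thread is done with the
             * entity, so no .dependency call can race with the free */
            wait_event(sched->job_scheduled,
                       amd_sched_entity_is_idle(entity));
            ...
    }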



Instead, it might be better for the job to grab a reference to the context during job_init(), and put that reference on job free.
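
I.e. something along these lines (hypothetical sketch; amdgpu_job has no
ctx field today):

    job->ctx = amdgpu_ctx_get(fpriv, cs->in.ctx_id);  /* at job creation */
    ...
    amdgpu_ctx_put(job->ctx);                         /* in free_job */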

Originally that was my thinking too, but I consulted Christian and he advised against it, quote: "it's not the best idea since the problem is that when we terminate a process we need to make sure that all resources are released, or at least not held much longer. When we keep the ctx alive with the job we need to also keep the VM alive, and that means we need to keep all the VM page tables alive".


That makes sense.

Since s_entity is tied to the context reference held by the parser, can you set it to NULL when you drop the context reference there?

I am not sure I understand - you want to set s_job->s_entity to NULL in amd_sched_entity_fini() for each remaining job in the queue? But all the jobs remaining in the queue are destroyed there anyway.

I think what Andres means here is exactly what we planned to do anyway. Set job->s_entity to NULL as soon as we know that the entity is not used any more and might be released.
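
Something like the following, presumably (just a sketch; the exact spot in
amd_sched_entity_fini and the queue handling still need to be worked out):

    /* when the entity is torn down, detach the jobs that remain queued
     * so that any stale access faults on NULL instead of reading freed
     * memory */
    while (kfifo_out(&entity->job_queue, &job, sizeof(job))) {
            job->s_entity = NULL;
            sched->ops->free_job(job);
    }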

Yeah this is what I would like to see. If you already have discussed it and have a plan to address it, then this patch looks good to me for static and dynamic priorities.

Feel free to add:
Reviewed-by: Andres Rodriguez <[email protected]>


In the long term we should aim to make s_job->s_entity as well as job->vm superfluous. This way we could even push the remaining jobs onto a graveyard entity when we destroy one, and on timeout.

Alternatively we could look into why wait_event_killable is sometimes not killable as the name says :)

Maybe we can get to a point where we can finally reboot the system cleanly even when the GPU is stuck.

Regards,
Christian.


Thanks,
Andrey


At least that way we can easily detect misuse of s_entity after it enters a "possibly deleted" state.

Regards,
Andres

Thanks,
Andrey


Regards,
Andres


Signed-off-by: Andrey Grodzovsky <[email protected]>
---
  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c        |  3 +--
  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  5 ++---
  drivers/gpu/drm/amd/scheduler/gpu_scheduler.c |  1 +
  drivers/gpu/drm/amd/scheduler/gpu_scheduler.h | 32 ++++++++++++---------------
  4 files changed, 18 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index f7fceb6..a760b6e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -1192,8 +1192,7 @@ static int amdgpu_cs_submit(struct amdgpu_cs_parser *p,
      job->uf_sequence = seq;

      amdgpu_job_free_resources(job);
-    amdgpu_ring_priority_get(job->ring,
-                             amd_sched_get_job_priority(&job->base));
+    amdgpu_ring_priority_get(job->ring, job->base.s_priority);

      trace_amdgpu_cs_ioctl(job);
      amd_sched_entity_push_job(&job->base);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 0cfc68d..a58e3c5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -104,7 +104,7 @@ static void amdgpu_job_free_cb(struct amd_sched_job *s_job)
  {
      struct amdgpu_job *job = container_of(s_job, struct amdgpu_job, base);

-    amdgpu_ring_priority_put(job->ring, amd_sched_get_job_priority(s_job));
+    amdgpu_ring_priority_put(job->ring, s_job->s_priority);
      dma_fence_put(job->fence);
      amdgpu_sync_free(&job->sync);
      amdgpu_sync_free(&job->dep_sync);
@@ -141,8 +141,7 @@ int amdgpu_job_submit(struct amdgpu_job *job, struct amdgpu_ring *ring,
      job->fence_ctx = entity->fence_context;
      *f = dma_fence_get(&job->base.s_fence->finished);
      amdgpu_job_free_resources(job);
-    amdgpu_ring_priority_get(job->ring,
-                             amd_sched_get_job_priority(&job->base));
+    amdgpu_ring_priority_get(job->ring, job->base.s_priority);
      amd_sched_entity_push_job(&job->base);

      return 0;
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index e4d3b4e..1bbbce2 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -529,6 +529,7 @@ int amd_sched_job_init(struct amd_sched_job *job,
  {
      job->sched = sched;
      job->s_entity = entity;
+    job->s_priority = entity->rq - sched->sched_rq;
      job->s_fence = amd_sched_fence_create(entity, owner);
      if (!job->s_fence)
          return -ENOMEM;
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index 52c8e54..3f75b45 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -30,6 +30,19 @@
  struct amd_gpu_scheduler;
  struct amd_sched_rq;

+enum amd_sched_priority {
+    AMD_SCHED_PRIORITY_MIN,
+    AMD_SCHED_PRIORITY_LOW = AMD_SCHED_PRIORITY_MIN,
+    AMD_SCHED_PRIORITY_NORMAL,
+    AMD_SCHED_PRIORITY_HIGH_SW,
+    AMD_SCHED_PRIORITY_HIGH_HW,
+    AMD_SCHED_PRIORITY_KERNEL,
+    AMD_SCHED_PRIORITY_MAX,
+    AMD_SCHED_PRIORITY_INVALID = -1,
+    AMD_SCHED_PRIORITY_UNSET = -2
+};
+
+
  /**
   * A scheduler entity is a wrapper around a job queue or a group
   * of other entities. Entities take turns emitting jobs from their
@@ -83,6 +96,7 @@ struct amd_sched_job {
      struct delayed_work        work_tdr;
      uint64_t            id;
      atomic_t karma;
+    enum amd_sched_priority s_priority;
  };

  extern const struct dma_fence_ops amd_sched_fence_ops_scheduled;
@@ -114,18 +128,6 @@ struct amd_sched_backend_ops {
      void (*free_job)(struct amd_sched_job *sched_job);
  };

-enum amd_sched_priority {
-    AMD_SCHED_PRIORITY_MIN,
-    AMD_SCHED_PRIORITY_LOW = AMD_SCHED_PRIORITY_MIN,
-    AMD_SCHED_PRIORITY_NORMAL,
-    AMD_SCHED_PRIORITY_HIGH_SW,
-    AMD_SCHED_PRIORITY_HIGH_HW,
-    AMD_SCHED_PRIORITY_KERNEL,
-    AMD_SCHED_PRIORITY_MAX,
-    AMD_SCHED_PRIORITY_INVALID = -1,
-    AMD_SCHED_PRIORITY_UNSET = -2
-};
-
  /**
   * One scheduler is implemented for each hardware ring
  */
@@ -176,10 +178,4 @@ bool amd_sched_dependency_optimized(struct dma_fence* fence,
                      struct amd_sched_entity *entity);
  void amd_sched_job_kickout(struct amd_sched_job *s_job);

-static inline enum amd_sched_priority
-amd_sched_get_job_priority(struct amd_sched_job *job)
-{
-    return (job->s_entity->rq - job->sched->sched_rq);
-}
-
  #endif
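
For reference, the "entity->rq - sched->sched_rq" assignment in
amd_sched_job_init() works because sched_rq is an array indexed by
priority (illustration only):

    /* gpu_scheduler.h: one run queue per priority level */
    struct amd_gpu_scheduler {
            ...
            struct amd_sched_rq sched_rq[AMD_SCHED_PRIORITY_MAX];
            ...
    };

    /* an entity created at NORMAL priority has
     *     entity->rq == &sched->sched_rq[AMD_SCHED_PRIORITY_NORMAL]
     * so the pointer subtraction yields AMD_SCHED_PRIORITY_NORMAL - the
     * same value the removed amd_sched_get_job_priority() computed, but
     * captured once at init time. */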


