On 14/10/2025 11:27, Philipp Stanner wrote:
On Wed, 2025-10-08 at 09:53 +0100, Tvrtko Ursulin wrote:
Fair scheduling policy is built upon the same concepts as the well known
nit: "The fair …"
Or maybe better: call it FAIR, being congruent with the FIFO below.
CFS kernel scheduler - entity run queue is sorted by the virtual GPU time
nit: Call it "CPU scheduler". The GPU scheduler is a kernel scheduler,
too.
consumed by entities in a way that the entity with least vruntime runs
first.
It is able to avoid total priority starvation, which is one of the
problems with FIFO, and it also does not need per-priority run queues.
As it scales the actual GPU runtime by an exponential factor as the
priority decreases, therefore the virtual runtime for low priority
"therefore," is not necessary because of the sentence starting with
"As"
Done x3 above.
entities grows faster than for normal priority, pushing them further down
the runqueue order for the same real GPU time spent.
Apart from this fundamental fairness, the fair policy is especially strong
in oversubscription workloads, where it is able to give more GPU time to
short and bursty workloads running in parallel with GPU-heavy clients
submitting deep job queues.
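For flavour, the priority scaling described above could look roughly like this (a standalone userspace sketch; the factor-of-two shift and all `sketch_` names are illustrative assumptions, not code from the patch):

```c
#include <stdint.h>

/* Illustrative priorities mirroring the DRM scheduler ordering,
 * lowest numeric value being the most important. */
enum sketch_prio { SKETCH_KERNEL = 0, SKETCH_HIGH, SKETCH_NORMAL, SKETCH_LOW };

/*
 * Charge virtual time for real GPU time spent: each level below NORMAL
 * doubles the vruntime charged, so low priority entities accumulate
 * vruntime faster and sort further down the run queue for the same
 * real GPU time.
 */
static uint64_t sketch_scale_vruntime(uint64_t real_ns, enum sketch_prio p)
{
	unsigned int shift = p > SKETCH_NORMAL ? p - SKETCH_NORMAL : 0;

	return real_ns << shift;
}
```

Kernel and normal priority are charged real time one-to-one in this sketch; only lower priorities pay the exponential surcharge.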
Signed-off-by: Tvrtko Ursulin <[email protected]>
Cc: Christian König <[email protected]>
Cc: Danilo Krummrich <[email protected]>
Cc: Matthew Brost <[email protected]>
Cc: Philipp Stanner <[email protected]>
Cc: Pierre-Eric Pelloux-Prayer <[email protected]>
---
drivers/gpu/drm/scheduler/sched_entity.c | 28 ++--
drivers/gpu/drm/scheduler/sched_internal.h | 9 +-
drivers/gpu/drm/scheduler/sched_main.c | 12 +-
drivers/gpu/drm/scheduler/sched_rq.c | 147 ++++++++++++++++++++-
include/drm/gpu_scheduler.h | 16 ++-
5 files changed, 191 insertions(+), 21 deletions(-)
diff --git a/drivers/gpu/drm/scheduler/sched_entity.c b/drivers/gpu/drm/scheduler/sched_entity.c
index 04ce8b7d436b..58f51875547a 100644
--- a/drivers/gpu/drm/scheduler/sched_entity.c
+++ b/drivers/gpu/drm/scheduler/sched_entity.c
@@ -108,6 +108,8 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
entity->guilty = guilty;
entity->num_sched_list = num_sched_list;
entity->priority = priority;
+ entity->rq_priority = drm_sched_policy == DRM_SCHED_POLICY_FAIR ?
+ DRM_SCHED_PRIORITY_KERNEL : priority;
/*
* It's perfectly valid to initialize an entity without having a valid
* scheduler attached. It's just not valid to use the scheduler before it
@@ -124,17 +126,23 @@ int drm_sched_entity_init(struct drm_sched_entity *entity,
*/
pr_warn("%s: called with uninitialized scheduler\n", __func__);
} else if (num_sched_list) {
- /* The "priority" of an entity cannot exceed the number of run-queues of a
- * scheduler. Protect against num_rqs being 0, by converting to signed. Choose
- * the lowest priority available.
+ enum drm_sched_priority p = entity->priority;
+
+ /*
+ * The "priority" of an entity cannot exceed the number of
+ * run-queues of a scheduler. Protect against num_rqs being 0,
+ * by converting to signed. Choose the lowest priority
+ * available.
*/
- if (entity->priority >= sched_list[0]->num_rqs) {
- dev_err(sched_list[0]->dev, "entity has out-of-bounds priority: %u. num_rqs: %u\n",
- entity->priority, sched_list[0]->num_rqs);
- entity->priority = max_t(s32, (s32) sched_list[0]->num_rqs - 1,
- (s32) DRM_SCHED_PRIORITY_KERNEL);
+ if (p >= sched_list[0]->num_user_rqs) {
+ dev_err(sched_list[0]->dev, "entity with out-of-bounds priority:%u num_user_rqs:%u\n",
+ p, sched_list[0]->num_user_rqs);
+ p = max_t(s32,
+ (s32)sched_list[0]->num_user_rqs - 1,
+ (s32)DRM_SCHED_PRIORITY_KERNEL);
+ entity->priority = p;
}
- entity->rq = sched_list[0]->sched_rq[entity->priority];
+ entity->rq = sched_list[0]->sched_rq[entity->rq_priority];
That rename could be a separate patch, couldn't it? As I said before
it's always great to have general code improvements as separate patches
since it makes it far easier to review (i.e.: detect / see) core
functionality changes.
No, this is the new struct member only added in this patch.
}
init_completion(&entity->entity_idle);
@@ -567,7 +575,7 @@ void drm_sched_entity_select_rq(struct drm_sched_entity *entity)
spin_lock(&entity->lock);
sched = drm_sched_pick_best(entity->sched_list, entity->num_sched_list);
- rq = sched ? sched->sched_rq[entity->priority] : NULL;
+ rq = sched ? sched->sched_rq[entity->rq_priority] : NULL;
if (rq != entity->rq) {
drm_sched_rq_remove_entity(entity->rq, entity);
entity->rq = rq;
diff --git a/drivers/gpu/drm/scheduler/sched_internal.h b/drivers/gpu/drm/scheduler/sched_internal.h
index 1132a771aa37..c94e38acc6f2 100644
--- a/drivers/gpu/drm/scheduler/sched_internal.h
+++ b/drivers/gpu/drm/scheduler/sched_internal.h
@@ -18,18 +18,23 @@
* @kref: reference count for the object.
* @lock: lock guarding the @runtime updates.
* @runtime: time entity spent on the GPU.
+ * @prev_runtime: previous @runtime used to get the runtime delta
+ * @vruntime: virtual runtime as accumulated by the fair algorithm
The other docstrings are all terminated with a full stop '.'
Yep I fixed the whole series in this respect already as response to one
of your earlier comments.
*/
struct drm_sched_entity_stats {
struct kref kref;
spinlock_t lock;
ktime_t runtime;
+ ktime_t prev_runtime;
+ u64 vruntime;
};
/* Used to choose between FIFO and RR job-scheduling */
extern int drm_sched_policy;
-#define DRM_SCHED_POLICY_RR 0
-#define DRM_SCHED_POLICY_FIFO 1
+#define DRM_SCHED_POLICY_RR 0
+#define DRM_SCHED_POLICY_FIFO 1
+#define DRM_SCHED_POLICY_FAIR 2
Formatting unnecessarily increases the git diff.
Let's die the death of having the old formatting. As far as it's
foreseeable, FAIR will be the last policy for the classic drm_sched
anyway, so no future changes here expected.
Strange, I thought I fixed this already in the previous respin. Re-fixed
and verified.
bool drm_sched_can_queue(struct drm_gpu_scheduler *sched,
struct drm_sched_entity *entity);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index f180d292bf66..8d8f9c8411f5 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -90,7 +90,7 @@ int drm_sched_policy = DRM_SCHED_POLICY_FIFO;
* DOC: sched_policy (int)
* Used to override default entities scheduling policy in a run queue.
*/
-MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO (default).");
+MODULE_PARM_DESC(sched_policy, "Specify the scheduling policy for entities on a run-queue, " __stringify(DRM_SCHED_POLICY_RR) " = Round Robin, " __stringify(DRM_SCHED_POLICY_FIFO) " = FIFO, " __stringify(DRM_SCHED_POLICY_FAIR) " = Fair (default).");
module_param_named(sched_policy, drm_sched_policy, int, 0444);
static u32 drm_sched_available_credits(struct drm_gpu_scheduler *sched)
@@ -1132,11 +1132,15 @@ int drm_sched_init(struct drm_gpu_scheduler *sched, const struct drm_sched_init_
sched->own_submit_wq = true;
}
- sched->sched_rq = kmalloc_array(args->num_rqs, sizeof(*sched->sched_rq),
+ sched->num_user_rqs = args->num_rqs;
+ sched->num_rqs = drm_sched_policy != DRM_SCHED_POLICY_FAIR ?
+ args->num_rqs : 1;
+ sched->sched_rq = kmalloc_array(sched->num_rqs,
+ sizeof(*sched->sched_rq),
Don't reformat that for the git diff? Line doesn't seem crazily long.
Ok.
GFP_KERNEL | __GFP_ZERO);
if (!sched->sched_rq)
goto Out_check_own;
- sched->num_rqs = args->num_rqs;
+
for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
sched->sched_rq[i] = kzalloc(sizeof(*sched->sched_rq[i]), GFP_KERNEL);
if (!sched->sched_rq[i])
@@ -1278,7 +1282,7 @@ void drm_sched_increase_karma(struct drm_sched_job *bad)
if (bad->s_priority != DRM_SCHED_PRIORITY_KERNEL) {
atomic_inc(&bad->karma);
- for (i = DRM_SCHED_PRIORITY_HIGH; i < sched->num_rqs; i++) {
+ for (i = DRM_SCHED_PRIORITY_KERNEL; i < sched->num_rqs; i++) {
Give me a pointer here quickly – what's that about?
Since FAIR stuffs everything into a single run queue it needs to start
looking into it when looking for the guilty context. FIFO and RR are not
affected since they will not find the context with the kernel priority
in the kernel run queue anyway.
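The point is easy to see from the run queue counts involved (standalone sketch; the helper and `SKETCH_` names are made up for illustration, only the priority ordering matches `enum drm_sched_priority`):

```c
/* Priority indices as in enum drm_sched_priority: KERNEL is the lowest. */
#define SKETCH_PRIO_KERNEL 0
#define SKETCH_PRIO_HIGH   1

/* How many run queues a loop "for (i = start; i < num_rqs; i++)" visits. */
static int sketch_rqs_visited(int num_rqs, int start)
{
	int n = 0, i;

	for (i = start; i < num_rqs; i++)
		n++;

	return n;
}
```

With FAIR, num_rqs is 1, so a karma loop starting at DRM_SCHED_PRIORITY_HIGH would visit zero run queues and could never find the guilty entity; starting at DRM_SCHED_PRIORITY_KERNEL visits the single queue.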
struct drm_sched_rq *rq = sched->sched_rq[i];
spin_lock(&rq->lock);
diff --git a/drivers/gpu/drm/scheduler/sched_rq.c b/drivers/gpu/drm/scheduler/sched_rq.c
index 09d316bc3dfa..b868c794cc9d 100644
--- a/drivers/gpu/drm/scheduler/sched_rq.c
+++ b/drivers/gpu/drm/scheduler/sched_rq.c
@@ -16,6 +16,24 @@ drm_sched_entity_compare_before(struct rb_node *a, const struct rb_node *b)
return ktime_before(ea->oldest_job_waiting, eb->oldest_job_waiting);
}
+static void drm_sched_rq_update_prio(struct drm_sched_rq *rq)
+{
+ enum drm_sched_priority prio = -1;
+ struct rb_node *rb;
nit:
"node" might be a better name than rb. When iterating over a list we
also typically call the iterator sth like "head" and not "list".
But no hard feelings on that change.
I am following the convention from drm_sched_rq_select_entity_fifo() to
avoid someone complaining I was diverging from the pattern established
in the same file. ;)
+
+ lockdep_assert_held(&rq->lock);
+
+ rb = rb_first_cached(&rq->rb_tree_root);
+ if (rb) {
+ struct drm_sched_entity *entity =
+ rb_entry(rb, typeof(*entity), rb_tree_node);
+
+ prio = entity->priority; /* Unlocked read */
Why an unlocked read? Why is that OK? The comment could detail that.
Fair point, expanded the explanation.
+ }
+
+ rq->head_prio = prio;
+}
+
static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
struct drm_sched_rq *rq)
{
@@ -25,6 +43,7 @@ static void drm_sched_rq_remove_fifo_locked(struct drm_sched_entity *entity,
if (!RB_EMPTY_NODE(&entity->rb_tree_node)) {
rb_erase_cached(&entity->rb_tree_node, &rq->rb_tree_root);
RB_CLEAR_NODE(&entity->rb_tree_node);
+ drm_sched_rq_update_prio(rq);
}
}
@@ -46,6 +65,7 @@ static void drm_sched_rq_update_fifo_locked(struct drm_sched_entity *entity,
rb_add_cached(&entity->rb_tree_node, &rq->rb_tree_root,
drm_sched_entity_compare_before);
+ drm_sched_rq_update_prio(rq);
}
/**
@@ -63,6 +83,114 @@ void drm_sched_rq_init(struct drm_sched_rq *rq,
INIT_LIST_HEAD(&rq->entities);
rq->rb_tree_root = RB_ROOT_CACHED;
rq->sched = sched;
+ rq->head_prio = -1;
head_prio is an enum.
Better to give the enum an entry like:
PRIO_INVALID = -1,
Ok.
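What that suggestion might look like (sketch only; the real enum lives in include/drm/gpu_scheduler.h, the `SKETCH_` names are placeholders, and only the INVALID entry would be new):

```c
/* Hypothetical shape of the priority enum with an explicit invalid
 * entry, so head_prio can be reset without assigning a bare -1. */
enum sketch_drm_sched_priority {
	SKETCH_PRIORITY_INVALID = -1,	/* new: "run queue has no head" marker */
	SKETCH_PRIORITY_KERNEL = 0,
	SKETCH_PRIORITY_HIGH,
	SKETCH_PRIORITY_NORMAL,
	SKETCH_PRIORITY_LOW,
};
```

An assignment like `rq->head_prio = SKETCH_PRIORITY_INVALID;` then stays type-correct and self-documenting.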
+}
+
+static ktime_t
+drm_sched_rq_get_min_vruntime(struct drm_sched_rq *rq)
+{
+ struct drm_sched_entity *entity;
+ struct rb_node *rb;
+
+ lockdep_assert_held(&rq->lock);
+
+ for (rb = rb_first_cached(&rq->rb_tree_root); rb; rb = rb_next(rb)) {
+ entity = rb_entry(rb, typeof(*entity), rb_tree_node);
+
+ return entity->stats->vruntime; /* Unlocked read */
Seems the read is unlocked because we just don't care about it racing?
If there is a platform which tears ktime_t writes I suppose this could
read garbage. I am not sure if there is. Perhaps safer to add the lock
around it nevertheless.
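The locked variant being suggested could take roughly this shape (userspace sketch with a pthread mutex standing in for the kernel spinlock; struct and function names are illustrative, not from the patch):

```c
#include <pthread.h>
#include <stdint.h>

/* Cut-down stand-in for struct drm_sched_entity_stats. */
struct sketch_stats {
	pthread_mutex_t lock;
	int64_t vruntime;
};

/* Read under the lock so a 64-bit value can never be observed
 * half-written on platforms that tear such accesses. */
static int64_t sketch_read_vruntime(struct sketch_stats *s)
{
	int64_t v;

	pthread_mutex_lock(&s->lock);
	v = s->vruntime;
	pthread_mutex_unlock(&s->lock);

	return v;
}
```

The cost is one lock round-trip per read; the benefit is that the "Unlocked read" comment and its tearing caveat disappear entirely.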