GitHub user Nataneljpwd added a comment to the discussion: Redesign the 
scheduler logic to avoid starvation due to dropped tasks in critical section

It looks like a simpler solution here would be to design the query so that it
always returns max_tis tasks that can actually be sent to run. This can be done
by windowing over pools, which would solve the linked issue. However, I think
the scheduler prioritization should also change: if the same situation as in
the issue occurs within a single pool, we will get the same result.
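
To make the windowing idea concrete, here is a minimal pure-Python sketch of
what a `ROW_NUMBER() OVER (PARTITION BY pool ORDER BY priority DESC)` filter
would do. The `Task` type, the `open_slots` mapping and the function name are
hypothetical and only for illustration; this is not the actual scheduler query.

```python
from collections import defaultdict
from typing import NamedTuple


class Task(NamedTuple):  # hypothetical stand-in for a task instance row
    task_id: str
    pool: str
    priority: int


def select_schedulable(tasks, open_slots, max_tis):
    """Pick up to max_tis tasks, never more per pool than it has open slots."""
    rank_in_pool = defaultdict(int)
    picked = []
    # Highest priority first; sorted() is stable, so ties keep input order.
    for task in sorted(tasks, key=lambda t: -t.priority):
        if len(picked) == max_tis:
            break
        rank_in_pool[task.pool] += 1
        # Drop rows whose per-pool rank exceeds that pool's open slots.
        if rank_in_pool[task.pool] <= open_slots.get(task.pool, 0):
            picked.append(task)
    return picked


tasks = [Task(f"full-{i}", "full", 10) for i in range(5)]
tasks += [Task(f"free-{i}", "free", 1) for i in range(3)]
picked = select_schedulable(tasks, {"full": 1, "free": 5}, max_tis=3)
print([t.task_id for t in picked])
```

Without the per-pool rank, the three highest-priority rows would all come from
the nearly full pool and only one of them could run.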

I thought about changing the priority into a weight, so that lower-priority
tasks also get to run. For example:

Assume:
Dag A - 1000 tasks with priority 5 concurrency 10
Dag B - 100 tasks with priority 2 concurrency 3
Same pool, short tasks

As of now, Airflow will finish running Dag A before it starts running Dag B,
because the query returns the 32 tasks with the highest priority, so Dag B
starves.

If we change the priority to be a weight, meaning that Dag A gets 5 slots for
every 2 slots Dag B gets, both Dags will run at the same time and will complete
faster overall. This will require a window function over pools and priorities.
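
As a rough sketch of the 5:2 split, the snippet below divides the 32 returned
tasks in proportion to the weights and hands out the slots lost to rounding by
largest remainder. The function name and the rounding rule are my own
assumptions, and the per-Dag concurrency limits from the example are ignored.

```python
def split_slots(weights, total_slots):
    """Split total_slots across keys in proportion to their weights."""
    weight_sum = sum(weights.values())
    # Integer share first; keep the remainder to decide who gets the rest.
    shares = {k: total_slots * w // weight_sum for k, w in weights.items()}
    remainders = {k: total_slots * w % weight_sum for k, w in weights.items()}
    leftover = total_slots - sum(shares.values())
    # Hand out the slots lost to rounding, largest remainder first.
    for key in sorted(weights, key=lambda k: -remainders[k])[:leftover]:
        shares[key] += 1
    return shares


print(split_slots({"dag_a": 5, "dag_b": 2}, total_slots=32))
```

With these inputs Dag A ends up with 23 of the 32 slots and Dag B with 9, so
Dag B makes progress in every scheduling loop instead of waiting for Dag A to
finish.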

There are some edge cases, such as the example above but with priorities of 100
and 1. This can be solved by trying to give at least 1 slot to each priority.
But in that case, what if we have more than 32 priorities? Choose the top 32
and work only on them?
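
One way this edge case could look, as a sketch only: serve at most
`total_slots` distinct priorities, reserve one slot for each, and share the
rest by weight. The function name is hypothetical, it assumes a non-empty list
of positive priorities, and sending the rounding leftovers to the top priority
is just one of several possible choices.

```python
def split_with_floor(priorities, total_slots):
    """Give each served priority one slot, then share the rest by weight."""
    # With more priorities than slots, only the highest ones are served.
    served = sorted(set(priorities), reverse=True)[:total_slots]
    slots = {p: 1 for p in served}
    spare = total_slots - len(served)
    weight_sum = sum(served)
    for p in served:
        slots[p] += spare * p // weight_sum
    # Rounding can leave a few slots unassigned; give them to the top priority.
    slots[served[0]] += total_slots - sum(slots.values())
    return slots


print(split_with_floor([100, 1], total_slots=32))
print(len(split_with_floor(list(range(1, 41)), total_slots=32)))
```

For priorities 100 and 1 the low priority keeps its single guaranteed slot and
the remaining 31 go to priority 100.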

A possible solution is to add a configuration option for the maximum number of
tasks scheduled for the top priority, or to ignore the numbers themselves and
go off the ranking (largest, second largest, and so on) while splitting the
available slots as fairly as possible according to priority. This has problems
of its own: what if we decide to schedule 4 tasks for a given priority but
there are only 2 tasks to schedule? Do we give the leftover slots to the next
in line, or to the highest priority?
The implementation could be made fully configurable, with a strategy pattern
for the leftover slots. However, that might not be the best option, as it adds
complexity to the system (ideally the system would decide dynamically what to
do).
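
To illustrate the leftover question, here is a small sketch with the two
strategies mentioned above as a switch. The function, the parameter names and
the strategy labels ("next_in_line", "top_priority") are hypothetical, not
existing Airflow options.

```python
def cap_and_redistribute(allocated, available, leftover_to="next_in_line"):
    """Cap each priority at its available tasks and pass on unused slots."""
    order = sorted(allocated, reverse=True)  # highest priority first
    final = {p: min(allocated[p], available.get(p, 0)) for p in order}
    for i, p in enumerate(order):
        spare = allocated[p] - final[p]
        if spare <= 0:
            continue
        if leftover_to == "next_in_line":
            # Lower priorities right after p first, then wrap to the top.
            candidates = order[i + 1:] + order[:i]
        else:  # "top_priority": always start from the highest priority
            candidates = [q for q in order if q != p]
        for q in candidates:
            give = min(available.get(q, 0) - final[q], spare)
            final[q] += give
            spare -= give
            if spare == 0:
                break
    return final


allocated = {5: 4, 3: 4, 1: 4}    # slots decided per priority
available = {5: 10, 3: 2, 1: 10}  # tasks actually waiting per priority
print(cap_and_redistribute(allocated, available, "next_in_line"))
print(cap_and_redistribute(allocated, available, "top_priority"))
```

Here priority 3 can only use 2 of its 4 slots; the first strategy moves the
other 2 to priority 1, the second moves them to priority 5.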

I would love to hear any suggestions you might have either for simplification 
or improvement.

GitHub link: 
https://github.com/apache/airflow/discussions/49160#discussioncomment-12817885
