GitHub user Nataneljpwd added a comment to the discussion: Redesign the scheduler logic to avoid starvation due to dropped tasks in critical section
It looks like the simpler solution here would be to design the query so that it always returns max_tis tasks that can actually be sent to run. This can be done by windowing over pools, which would solve the linked issue. However, I think the scheduler prioritization should also change: if the same situation as in the issue occurs within a single pool, we will get the same result.

I thought about treating the priority as a weight, so that lower-priority tasks also get to run. For example, assume:

Dag A - 1000 tasks with priority 5, concurrency 10
Dag B - 100 tasks with priority 2, concurrency 3
Same pool, short tasks

As of now, Airflow will finish running Dag A before it starts running Dag B, because the query returns the 32 tasks with the highest priority, and Dag B will starve. If we change the priority to be a weight, meaning that Dag A gets 5 slots for every 2 slots Dag B gets, both Dags will run at the same time and will complete faster overall. This will require a window function over pools and priorities.

There are some edge cases. Take the example above but with priorities of 100 and 1: this can be solved by trying to give at least 1 slot to each priority. But in that case, what if we have more than 32 priorities? Choose the top 32 and work only on them? A possible solution is to add a configuration option for the maximum number of tasks scheduled for the top priority, or to ignore the numbers themselves and go by rank (largest, second largest, and so on), while splitting the available slots as fairly as possible according to priority. That has problems of its own: what if we decide to schedule 4 tasks for a given priority but there are only 2 tasks to schedule? Give the spare slots to the next in line? Or to the highest priority?
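To make the weighted split concrete, here is a minimal Python sketch of one way it could work. It is not Airflow code: the function name allocate_slots and its arguments are hypothetical, priorities are assumed to be positive integers, per-Dag concurrency limits are ignored, and "leftover slots go to the highest priority first" is just one of the possible strategies.

```python
def allocate_slots(total_slots: int, pending: dict[int, int]) -> dict[int, int]:
    """Split total_slots across priorities, using each priority value as a weight.

    pending maps priority -> number of schedulable tasks at that priority.
    Hypothetical helper for illustration; priorities must be positive integers.
    """
    # If there are more priorities than slots, keep only the highest ones.
    priorities = sorted((p for p, n in pending.items() if n > 0), reverse=True)
    priorities = priorities[:total_slots]
    if not priorities:
        return {}

    # Guarantee one slot per priority, so priority 1 next to 100 still runs.
    alloc = {p: 1 for p in priorities}
    remaining = total_slots - len(priorities)

    # Share the rest in proportion to the weights, capped by available tasks.
    total_weight = sum(priorities)
    for p in priorities:
        share = remaining * p // total_weight
        alloc[p] += min(share, pending[p] - 1)

    # Leftover strategy (one option among several): highest priority first.
    leftover = total_slots - sum(alloc.values())
    for p in priorities:
        extra = min(leftover, pending[p] - alloc[p])
        alloc[p] += extra
        leftover -= extra
    return alloc


# The example above: 32 slots, Dag A at priority 5, Dag B at priority 2.
print(allocate_slots(32, {5: 1000, 2: 100}))  # {5: 23, 2: 9}
```

With priorities of 100 and 1 the same call gives 31 slots to the first and 1 to the second, so the low priority still makes progress. When a priority has fewer tasks than its share, the unused slots fall through to the leftover loop, which is exactly the place where a different strategy could be plugged in.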
The implementation could be made fully configurable, with a strategy pattern for the leftover slots. However, that might not be the best option, as it would add complexity to the system (ideally the system would decide dynamically what to do). I would love to hear any suggestions you might have, either for simplification or improvement.

GitHub link: https://github.com/apache/airflow/discussions/49160#discussioncomment-12817885
