[
https://issues.apache.org/jira/browse/AIRFLOW-3607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17198270#comment-17198270
]
ASF GitHub Bot commented on AIRFLOW-3607:
-----------------------------------------
yuqian90 opened a new pull request #11010:
URL: https://github.com/apache/airflow/pull/11010
This is cherry-picked from #4751 for `v1-10-test`.
Some investigation into the issue reported in #10790 led to the discovery
that this loop in `scheduler_job.py` takes almost 90% of the time in
`SchedulerJob.process_file()` for large DAGs (around 500 tasks). This causes
the `DagFileProcessor` spawned by the scheduler to go slowly. The reason this
loop is slow is that it creates a new `DepContext` for every `ti`. And every
`DepContext` needs to populate its own `finished_tasks` even though this list
is the same for every `DagRun`.
```python
for ti in tis:
task = dag.get_task(ti.task_id)
# fixme: ti.task is transient but needs to be set
ti.task = task
if ti.are_dependencies_met(
dep_context=DepContext(flag_upstream_failed=True),
session=session):
self.log.debug('Queuing task: %s', ti)
task_instances_list.append(ti.key)
```
This is the flamegraph generated by `py-spy` showing the performance of
`DagFileProcessor` in Airflow 1.10.12 before this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg
This is the performance after 1.10.12 is patched with this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg
The nice thing is that #4751 already addressed this issue for master branch.
We just need to cherry-pick it to fix this in 1.10.* with some very minor
conflict fixes.
While this PR will not fix every scenario that causes #10790, it does reduce
the `DagFileProcessor` time from around 100s to just about 12s for our use case
(a DAG with about 500 tasks, many of them are sensors in `reschedule` mode with
`poke_interval` 60s.).
Original commit message in #4751:
This decreases scheduler delay between tasks by about 20% for larger DAGs,
sometimes more for larger or more complex DAGs.
The delay between tasks can be a major issue, especially when we have dags
with
many subdags, figures out that the scheduling process spends plenty of time
in
dependency checking, we took the trigger rule dependency which calls the db
for
each task instance, we made it call the db just once for each dag_run
(cherry picked from commit 50efda5c69c1ddfaa869b408540182fb19f1a286)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Decreasing scheduler delay between tasks
> ----------------------------------------
>
> Key: AIRFLOW-3607
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3607
> Project: Apache Airflow
> Issue Type: Improvement
> Components: scheduler
> Affects Versions: 1.10.0, 1.10.1, 1.10.2
> Environment: ubuntu 14.04
> Reporter: Amichai Horvitz
> Assignee: Amichai Horvitz
> Priority: Major
> Fix For: 2.0.0
>
> Original Estimate: 336h
> Remaining Estimate: 336h
>
> I came across the TODO in airflow/ti_deps/deps/trigger_rule_dep (line 52)
> that says instead of checking the query for every task let the tasks report
> to the dagrun. I have a dag with many tasks and the delay between tasks can
> rise to 10 seconds or more, I already changed the configuration, added
> processes and memory, checked the code and did research, profiling and other
> experiments. I hope that this change will make a drastic change in the delay.
> I would be happy to discuss this solution, the research and other solutions
> for this issue.
> Thanks
--
This message was sent by Atlassian Jira
(v8.3.4#803005)