It’s not even a core change; given the bug only affects Kube operators, the fix
can be included in the next provider release.
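For anyone following along, the mechanism Jigar describes below can be illustrated with a minimal sketch. The names here are hypothetical and this only approximates Airflow's behaviour (the real serializer hashes the full serialized DAG representation, not a hand-built dict), but it shows why mutating a DAG's param defaults in place before serialization produces a new dag hash:

```python
import hashlib
import json


def dag_hash(serialized: dict) -> str:
    # Stand-in for Airflow's hash of the serialized DAG:
    # a stable JSON dump, hashed with MD5.
    return hashlib.md5(
        json.dumps(serialized, sort_keys=True).encode()
    ).hexdigest()


# Serialized form of a DAG whose param "env" defaults to "dev".
dag = {"dag_id": "example", "params": {"env": "dev"}}
before = dag_hash(dag)

# A zombie-task failure callback runs process_params(), which copies the
# *run conf* value over the param default in the in-memory DAG object...
dag["params"]["env"] = "prod"  # value from dag_run.conf
after = dag_hash(dag)

# ...so the next serialization writes a new row with a new hash, even
# though the DAG file itself never changed.
print(before != after)  # True
```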

-ash

> On 6 Aug 2025, at 08:59, Jarek Potiuk <[email protected]> wrote:
> 
> Yes. We merge about 20 PRs a day - that is 140 PRs a week. Probably ~10% of
> those are fixes to some (more or less obscure) piece of core logic. You
> should take a look at all the PRs that are getting merged and reviewed.
> 
> This is just one of them. There is no need for the thousands of people
> subscribed to the devlist to drop everything they are doing to read about
> this particular thing.
> 
> Don't take it as a personal critique - this is just friendly information
> about how we do things here. When I wrote my first message to the devlist,
> I received a polite and helpful response that gently told me that things
> are done differently here. I took it as a lesson, and 6 years later I am
> one of the PMC members, and one who cares about the community a lot.
> Hopefully you can treat this message in a similar way.
> 
> J
> 
> On Wed, Aug 6, 2025 at 9:26 AM Jigar Parekh <[email protected]> wrote:
> 
>> Well, my email is not about the single PR or a follow-up on that PR. It
>> refers to an issue in the core logic that results in a DAG hash change.
>> 
>> Jigar
>> 
>>> On Aug 6, 2025, at 12:20 AM, Jarek Potiuk <[email protected]> wrote:
>>> 
>>> Yep. Thanks for the heads up.
>>> 
>>> We saw both the PR and the issue, and it is scheduled for 3.0.5 - it did
>>> not make it into 3.0.4. I think it would be good if you confirmed in your
>>> PR that you applied the patch and showed some evidence of what happened -
>>> before and after - rather than only a "word" explanation. Words and
>>> textual descriptions are often open to interpretation, but if you show
>>> what happens before you applied the patch and after, that would make it
>>> way easier to confirm that it works as expected.
>>> 
>>> And then just patiently remind people in your PR if things are not
>>> reviewed - like in other PRs and fixes.
>>> 
>>> Also, just for your information (and as an educational message for anyone
>>> reading here): we should avoid sending such single-PR, relatively obscure
>>> issue-related messages to the devlist.
>>> 
>>> We try to reserve devlist communication for important information that
>>> affects Airflow decisions, all contributors, or the way we do
>>> development, and for discussions about the future of our development and
>>> important features.
>>> We rarely (if at all) use it to discuss individual bug fixes and PRs
>>> (unless they are absolutely critical fixes that need to be addressed
>>> immediately), because it adds a lot of noise to our inboxes. Devlist
>>> discussions are the ones we should really focus on - most people in the
>>> community should read, and at least think about, the things posted to the
>>> devlist - so posting about a single bug and PR adds a lot of cognitive
>>> overload for everyone. It's better to keep such messages in the PRs and
>>> issues on GitHub.
>>> 
>>> Empathy towards everyone in the community is an important part of playing
>>> "well" in the community, so I hope we all understand and follow that.
>>> 
>>> J.
>>> 
>>> 
>>> 
>>> 
>>>> On Wed, Aug 6, 2025 at 8:53 AM Jigar Parekh <[email protected]> wrote:
>>>> 
>>>> I have been looking into Airflow metadata database-level bottlenecks. In
>>>> my analysis so far, I have observed that a change of the DAG hash at run
>>>> time, for any reason, has a significant negative impact on the database,
>>>> because it blocks DAG run updates for last scheduling, resulting in
>>>> higher lock waits and, in many instances, lock wait timeouts. I recently
>>>> opened issue #53957 showing one instance where the DAG hash changes just
>>>> because the template field order is different, and I also suggested a
>>>> fix in PR #54041.
>>>> 
>>>> Troubleshooting the lock waits further, I have come across a scenario
>>>> that is rare but results in an unnecessary DAG hash change. This, in my
>>>> opinion, needs community experts’ attention and review. The details are
>>>> below.
>>>> 
>>>> Airflow version: 2.x (and also 3.x, based on the code)
>>>> Airflow config:
>>>> Executor: Kubernetes (k8s)
>>>> AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: False
>>>> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK: 0
>>>> AIRFLOW__CORE__PARALLELISM: 250
>>>> DAG: any DAG with DAG Params and multiple retries for tasks with a
>>>> retry callback
>>>> 
>>>> Steps:
>>>> 1. Trigger the DAG, overriding the param default value.
>>>> 2. Create a zombie task in the run, e.g. remove the executor pod while
>>>> the task is running.
>>>> 3. Observe the scheduler log (enable debug if possible) and the
>>>> serialized dag table: the dag hash is updated with a new value. If you
>>>> compare with the old serialized value in the data column, you will see
>>>> that the difference is that the new serialized value now has param
>>>> values from the run that had the zombie task failure.
>>>> 4. This results in an additional dag run update statement alongside the
>>>> last scheduling update statement, which takes longer to execute when
>>>> multiple tasks are executing simultaneously. This multiplies further if
>>>> multiple zombie task failures happen at the same time in different runs
>>>> with different Param values.
>>>> 
>>>> Code analysis: (I have looked at the code for tag 2.10.5 because I am
>>>> using that version in production, but the latest code appears similar
>>>> in logic.) Based on the code analysis, I see that the DAG processor in
>>>> the scheduler executes callbacks before serialization of the DAG: the
>>>> process_file function in processor.py calls the handle_failure function
>>>> in taskinstance.py, which ends up calling get_template_context, whose
>>>> process_params call updates the param values to the values from the DAG
>>>> run conf. This causes the param default values to change in the
>>>> serialized DAG, and hence the DAG hash value.
>>>> 
>>>> It appears that handle_failure is called in other scenarios where
>>>> updating the param values to the ones from the DAG run conf may be
>>>> required, but in this scenario it does not seem to be required. So far
>>>> I have been unable to find any way to resolve this problem.
>>>> 
>>>> I hope this information helps to understand the problem.
>>>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>> 
>> 


