It’s not even a core change. Given the bug only affects Kube operators, the fix can be included in the next provider release.
-ash

> On 6 Aug 2025, at 08:59, Jarek Potiuk <[email protected]> wrote:
>
> Yes. We merge about 20 PRs a day - that is 140 PRs a week. Probably ~10% of
> those are fixes to some (more or less obscure) core logic. You should take a
> look at all the PRs that are getting merged and reviewed.
>
> This is just one of them. There is no need for the 1000s of people who are
> subscribed to the devlist to drop everything they do and read about this
> particular thing.
>
> Don't take it as a personal critique - this is just friendly information
> about how we do stuff here. When I wrote my first message to the devlist I
> had a very polite and nice response that was helpful, and nicely told me
> that here we do things differently. I took it as a lesson, and 6 years later
> I am one of the PMC members and one that cares about the community a lot.
> Hopefully you can treat that message in a similar way.
>
> J
>
> On Wed, Aug 6, 2025 at 9:26 AM Jigar Parekh <[email protected]> wrote:
>
>> Well, my email is not about the single PR or a follow-up on that PR. It is
>> referring to an issue in the core logic that results in a DAG hash change.
>>
>> Jigar
>>
>>> On Aug 6, 2025, at 12:20 AM, Jarek Potiuk <[email protected]> wrote:
>>>
>>> Yep. Thanks for the heads up.
>>>
>>> We saw both the PR and the issue, and it is scheduled for 3.0.5 - it did
>>> not make it into 3.0.4. I think it would be good if you confirm in your PR
>>> that you applied the patch and show some evidence of what happened -
>>> before and after, not only a "word" explanation. Words and textual
>>> descriptions are often prone to interpretation, but if you show what
>>> happens before you applied the patch and after, that would make it way
>>> easier to confirm that it works as expected.
>>>
>>> And then just patiently remind people in your PR if things are not
>>> reviewed - like in other PRs and fixes.
>>>
>>> Also, just for your information (and anyone looking here, as an
>>> educational message): we should avoid sending such single-PR, relatively
>>> obscure issue-related messages to the devlist.
>>>
>>> We try to reserve devlist communication for important information that
>>> affects Airflow decisions, all contributors, the way we do development,
>>> discussions about the future of our development, and important feature
>>> discussions. We rarely (if at all) use it to discuss individual bug fixes
>>> and PRs (unless those are absolutely critical fixes that need to be
>>> addressed immediately), because it adds a lot of noise to our inboxes.
>>> Devlist discussions are the ones that we should really focus on - most
>>> people in the community should read and at least think about the things we
>>> post on the devlist - so posting about a single bug and PR adds a lot of
>>> cognitive overload for everyone. It's better to keep such messages to the
>>> PRs and issues in GitHub.
>>>
>>> Empathy towards all the people in the community is an important part of
>>> playing "well" in the community, so I hope we all understand that and
>>> follow it.
>>>
>>> J.
>>>
>>>> On Wed, Aug 6, 2025 at 8:53 AM Jigar Parekh <[email protected]> wrote:
>>>>
>>>> I have been looking into Airflow metadata database level bottlenecks. In
>>>> my analysis so far, I observed that a change of the DAG hash at run time,
>>>> for any reason, has a significant negative impact on the database because
>>>> it blocks DAG run updates for last scheduling, resulting in higher lock
>>>> waits and, in many instances, lock wait timeouts.
>>>> I recently opened issue #53957 showing one instance where the DAG hash
>>>> changes just because the template field order is different, and I also
>>>> suggested a fix with PR #54041. Troubleshooting the lock waits further, I
>>>> have come across a scenario that is rare but results in an unnecessary
>>>> DAG hash change. This, in my opinion, needs community experts' attention
>>>> and review. The details are below.
>>>>
>>>> Airflow version: 2.x (also 3.x, based on the code)
>>>> Airflow config:
>>>> Executor: k8
>>>> AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION: False
>>>> AIRFLOW__CORE__MAX_NUM_RENDERED_TI_FIELDS_PER_TASK: 0
>>>> AIRFLOW__CORE__PARALLELISM: 250
>>>> DAG: any DAG with DAG Params and multiple retries for tasks with a retry
>>>> callback
>>>>
>>>> Steps:
>>>> 1. Trigger the DAG, overriding the param default value
>>>> 2. Create a zombie task in the run, e.g. remove the executor pod while
>>>> the task is running
>>>> 3. Observe the scheduler log (enable debug if possible) and the
>>>> serialized_dag table: the DAG hash is updated with a new value. If you
>>>> compare with the old serialized value in the data column, you will see
>>>> that the difference is that the new serialized value now has param values
>>>> from the run that had the zombie task failure
>>>> 4. This results in an additional dag run update statement alongside the
>>>> last scheduling update statement, which takes longer to execute when you
>>>> have multiple tasks executing simultaneously. This multiplies further if
>>>> a DAG has multiple zombie task failures at the same time from different
>>>> runs with different Param values
>>>>
>>>> Code analysis (I have looked at the code for tag 2.10.5 because I am
>>>> using that version in production, but the latest code appears to be
>>>> similar in logic):
>>>>
>>>> Based on the code analysis, I see that the DAG processor in the scheduler
>>>> executes callbacks before serialization of the DAG in processor.py ->
>>>> process_file, which calls taskinstance.py -> handle_failure, which ends
>>>> up calling get_template_context, whose process_params call updates the
>>>> param values to the values from the DAG run conf. This causes the param
>>>> default value to change in the serialized DAG and, in turn, the DAG hash
>>>> value to change.
>>>>
>>>> It appears that handle_failure is called in other scenarios where
>>>> updating param values to the ones from the DAG run conf may be required,
>>>> but in this scenario it does not seem to be required. So far I have been
>>>> unable to find any way to resolve this problem.
>>>>
>>>> I hope this information helps to understand the problem.
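
To make the effect concrete, here is a minimal standalone sketch in plain Python (not Airflow's actual serializer or hashing code) of why folding a run's overridden param values back into the DAG's param defaults changes the serialized-DAG hash. The dag_hash helper and the dict shapes are hypothetical simplifications.

    import hashlib
    import json

    def dag_hash(serialized_dag: dict) -> str:
        # Hash a canonical JSON form; stands in for however the scheduler
        # decides whether the serialized_dag row is stale.
        return hashlib.md5(
            json.dumps(serialized_dag, sort_keys=True).encode()
        ).hexdigest()

    # Serialized form with the param default as defined in the DAG file.
    before = {"dag_id": "example_dag", "params": {"env": "dev"}}

    # Same DAG after a failure callback has let the run's conf leak into the
    # param default (the scenario described above): the DAG file is unchanged,
    # but the serialized form differs.
    after = {"dag_id": "example_dag", "params": {"env": "prod"}}

    print(dag_hash(before) == dag_hash(after))  # False -> extra serialized_dag / dag_run updates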
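
And a similarly simplified model of the call chain described in the thread (handle_failure -> get_template_context -> process_params). The class and function bodies here are hypothetical stand-ins, not Airflow's implementation; they only show how a params update performed in place during a callback would make the next serialization pass pick up run-specific values.

    class Dag:
        def __init__(self, dag_id, params):
            self.dag_id = dag_id
            self.params = params  # DAG-level param defaults, shared across runs

    def process_params(dag, run_conf):
        # Models the step described above: the run's conf overrides the params.
        # Mutating dag.params in place is what makes the override visible to
        # the serialization that follows the callback.
        dag.params.update(run_conf)
        return dict(dag.params)

    def handle_failure(dag, run_conf):
        # Stand-in for the failure-callback path; only the params step is modelled.
        return process_params(dag, run_conf)

    def serialize(dag):
        return {"dag_id": dag.dag_id, "params": dict(dag.params)}

    dag = Dag("example_dag", {"env": "dev"})
    first = serialize(dag)
    handle_failure(dag, {"env": "prod"})  # zombie-task failure for a run that overrode the param
    second = serialize(dag)
    print(first["params"], second["params"])  # {'env': 'dev'} {'env': 'prod'} -> hash changes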
