hkc-8010 commented on issue #65011: URL: https://github.com/apache/airflow/issues/65011#issuecomment-4234059964
Additional evidence: this is not limited to deferrable execution.

We now have a second sanitized run on Airflow 3.1.8 where the customer switched to a non-deferrable subclass (`NonDeferrableMetlGlueJobOperator`), and the same duplicate-XCom pattern still reproduced.

Run id: `manual__2026-04-10T15:34:19.499833+00:00`

Scheduler-side timeline for that run:

```text
2026-04-10 15:34:20Z try 1 queued/running
2026-04-10 15:36:59Z try 1 finished as up_for_retry
2026-04-10 15:41:59Z try 2 queued/running
2026-04-10 15:44:41Z try 2 finished as failed
```

There was no `DEFERRED` state anywhere in this run.

Worker / API timeline:

```text
2026-04-10 15:34:24Z try 1 POST xcom key `glue_job_run_details` -> 409 Conflict
2026-04-10 15:36:58Z try 1 POST xcom key `return_value` -> 409 Conflict
2026-04-10 15:42:01Z try 2 POST xcom key `glue_job_run_details` -> 409 Conflict
2026-04-10 15:44:40Z try 2 POST xcom key `return_value` -> 409 Conflict
```

Representative sanitized errors:

```text
409 Conflict
The XCom with key: `glue_job_run_details` with mentioned task instance already exists.
```

```text
409 Conflict
The XCom with key: `return_value` with mentioned task instance already exists.
```

One especially suspicious detail from the retry boundary:

- at try-2 startup we observed the API delete only `_link_GlueJobRunDetailsLink`
- we did not observe deletion of `glue_job_run_details`
- we did not observe deletion of `return_value`

The relevant request sequence around try 2 looked like:

```text
15:42:00.099Z GET _link_GlueJobRunDetailsLink -> 200
15:42:00.114Z DELETE _link_GlueJobRunDetailsLink -> 200
15:42:01.911Z POST glue_job_run_details -> 409
15:44:40.881Z POST return_value -> 409
15:44:40.895Z GET glue_job_run_details -> 404
15:44:40.914Z POST _link_GlueJobRunDetailsLink -> 201
```

So the issue is broader than the original deferrable-only framing. `deferrable=False` did not avoid the failure.

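
For anyone who wants to run the same check against their own sanitized logs, here is a minimal sketch that flags keys whose `POST` returned 409 without an earlier successful `DELETE` in the same window. The log line shape (`<time> <METHOD> <key> -> <status>`) is only the condensed format used in the try-2 request sequence above, not an Airflow log format. The function name `conflicts_without_delete` is hypothetical.

```python
import re

# Request sequence copied from the try-2 excerpt in this comment.
LOG = """\
15:42:00.099Z GET _link_GlueJobRunDetailsLink -> 200
15:42:00.114Z DELETE _link_GlueJobRunDetailsLink -> 200
15:42:01.911Z POST glue_job_run_details -> 409
15:44:40.881Z POST return_value -> 409
15:44:40.895Z GET glue_job_run_details -> 404
15:44:40.914Z POST _link_GlueJobRunDetailsLink -> 201
"""

# Assumed condensed format: "<time> <METHOD> <key> -> <status>".
LINE_RE = re.compile(r"^(\S+)\s+(GET|POST|DELETE)\s+(\S+)\s+->\s+(\d{3})$")


def conflicts_without_delete(log: str) -> list[str]:
    """Return keys whose POST got 409 with no earlier 2xx DELETE in the log."""
    deleted = set()
    flagged = []
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        _timestamp, method, key, status = match.groups()
        if method == "DELETE" and status.startswith("2"):
            deleted.add(key)
        elif method == "POST" and status == "409" and key not in deleted:
            flagged.append(key)
    return flagged


print(conflicts_without_delete(LOG))
```

On the excerpt above this reports `glue_job_run_details` and `return_value`, and not `_link_GlueJobRunDetailsLink`, which matches what we saw by hand.
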
The stronger current hypothesis is that there is a broader XCom cleanup / duplicate-create problem for Glue-related keys across retries, and deferral may only be one way to surface it.
