hkc-8010 commented on issue #65011:
URL: https://github.com/apache/airflow/issues/65011#issuecomment-4234059964

   Additional evidence: this is not limited to deferrable execution.
   
   We now have a second sanitized run on Airflow 3.1.8 where the customer 
switched to a non-deferrable subclass (`NonDeferrableMetlGlueJobOperator`), and 
the same duplicate-XCom pattern still reproduced.
   
   Run id:
   `manual__2026-04-10T15:34:19.499833+00:00`
   
   Scheduler-side timeline for that run:
   
   ```text
   2026-04-10 15:34:20Z  try 1 queued/running
   2026-04-10 15:36:59Z  try 1 finished as up_for_retry
   2026-04-10 15:41:59Z  try 2 queued/running
   2026-04-10 15:44:41Z  try 2 finished as failed
   ```
   
   There was no `DEFERRED` state anywhere in this run.
   
   Worker / API timeline:
   
   ```text
   2026-04-10 15:34:24Z  try 1 POST xcom key `glue_job_run_details` -> 409 Conflict
   2026-04-10 15:36:58Z  try 1 POST xcom key `return_value` -> 409 Conflict
   2026-04-10 15:42:01Z  try 2 POST xcom key `glue_job_run_details` -> 409 Conflict
   2026-04-10 15:44:40Z  try 2 POST xcom key `return_value` -> 409 Conflict
   ```
   
   Representative sanitized errors:
   
   ```text
   409 Conflict
   The XCom with key: `glue_job_run_details` with mentioned task instance already exists.
   ```
   
   ```text
   409 Conflict
   The XCom with key: `return_value` with mentioned task instance already exists.
   ```
   
   One especially suspicious detail from the retry boundary:
   - at try-2 startup we observed the API delete only `_link_GlueJobRunDetailsLink`
   - we did not observe deletion of `glue_job_run_details`
   - we did not observe deletion of `return_value`
   
   The relevant request sequence around try 2 looked like:
   
   ```text
   15:42:00.099Z  GET    _link_GlueJobRunDetailsLink  -> 200
   15:42:00.114Z  DELETE _link_GlueJobRunDetailsLink  -> 200
   15:42:01.911Z  POST   glue_job_run_details         -> 409
   15:44:40.881Z  POST   return_value                 -> 409
   15:44:40.895Z  GET    glue_job_run_details         -> 404
   15:44:40.914Z  POST   _link_GlueJobRunDetailsLink  -> 201
   ```
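
To make the gap at the retry boundary easier to check, here is a minimal sketch that replays the request sequence above and lists the keys whose POST returned 409 without a successful DELETE earlier in the same excerpt. The `Request` tuple and the `parse` and `conflicts_without_delete` helpers are hypothetical names written for this comment, not Airflow APIs, and the input is just the sanitized log lines copied verbatim.

```python
from typing import NamedTuple

# Request sequence copied from the try-2 log excerpt above.
LOG = """\
15:42:00.099Z  GET    _link_GlueJobRunDetailsLink  -> 200
15:42:00.114Z  DELETE _link_GlueJobRunDetailsLink  -> 200
15:42:01.911Z  POST   glue_job_run_details         -> 409
15:44:40.881Z  POST   return_value                 -> 409
15:44:40.895Z  GET    glue_job_run_details         -> 404
15:44:40.914Z  POST   _link_GlueJobRunDetailsLink  -> 201
"""


class Request(NamedTuple):
    """One parsed log line (hypothetical helper type, not an Airflow class)."""

    timestamp: str
    method: str
    key: str
    status: int


def parse(log: str) -> list[Request]:
    """Split each non-empty line into timestamp, method, key and status."""
    requests = []
    for line in log.splitlines():
        if not line.strip():
            continue
        timestamp, method, key, _arrow, status = line.split()
        requests.append(Request(timestamp, method, key, int(status)))
    return requests


def conflicts_without_delete(requests: list[Request]) -> list[str]:
    """Return keys whose POST got 409 with no successful DELETE before it."""
    deleted: set[str] = set()
    flagged: list[str] = []
    for req in requests:
        if req.method == "DELETE" and 200 <= req.status < 300:
            deleted.add(req.key)
        elif req.method == "POST" and req.status == 409 and req.key not in deleted:
            flagged.append(req.key)
    return flagged


print(conflicts_without_delete(parse(LOG)))
```

On this excerpt the sketch flags `glue_job_run_details` and `return_value`, which matches the two keys for which we saw no deletion at try-2 startup. It only reflects what is visible in these log lines; a DELETE issued outside the captured window would not be seen.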
   
   So the issue is broader than the original deferrable-only framing: 
`deferrable=False` did not avoid the failure. Our current, stronger hypothesis 
is that there is a more general XCom cleanup / duplicate-create problem for 
Glue-related keys across retries, and deferral may be only one way to surface 
it.
   


