shaleena commented on issue #65011:
URL: https://github.com/apache/airflow/issues/65011#issuecomment-4243649820
I want to add that this is not limited to manual retries / clears.
A lot of our failures were reproducible after clearing the task instance or
retrying the same DAG run, but we also observed cases where the scheduled run
itself failed with the same duplicate-XCom pattern, without us manually
rerunning it first.
So manual retry / clear makes the issue easier to reproduce, but it does not
appear to be the only trigger.
**Workaround behavior**
We tested disabling the default XCom behavior entirely:

- set `do_xcom_push = False`
- avoid returning a value (so no `return_value` XCom is written)
- manually push a custom key instead
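As a minimal sketch of that pattern, assuming a deferrable Glue-style operator: the `StubTI` class below is a stand-in for Airflow's `TaskInstance` so the snippet runs without an Airflow install; in a real operator you would use `context["ti"]` directly, and the key name is just the one we happened to use.

```python
class StubTI:
    """Stand-in for the XCom interface of Airflow's TaskInstance."""

    def __init__(self):
        self.xcoms = {}

    def xcom_push(self, key, value):
        self.xcoms[key] = value


def execute_complete(context, event=None):
    # Pull the run ID out of the trigger event and push it under a
    # custom key, instead of relying on the default return_value XCom.
    run_id = event.get("run_id") if event else None
    if run_id:
        context["ti"].xcom_push(key="glue_run_id_debug", value=run_id)
    # Returning None means no return_value XCom is written for the task.
    return None


ti = StubTI()
execute_complete({"ti": ti}, event={"run_id": "jr_abc123"})
print(ti.xcoms)  # {'glue_run_id_debug': 'jr_abc123'}
```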
In deferrable mode, the stable version was to extract the run ID from the
trigger event in execute_complete() and push it under a custom key:
```python
def execute_complete(self, context, event=None, **kwargs):
    # Extract the run ID from the trigger event and push it under a
    # custom key, rather than relying on the default return_value XCom.
    run_id = event.get("run_id") if event else None
    if run_id:
        context["ti"].xcom_push(key="glue_run_id_debug", value=run_id)
    super().execute_complete(context, event=event, **kwargs)
    # Return None so no return_value XCom is pushed for this task.
    return None
```
With this setup:

- tasks succeed consistently
- clearing / retrying no longer produces 409 errors
- the standard keys `return_value` and `glue_job_run_details` are no longer written in these runs
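One practical consequence of `return_value` no longer being written: downstream tasks have to pull the custom key explicitly. A rough sketch, again with a stubbed `TaskInstance` so it runs without Airflow, and a hypothetical `task_id`:

```python
class StubTI:
    """Stand-in for the XCom-pull side of Airflow's TaskInstance."""

    def __init__(self, store):
        # store maps (task_id, key) -> value, mimicking the XCom table
        self.store = store

    def xcom_pull(self, task_ids, key="return_value"):
        return self.store.get((task_ids, key))


store = {("run_glue_job", "glue_run_id_debug"): "jr_abc123"}
ti = StubTI(store)

# The default pull (key="return_value") now finds nothing:
print(ti.xcom_pull(task_ids="run_glue_job"))  # None

# Pulling the custom key works:
print(ti.xcom_pull(task_ids="run_glue_job", key="glue_run_id_debug"))  # jr_abc123
```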