rawwar opened a new issue, #52152:
URL: https://github.com/apache/airflow/issues/52152

   ### Apache Airflow Provider(s)
   
   amazon
   
   ### Versions of Apache Airflow Providers
   
   main
   
   ### Apache Airflow version
   
   main
   
   ### Operating System
   
   mac
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   _No response_
   
   ### What happened
   
   Currently, `GlueJobHook`'s `async_get_job_state` and `get_job_state` do not handle any exceptions raised by `get_job_run` in botocore and aiobotocore. A customer has been facing an intermittent issue where the `GlueJobOperator` fails on Airflow even though the job succeeded on AWS.
   
   
   ```
   [2025-06-21, 07:06:21 UTC] {baseoperator.py:1787} ERROR - Trigger failed:
   Traceback (most recent call last):
     File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 558, in cleanup_finished_triggers
       result = details["task"].result()
                ^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 630, in run_trigger
       async for event in trigger.run():
     File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/triggers/glue.py", line 73, in run
       await hook.async_job_completion(self.job_name, self.run_id, self.verbose)
     File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 314, in async_job_completion
       job_run_state = await self.async_get_job_state(job_name, run_id)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 215, in async_get_job_state
       job_run = await client.get_job_run(JobName=job_name, RunId=run_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/usr/local/lib/python3.12/site-packages/aiobotocore/client.py", line 412, in _make_api_call
       raise error_class(parsed_response, operation_name)
   botocore.exceptions.ClientError: An error occurred (HttpTimeoutException) when calling the GetJobRun operation: Could not write request before timeout
   [2025-06-21, 07:06:21 UTC] {taskinstance.py:3310} ERROR - Task failed with exception
   ```
   
   ### What you think should happen instead
   
   The hook should handle exceptions raised by `get_job_run` gracefully and retry the call, so that a transient error does not fail the task.
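
One possible shape for the retry, as a minimal sketch rather than the provider's actual implementation. The names `call_with_retries`, `TransientError`, and `FlakyClient` are hypothetical and exist only for illustration. `TransientError` stands in for `botocore.exceptions.ClientError` so that the snippet runs without botocore installed, and the response shape assumes the `JobRun` / `JobRunState` keys returned by `GetJobRun`.

```python
import asyncio


class TransientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (assumption for this sketch)."""


async def call_with_retries(
    func, *args, attempts=3, base_delay=1.0, retry_on=(TransientError,), **kwargs
):
    """Await func(*args, **kwargs), retrying on the given exception types.

    Sleeps base_delay * 2**n seconds between tries and re-raises the last
    error once `attempts` calls have failed.
    """
    for attempt in range(attempts):
        try:
            return await func(*args, **kwargs)
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            await asyncio.sleep(base_delay * 2**attempt)


class FlakyClient:
    """Hypothetical client whose get_job_run fails a few times, then succeeds."""

    def __init__(self, failures=2):
        self.failures = failures
        self.calls = 0

    async def get_job_run(self, JobName, RunId):
        self.calls += 1
        if self.calls <= self.failures:
            raise TransientError("Could not write request before timeout")
        return {"JobRun": {"JobRunState": "SUCCEEDED"}}


async def get_job_state(client, job_name, run_id):
    """Fetch the job run state, retrying transient failures of get_job_run."""
    job_run = await call_with_retries(
        client.get_job_run, JobName=job_name, RunId=run_id, base_delay=0.0
    )
    return job_run["JobRun"]["JobRunState"]


if __name__ == "__main__":
    print(asyncio.run(get_job_state(FlakyClient(), "my-job", "jr_123")))
```

In the real hook, the retry would probably need to be limited to error codes that are known to be transient, since retrying every `ClientError` could hide genuine failures such as a missing job or denied access.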
   
   ### How to reproduce
   
   The failure is intermittent, and I am not sure how to reproduce it reliably. However, it can be reproduced during development by adding a test that mocks `self.conn.get_job_run` to raise an exception.
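
A rough sketch of such a test follows. It uses a hypothetical `MiniHook` class in place of `GlueJobHook` and a `FakeClientError` in place of `botocore.exceptions.ClientError`, so it runs without Airflow or botocore installed. A real test would patch the provider's hook instead.

```python
from unittest import mock


class FakeClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (assumption for this sketch)."""


class MiniHook:
    """Hypothetical, stripped-down stand-in for GlueJobHook."""

    def __init__(self, conn):
        self.conn = conn

    def get_job_state(self, job_name, run_id):
        # No exception handling here, mirroring the behaviour described in the issue.
        job_run = self.conn.get_job_run(JobName=job_name, RunId=run_id)
        return job_run["JobRun"]["JobRunState"]


def reproduce():
    """Make the mocked get_job_run raise and return the propagated error message."""
    conn = mock.Mock()
    conn.get_job_run.side_effect = FakeClientError(
        "Could not write request before timeout"
    )
    hook = MiniHook(conn)
    try:
        hook.get_job_state("my-job", "jr_123")
    except FakeClientError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(reproduce())
```

Setting `side_effect` to a list such as `[FakeClientError("..."), {"JobRun": {"JobRunState": "SUCCEEDED"}}]` would let the same test check that a retry recovers after one failed call.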
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

