rawwar opened a new issue, #52152:
URL: https://github.com/apache/airflow/issues/52152
### Apache Airflow Provider(s)
amazon
### Versions of Apache Airflow Providers
main
### Apache Airflow version
main
### Operating System
mac
### Deployment
Other
### Deployment details
_No response_
### What happened
Currently, `GlueJobHook`'s `async_get_job_state` and `get_job_state` do
not handle any exceptions raised by `get_job_run` in botocore and aiobotocore. A
customer has been facing an intermittent issue where the `GlueJobOperator` fails
on Airflow even though the job succeeded on AWS:
```
[2025-06-21, 07:06:21 UTC] {baseoperator.py:1787} ERROR - Trigger failed:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 558, in cleanup_finished_triggers
    result = details["task"].result()
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/jobs/triggerer_job_runner.py", line 630, in run_trigger
    async for event in trigger.run():
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/triggers/glue.py", line 73, in run
    await hook.async_job_completion(self.job_name, self.run_id, self.verbose)
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 314, in async_job_completion
    job_run_state = await self.async_get_job_state(job_name, run_id)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/airflow/providers/amazon/aws/hooks/glue.py", line 215, in async_get_job_state
    job_run = await client.get_job_run(JobName=job_name, RunId=run_id)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiobotocore/client.py", line 412, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (HttpTimeoutException) when calling the GetJobRun operation: Could not write request before timeout
[2025-06-21, 07:06:21 UTC] {taskinstance.py:3310} ERROR - Task failed with exception
```
### What you think should happen instead
We should gracefully handle exceptions raised by `get_job_run` and add retries for transient failures.
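
One possible shape for the retry logic, as a minimal sketch rather than a proposed patch: an async helper that retries a call when the raised exception carries a transient error code. `call_with_retries`, `is_retryable`, and `RETRYABLE_CODES` are hypothetical names; only `HttpTimeoutException` comes from the traceback above, and the other code is an illustrative assumption. The helper reads the `response` attribute that botocore's `ClientError` exposes, so it does not need to import botocore.

```python
import asyncio
import random

# Assumption: which error codes count as transient. Only HttpTimeoutException
# is taken from the traceback in this issue; ThrottlingException is illustrative.
RETRYABLE_CODES = {"HttpTimeoutException", "ThrottlingException"}


def is_retryable(exc: Exception) -> bool:
    """Return True if exc looks like a ClientError with a transient error code."""
    response = getattr(exc, "response", None)
    if not isinstance(response, dict):
        return False
    return response.get("Error", {}).get("Code") in RETRYABLE_CODES


async def call_with_retries(func, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Await func(*args, **kwargs), retrying transient errors with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await func(*args, **kwargs)
        except Exception as exc:
            # Re-raise on the last attempt or for errors we do not consider transient.
            if attempt == max_attempts or not is_retryable(exc):
                raise
            # Exponential backoff with a little jitter before the next attempt.
            delay = base_delay * 2 ** (attempt - 1)
            await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Inside `async_get_job_state`, the call could then look like `await call_with_retries(client.get_job_run, JobName=job_name, RunId=run_id)`. The synchronous `get_job_state` would need an equivalent without `await`.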
### How to reproduce
It's intermittent, and I'm not sure how to reproduce it reliably. However, we
can reproduce it during development by adding a test that mocks
`self.conn.get_job_run` to raise an exception.
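
A rough sketch of that reproduction, kept free of Airflow and botocore imports so it runs anywhere. `StubClientError` and the module-level `get_job_state` are hypothetical stand-ins for `botocore.exceptions.ClientError` and the hook method; a real test would patch the hook's `conn` instead.

```python
from unittest import mock


class StubClientError(Exception):
    """Stand-in for botocore.exceptions.ClientError (hypothetical, illustration only)."""

    def __init__(self, code, operation_name):
        super().__init__(
            f"An error occurred ({code}) when calling the {operation_name} operation"
        )
        self.response = {"Error": {"Code": code}}


def get_job_state(conn, job_name, run_id):
    """Simplified stand-in for the hook method: no exception handling at all."""
    job_run = conn.get_job_run(JobName=job_name, RunId=run_id)
    return job_run["JobRun"]["JobRunState"]


def reproduce():
    """Return the exception that escapes when get_job_run raises."""
    conn = mock.MagicMock()
    # Make the mocked client raise the same error code seen in the traceback.
    conn.get_job_run.side_effect = StubClientError("HttpTimeoutException", "GetJobRun")
    try:
        get_job_state(conn, "example_job", "example_run")
    except StubClientError as exc:
        return exc
    return None
```

Calling `reproduce()` returns the unhandled error, which is the behaviour a fix would need to change for transient failures.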
### Anything else
_No response_
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)