kaxil commented on issue #65010:
URL: https://github.com/apache/airflow/issues/65010#issuecomment-4227713013
## Triage: Root Cause Analysis
I investigated the diff between 3.1.8 and 3.2.0 and identified multiple
compounding causes for the connection pool exhaustion. The issue is **not** a
single leak but several changes that collectively increase how long each DB
connection is held and how many connections are created.
### Primary cause: New `Log` INSERTs inside `FOR UPDATE` transactions
**Commit** `ed237dff7c` ("Fix task-level audit logs missing success/running
events")
3.1.8 had **zero** `Log` writes in the execution API. 3.2.0 added
`session.add(Log(...))` in two places inside transactions that hold `FOR
UPDATE` row locks on the `task_instance` table:
- [`ti_run()` line
209](https://github.com/apache/airflow/blob/3.2.0/airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py#L209)
-- fires on every QUEUED/RESTARTING -> RUNNING transition
- [`ti_update_state()` line
420](https://github.com/apache/airflow/blob/3.2.0/airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py#L420)
-- fires on every terminal state transition
The `FOR UPDATE` lock is acquired at line 153 (`with_for_update(of=TI)`) and
held until the session commits (in the `_get_session()` dependency cleanup).
The `Log` INSERT must complete before commit releases the lock. Under high task
throughput, a slow `log` table INSERT (index contention, table bloat) directly
extends connection hold time.
For a DAG with 100 tasks, this adds 200+ extra INSERTs per DAG run that
didn't exist in 3.1.8.
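To illustrate the shape of the problem (not the actual Airflow code -- model and column names here are simplified stand-ins, and SQLite ignores `FOR UPDATE`, but the transaction scope is identical):

```python
from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class TaskInstance(Base):
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    state = Column(String)

class Log(Base):
    __tablename__ = "log"
    id = Column(Integer, primary_key=True)
    event = Column(String)

engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(TaskInstance(id=1, state="queued"))
    session.commit()

with Session(engine) as session:
    # Row lock acquired here (FOR UPDATE on real databases) ...
    ti = session.query(TaskInstance).filter_by(id=1).with_for_update().one()
    ti.state = "running"
    # ... new in 3.2.0: an audit-log INSERT inside the same transaction.
    session.add(Log(event="running"))
    # The lock is released only when the session commits, so a slow `log`
    # INSERT directly extends how long the connection and row lock are held.
    session.commit()
```

The key point is that the `Log` INSERT sits between lock acquisition and commit, so its latency is added to the connection hold time of every state transition.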
### Contributing cause: New lazy-loaded `ti.dag_run` inside FOR UPDATE
**Commit** `d659d38965` ("Introduce parent task spans")
[`_emit_task_span()` at line
442](https://github.com/apache/airflow/blob/3.2.0/airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py#L442)
accesses `ti.dag_run`, triggering a lazy-loaded SELECT while the `FOR UPDATE`
lock is still held. This is a new N+1 query pattern.
### Contributing cause: Wasted async sessions on every connection/variable lookup
**Commits** `d32091cd4f`, `a4f3417bb0` (team boundary checks)
[`get_team_name_dep`](https://github.com/apache/airflow/blob/3.2.0/airflow-core/src/airflow/api_fastapi/execution_api/security.py#L225)
takes `AsyncSessionDep` as a FastAPI dependency. Even when `multi_team=False`
(the default), FastAPI resolves the dependency *before* calling the function --
so an async session is created, a DB connection is checked out, committed
(no-op), and returned on **every** connection/variable request.
In 3.1.8, the async engine pool was essentially dormant (zero usage in
execution API). In 3.2.0, it's hit on every connection/variable fetch from
workers.
### Contributing cause: Async engine has no pool health configuration
The sync engine gets `pool_recycle=1800` and `pool_pre_ping=True` from
Airflow config. The async engine uses SQLAlchemy defaults: `pool_recycle=-1`
(never) and `pool_pre_ping=False`. This was the same in 3.1.8, but back then
the async pool was dormant.
With `pool_recycle=-1`, idle connections are never refreshed. PostgreSQL
idle timeouts, pgbouncer, or firewall rules can kill connections server-side,
but the pool doesn't detect this without `pool_pre_ping`.
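A fix would pass the async engine the same pool-health keywords the sync engine already gets. Sketched here with a sync SQLite engine purely so the snippet runs anywhere; in Airflow this would be the `create_async_engine(...)` call in `settings.py`, which accepts the same arguments:

```python
from sqlalchemy import create_engine, text

engine = create_engine(
    "sqlite://",
    pool_recycle=1800,   # refresh connections idle for more than 30 minutes
    pool_pre_ping=True,  # probe each connection before handing it out
)
with engine.connect() as conn:
    assert conn.execute(text("select 1")).scalar() == 1
```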
### Contributing cause: `dispose_orm()` doesn't dispose async engine
[`dispose_orm()`](https://github.com/apache/airflow/blob/3.2.0/airflow-core/src/airflow/settings.py#L592-L613)
only disposes the sync engine. On gunicorn worker exit or process termination,
async pool connections are abandoned.
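A sketch of what a complete `dispose_orm()` would cover (a `FakeAsyncEngine` stub stands in for `sqlalchemy.ext.asyncio.AsyncEngine` so no async DB driver is needed here; only the coroutine `dispose()` signature matters, and the function below is illustrative, not Airflow's actual implementation):

```python
import asyncio

from sqlalchemy import create_engine

sync_engine = create_engine("sqlite://")

class FakeAsyncEngine:
    """Stand-in for AsyncEngine; dispose() is a coroutine, as in SQLAlchemy."""
    def __init__(self):
        self.disposed = False

    async def dispose(self):
        self.disposed = True

async_engine = FakeAsyncEngine()

def dispose_orm():
    sync_engine.dispose()                # what 3.2.0 already does
    # The missing step: the async engine's coroutine dispose() must be
    # driven from an event loop before the process exits.
    asyncio.run(async_engine.dispose())

dispose_orm()
```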
### Net effect
Each execution API request holds a DB connection longer than in 3.1.8 due to
Log INSERTs + lazy loads inside `FOR UPDATE` transactions. New async pool
pressure from `get_team_name_dep` doubles the connection footprint. Under high
task throughput, the pool saturates because connections are returned slower
than they're checked out. With pgbouncer in transaction mode, the long-held
transactions amplify the problem.
### Suggested workarounds until a fix is available
1. Increase `sql_alchemy_pool_size` and `sql_alchemy_max_overflow` in
`[database]` config
2. Increase PostgreSQL `max_connections`
3. If using pgbouncer, increase `default_pool_size` and `max_client_conn`
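For workaround 1, the `[database]` section would look like this (values are placeholders to show the shape, not recommendations -- size them against your PostgreSQL `max_connections` and worker count):

```ini
[database]
sql_alchemy_pool_size = 10
sql_alchemy_max_overflow = 20
```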