dpgaspar commented on PR #34826:
URL: https://github.com/apache/superset/pull/34826#issuecomment-3878724814
> > So I think we need to revisit the architecture to come up with a design where we can coordinate connection pooling across the whole worker fleet.
>
> Agree with @villebro. To illustrate the problem in more detail:
>
> #### 1. Connection Pool Multiplication
> In `SINGLETON` mode, each pod/Celery worker has its own `EngineManager` instance:
>
> ```
> Pod 1: EngineManager → Pool(size=5) → 5 connections
> Pod 2: EngineManager → Pool(size=5) → 5 connections
> Pod 3: EngineManager → Pool(size=5) → 5 connections
> ...
> ```
>
> With 10 pods and pool size 5, you could hit 50 connections to the database, not 5. This can easily exhaust database connection limits, especially for managed databases with strict limits.
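> A rough back-of-the-envelope sketch of the multiplication (illustrative numbers; actual Superset pool settings may differ):
> ```
> # All values are assumptions for illustration, not Superset defaults.
> pods = 10
> pool_size = 5        # connections each pool keeps open
> max_overflow = 10    # SQLAlchemy's default lets a pool grow under load
> steady_state = pods * pool_size                   # 50 connections
> worst_case = pods * (pool_size + max_overflow)    # 150 connections
> ```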
>
> #### 2. SSH Tunnel Multiplication
> Same problem with SSH tunnels:
>
> ```
> # Each pod creates its own tunnel cache
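> # (process-local: this dict is never shared across forked workers or pods)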
> self._tunnels: dict[TunnelKey, SSHTunnelForwarder] = {}
> ```
>
> If you have 20 Celery workers across multiple pods, you could have 20 separate SSH tunnels to the same bastion host. Many SSH servers limit concurrent connections per user (often 10 by default).
>
> #### 3. Celery Prefork Workers Make It Worse
> Celery's default prefork model spawns multiple worker processes:
>
> ```
> Pod 1:
> └── Celery Worker (prefork, concurrency=4)
>     ├── Process 1 → EngineManager → Pool
>     ├── Process 2 → EngineManager → Pool
>     ├── Process 3 → EngineManager → Pool
>     └── Process 4 → EngineManager → Pool
> ```
>
> Each forked process gets its own memory space, so connection pools aren't shared even within a single pod.
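> A common pattern (standard SQLAlchemy-with-Celery hygiene, not Superset code, with a hypothetical engine URL) illustrates the process boundary: each prefork child has to discard any pool inherited from the parent and open its own connections, which is exactly why pools multiply per process.
> ```
> # Sketch only: not Superset code; engine URL and pool size are placeholders.
> from celery.signals import worker_process_init
> from sqlalchemy import create_engine
> # Hypothetical engine created at import time in the parent worker process.
> engine = create_engine("postgresql://analytics-db/examples", pool_size=5)
> @worker_process_init.connect
> def reset_pool(**kwargs):
>     # Connections inherited across fork() are unsafe to reuse, so each child
>     # disposes the parent's pool and lazily opens its own connections.
>     engine.dispose()
> ```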
>
> #### 4. No Cluster-Wide Coordination
> The cache keys are deterministic but there's no coordination:
>
> ```
> # manager.py:258
> engine_key = self._get_engine_key(database, catalog, schema, source, user_id)
> ```
>
> Each process independently decides to create connections without knowing cluster-wide state.
>
> #### What's Missing for True Connection Pooling
> For distributed connection pooling, you'd typically need:
>
> 1. External connection pooler like PgBouncer, ProxySQL, or cloud-native poolers (RDS Proxy, Cloud SQL Proxy)
> 2. Centralized state via Redis or similar for tunnel coordination (see the sketch after this list)
> 3. Connection limits per-pod that account for cluster size
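> As a rough illustration of item 2, a Redis-backed lease could cap the number of concurrent tunnels cluster-wide; the key names, slot count, and TTL below are purely illustrative assumptions, not an existing Superset mechanism.
> ```
> # Sketch only: cluster-wide cap on tunnels to one bastion, via Redis leases.
> import redis
> r = redis.Redis(host="redis", port=6379)  # hypothetical shared Redis instance
> def try_acquire_tunnel_slot(tunnel_key: str, max_slots: int = 10, ttl: int = 300) -> bool:
>     """Claim one of max_slots short-lived leases before opening a tunnel."""
>     for slot in range(max_slots):
>         # SET NX atomically claims a free slot; EX lets leases from crashed
>         # workers expire instead of leaking. A real design would renew the
>         # lease while the tunnel is open and release it on teardown.
>         if r.set(f"tunnel-slot:{tunnel_key}:{slot}", "held", nx=True, ex=ttl):
>             return True
>     return False
> ```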
I generally agree, these are valid concerns. On a generic web app, connection pooling is highly desirable, and the same logic applies here to a certain extent.
Connection pooling provides measurable performance and stability benefits, and the reduced connection churn improves database health. A small pool smooths that churn and stabilizes connection usage. Also, given Superset's nature (longer-lived connections), pooling avoids the connection-storm effect:
Dashboard refresh → burst of queries
→ burst of connection creations
→ burst of TLS/auth
→ DB CPU spike
Yes, pools multiply across pods and prefork workers. That doesn't mean pooling is wrong; it means we must size pools intentionally and cap pod HPA scaling so that the worst-case scenario stays predictable.
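As a rough sizing sketch (every number below is a placeholder, not a recommendation), the per-process pool budget can be derived from the database's connection limit and the maximum fleet size the HPA may reach:
```
# Illustrative sizing math only; plug in real limits for a given deployment.
db_max_connections = 200        # analytics database connection limit
reserved_for_other_clients = 50
hpa_max_pods = 10               # cap on pod autoscaling
processes_per_pod = 4           # Celery prefork concurrency
budget = db_max_connections - reserved_for_other_clients
per_process_budget = max(1, budget // (hpa_max_pods * processes_per_pod))
# per_process_budget == 3, i.e. pool_size + max_overflow per worker process
```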
In Preset's case, and probably many others, PgBouncer or ProxySQL are not viable options.