dpgaspar commented on PR #34826:
URL: https://github.com/apache/superset/pull/34826#issuecomment-3878724814
> > So I think we need to revisit the architecture to come up with a design where we can coordinate connection pooling across the whole worker fleet.
>
> Agree with @villebro. To illustrate the problem in more detail:
>
> #### 1. Connection Pool Multiplication
> In `SINGLETON` mode, each pod/Celery worker has its own `EngineManager` instance:
>
> ```
> Pod 1: EngineManager → Pool(size=5) → 5 connections
> Pod 2: EngineManager → Pool(size=5) → 5 connections
> Pod 3: EngineManager → Pool(size=5) → 5 connections
> ...
> ```
>
> With 10 pods and pool size 5, you could hit 50 connections to the database, not 5. This can easily exhaust database connection limits, especially for managed databases with strict limits.
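> A rough back-of-the-envelope sketch of the multiplication (illustrative numbers; actual Superset pool settings may differ):
> ```
> # All values are assumptions for illustration, not Superset defaults.
> pods = 10
> pool_size = 5        # connections each pool keeps open
> max_overflow = 10    # SQLAlchemy's default lets a pool grow under load
> steady_state = pods * pool_size                   # 50 connections
> worst_case = pods * (pool_size + max_overflow)    # 150 connections
> ```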
>
> #### 2. SSH Tunnel Multiplication
> Same problem with SSH tunnels:
>
> ```
> # Each pod creates its own tunnel cache
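> # (process-local: this dict is never shared across forked workers or pods)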
> self._tunnels: dict[TunnelKey, SSHTunnelForwarder] = {}
> ```
>
> If you have 20 Celery workers across multiple pods, you could have 20 separate SSH tunnels to the same bastion host. Many SSH servers limit concurrent connections per user (often 10 by default).
>
> #### 3. Celery Prefork Workers Make It Worse
> Celery's default prefork model spawns multiple worker processes:
>
> ```
> Pod 1:
> └── Celery Worker (prefork, concurrency=4)
>     ├── Process 1 → EngineManager → Pool
>     ├── Process 2 → EngineManager → Pool
>     ├── Process 3 → EngineManager → Pool
>     └── Process 4 → EngineManager → Pool
> ```
>
> Each forked process gets its own memory space, so connection pools aren't shared even within a single pod.
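> A common pattern (standard SQLAlchemy-with-Celery hygiene, not Superset code, with a hypothetical engine URL) illustrates the process boundary: each prefork child has to discard any pool inherited from the parent and open its own connections, which is exactly why pools multiply per process.
> ```
> # Sketch only: not Superset code; engine URL and pool size are placeholders.
> from celery.signals import worker_process_init
> from sqlalchemy import create_engine
> # Hypothetical engine created at import time in the parent worker process.
> engine = create_engine("postgresql://analytics-db/examples", pool_size=5)
> @worker_process_init.connect
> def reset_pool(**kwargs):
>     # Connections inherited across fork() are unsafe to reuse, so each child
>     # disposes the parent's pool and lazily opens its own connections.
>     engine.dispose()
> ```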
>
> #### 4. No Cluster-Wide Coordination
> The cache keys are deterministic but there's no coordination:
>
> ```
> # manager.py:258
> engine_key = self._get_engine_key(database, catalog, schema, source, user_id)
> ```
>
> Each process independently decides to create connections without knowing cluster-wide state.
>
> #### What's Missing for True Connection Pooling
> For distributed connection pooling, you'd typically need:
>
> 1. External connection pooler like PgBouncer, ProxySQL, or cloud-native poolers (RDS Proxy, Cloud SQL Proxy)
> 2. Centralized state via Redis or similar for tunnel coordination (see the sketch after this list)
> 3. Connection limits per-pod that account for cluster size
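> As a rough illustration of item 2, a Redis-backed lease could cap the number of concurrent tunnels cluster-wide; the key names, slot count, and TTL below are purely illustrative assumptions, not an existing Superset mechanism.
> ```
> # Sketch only: cluster-wide cap on tunnels to one bastion, via Redis leases.
> import redis
> r = redis.Redis(host="redis", port=6379)  # hypothetical shared Redis instance
> def try_acquire_tunnel_slot(tunnel_key: str, max_slots: int = 10, ttl: int = 300) -> bool:
>     """Claim one of max_slots short-lived leases before opening a tunnel."""
>     for slot in range(max_slots):
>         # SET NX atomically claims a free slot; EX lets leases from crashed
>         # workers expire instead of leaking. A real design would renew the
>         # lease while the tunnel is open and release it on teardown.
>         if r.set(f"tunnel-slot:{tunnel_key}:{slot}", "held", nx=True, ex=ttl):
>             return True
>     return False
> ```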
I generally agree, these are valid concerns. On a generic web app, connection pooling is highly desirable, and the same logic applies here to a certain extent.
Connection pooling provides measurable performance and stability benefits, and the reduced connection churn improves database health. A small pool smooths that churn and stabilizes connection usage. Also, given Superset's nature (longer-lived connections), pooling avoids the connection-storm effect:
Dashboard refresh → burst of queries
→ burst of connection creations
→ burst of TLS/auth
→ DB CPU spike
Yes, pools multiply across pods and prefork workers. That doesn't mean pooling is wrong; it means we must size pools intentionally and cap pod HPA scaling so that the worst-case scenario stays predictable.
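As a rough sizing sketch (every number below is a placeholder, not a recommendation), the per-process pool budget can be derived from the database's connection limit and the maximum fleet size the HPA may reach:
```
# Illustrative sizing math only; plug in real limits for a given deployment.
db_max_connections = 200        # analytics database connection limit
reserved_for_other_clients = 50
hpa_max_pods = 10               # cap on pod autoscaling
processes_per_pod = 4           # Celery prefork concurrency
budget = db_max_connections - reserved_for_other_clients
per_process_budget = max(1, budget // (hpa_max_pods * processes_per_pod))
# per_process_budget == 3, i.e. pool_size + max_overflow per worker process
```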
In Preset's case, and probably many others, PgBouncer or ProxySQL are not viable options.