qswawrq opened a new issue, #17178:
URL: https://github.com/apache/pinot/issues/17178

   # ZooKeeper Contention and Linear Performance Degradation with High-Parallelism Batch Ingestion
   
   ## Environment
   - **Pinot Version**: apachepinot/pinot:latest
   - **Deployment**: Kubernetes
   - **Cluster Size**: 1 Controller, 1 ZooKeeper, 3 Brokers, 3 Servers
   - **ZooKeeper Config**:
     - `ZOO_SNAPCOUNT=100000`
     - `ZOO_AUTOPURGE_INTERVAL=1`
     - `ZOO_AUTOPURGE_RETAIN_COUNT=5`
     - Heap: 1GB
     - Storage: 20GB used
   
   ## Problem Description
   
   We're experiencing severe ZooKeeper contention and linear performance 
degradation during batch ingestion of 50,000 parquet files (each around 350 MB) 
using 100 parallel Kubernetes Job workers with `jobType: 
SegmentCreationAndMetadataPush`.
   
   ### Performance Degradation Timeline
   
   | Time Elapsed | Files/Worker | Total Segments | Time per File | Degradation Factor |
   |--------------|--------------|----------------|---------------|---------------------|
   | Initial | 0-5 | 500 | 2 minutes | 1x (baseline) |
   | +3 hours | 8-9 | 1,002 | 20 minutes | 10x |
   | +30 hours | 43-44 | 4,438 | 88 minutes | 44x |
   | +50 hours | 73-76 | 6,150 | 175 minutes | 87x |
   
   **Performance continues to degrade linearly as segment count grows.**
   
   **Critical Finding**: We tested with 2, 5, and 100 workers. Linear degradation occurs at **all parallelism levels**, indicating that the bottleneck is not concurrency alone but the **O(n) cost of reading/writing the growing segment list**. The ingestion workers spend only a small fraction of their time building segments locally; roughly 99% of their time goes to pushing segment metadata.
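   
   One rough way to watch this cost is to sample the controller's segment listing alongside the segment count as the table grows. The snippet below is only a minimal sketch; the controller address, table name, and polling interval are placeholders rather than our exact tooling:
   
   ```bash
   # Minimal sketch: sample segment count and segment-list latency every 5 minutes.
   # CONTROLLER and TABLE are placeholders; adjust for the actual deployment.
   CONTROLLER=http://pinot-controller:9000
   TABLE=post_metrics
   
   while true; do
     # Count segment names in the listing (each appears as a quoted string).
     count=$(curl -s "${CONTROLLER}/segments/${TABLE}?type=OFFLINE" \
       | grep -o "\"${TABLE}_OFFLINE_[^\"]*\"" | wc -l)
     # Time a second listing call end to end.
     secs=$(curl -s -o /dev/null -w '%{time_total}' "${CONTROLLER}/segments/${TABLE}?type=OFFLINE")
     echo "$(date -u +%FT%TZ) segments=${count} list_time_s=${secs}"
     sleep 300
   done
   ```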
   
   ## Observed Behavior
   
   ### 1. ZooKeeper Metadata Update Failures (17% Error Rate)
   
   Segment uploads frequently fail with optimistic locking errors. In the last 
10,000 log lines, we observed **111 ZK version conflicts out of 661 upload 
attempts** (16.8% failure rate):
   
   ```
   2025/11/09 15:47:07.055 ERROR [ZKOperator] [jersey-server-managed-async-executor-128]
   Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0,
   table: post_metrics_OFFLINE, expected version: 3428
   
   2025/11/09 15:47:07.055 ERROR [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-128]
   Exception while uploading segment: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0,
   table: post_metrics_OFFLINE, expected version: 3428
   java.lang.RuntimeException: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0,
   table: post_metrics_OFFLINE, expected version: 3428
        at org.apache.pinot.controller.api.upload.ZKOperator.processExistingSegment(ZKOperator.java:341)
        at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:120)
        at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:433)
   ```
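   
   The `expected version` in the error is the znode version the controller read before its conditional write, i.e. the optimistic-locking check failing on the segment's metadata node. A quick way to see how contended that node is would be to stat it directly in ZooKeeper. This is a sketch only: the Helix cluster name (`pinot`) in the path is an assumption, and `zkCli.sh` may need its full path inside the bitnami container.
   
   ```bash
   # Compare the znode's current dataVersion against the "expected version"
   # from the error above. The cluster name "pinot" in the path is an assumption.
   kubectl exec pinot-zookeeper-0 -- zkCli.sh -server localhost:2181 stat \
     /pinot/PROPERTYSTORE/SEGMENTS/post_metrics_OFFLINE/post_metrics_OFFLINE_20315_20344_000000021514_0
   ```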
   
   ### 2. Large ZooKeeper Transaction Logs
   
   ZooKeeper transaction logs have grown to **multi-GB sizes**:
   
   ```bash
   $ kubectl exec pinot-zookeeper-0 -- ls -lh /bitnami/zookeeper/data/version-2/
   
   -rw-r--r-- 1 1001 1001 129M Nov 10 16:42 log.1c4167f
   -rw-r--r-- 1 1001 1001  65M Nov 10 17:00 log.1c58844
   -rw-r--r-- 1 1001 1001 1.0G Nov 10 17:28 log.1c78fc6
   -rw-r--r-- 1 1001 1001 1.9G Nov 10 17:44 log.1c86b58
   -rw-r--r-- 1 1001 1001 2.2G Nov 10 18:32 log.1cc3e55  ← Largest log
   
   Total ZK data directory: 5.4GB
   ```
   
   Despite autopurge being enabled (`ZOO_AUTOPURGE_INTERVAL=1`, `ZOO_SNAPCOUNT=100000`), individual transaction logs grow to 2.2GB before a snapshot is taken. We are not sure whether this contributes to the performance degradation.
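   
   To check whether ZooKeeper itself is the choke point while the push jobs run, its own latency and data-size counters can be sampled. This assumes the four-letter-word commands are whitelisted (`ZOO_4LW_COMMANDS_WHITELIST` on the bitnami image) and that `nc` is available inside the container:
   
   ```bash
   # Sample ZooKeeper's request latency, queue depth, and data size during a push.
   # Requires the 4lw commands to be whitelisted and nc inside the container.
   kubectl exec pinot-zookeeper-0 -- bash -c 'echo mntr | nc localhost 2181' \
     | grep -E 'zk_avg_latency|zk_max_latency|zk_outstanding_requests|zk_znode_count|zk_approximate_data_size'
   ```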
   
   ## Configuration
   
   **Ingestion Job Spec:**
   ```yaml
   jobType: SegmentCreationAndMetadataPush
   
   pushJobSpec:
     pushAttempts: 10
     pushRetryIntervalMillis: 2000
     pushFileNamePattern: 'glob:**post_metrics_OFFLINE_*_*_$FILE_PADDED_*.tar.gz'
   ```
   
   **Kubernetes Job:**
   - `completions: 100`
   - `parallelism: 100`
   - `completionMode: Indexed`
   
   Each worker processes 500 files sequentially, creating one segment per file.
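   
   For completeness, this is roughly how each indexed worker scopes its slice of the input before launching the ingestion job; the padding width, template file name, and use of `envsubst` below are simplified placeholders rather than our exact launcher:
   
   ```bash
   # Simplified sketch of the per-worker launcher. Indexed completion mode exposes
   # JOB_COMPLETION_INDEX; we pad it and substitute it into pushFileNamePattern.
   # File names, padding width, and paths here are placeholders.
   export FILE_PADDED=$(printf '%03d' "${JOB_COMPLETION_INDEX}")
   envsubst '${FILE_PADDED}' < ingestion-job-spec.template.yaml > /tmp/job-spec.yaml
   /opt/pinot/bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile /tmp/job-spec.yaml
   ```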
   
   ## Questions
   
   1. **Is this a known limitation** of single-ZK deployments with 
high-parallelism ingestion?
   
   2. **What is the recommended parallelism architecture** for metadata push 
operations to avoid ZK contention?
   
   Thank you!
   

