qswawrq opened a new issue, #17178:
URL: https://github.com/apache/pinot/issues/17178
# ZooKeeper Contention and Linear Performance Degradation with High-Parallelism Batch Ingestion
## Environment
- **Pinot Version**: apachepinot/pinot:latest
- **Deployment**: Kubernetes
- **Cluster Size**: 1 Controller, 1 ZooKeeper, 3 Brokers, 3 Servers
- **ZooKeeper Config**:
- `ZOO_SNAPCOUNT=100000`
- `ZOO_AUTOPURGE_INTERVAL=1`
- `ZOO_AUTOPURGE_RETAIN_COUNT=5`
- Heap: 1GB
- Storage: 20GB used
## Problem Description
We're experiencing severe ZooKeeper contention and linear performance
degradation during batch ingestion of 50,000 parquet files (each around 350 MB)
using 100 parallel Kubernetes Job workers with `jobType:
SegmentCreationAndMetadataPush`.
### Performance Degradation Timeline
| Time Elapsed | Files/Worker | Total Segments | Time per File | Degradation Factor |
|--------------|--------------|----------------|---------------|---------------------|
| Initial | 0-5 | 500 | 2 minutes | 1x (baseline) |
| +3 hours | 8-9 | 1,002 | 20 minutes | 10x |
| +30 hours | 43-44 | 4,438 | 88 minutes | 44x |
| +50 hours | 73-76 | 6,150 | 175 minutes | 87x |
**Performance continues to degrade linearly as segment count grows.**
**Critical Finding**: We tested with 2, 5, and 100 workers. Linear degradation occurs at **all parallelism levels**, indicating the bottleneck is not just concurrency, but the **O(n) cost of reading/writing the growing segment list**. From what we can see, the ingestion workers spend only a small fraction of their time building segments locally and roughly 99% of their time trying to upload metadata.
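To put that O(n) rewrite in perspective, here is a rough back-of-envelope sketch. The 200-byte per-segment entry size, and the assumption that every push rewrites the full segment list in ZooKeeper, are our guesses rather than measured values:

```java
// Rough estimate of cumulative bytes written to ZooKeeper if each metadata
// push rewrites a segment-list znode that grows with the number of segments
// already in the table. Entry size is an assumption, not a measurement.
public class ZkWriteEstimate {
  public static void main(String[] args) {
    long bytesPerSegmentEntry = 200;   // assumed size of one segment entry in the znode
    long totalSegments = 6_150;        // segments pushed after ~50 hours
    long cumulativeBytes = 0;
    for (long k = 1; k <= totalSegments; k++) {
      cumulativeBytes += k * bytesPerSegmentEntry;  // k-th push rewrites all k entries
    }
    // Prints ~3.8 GB, the same order of magnitude as the 5.4 GB ZK data
    // directory shown in the transaction-log section below.
    System.out.printf("~%.1f GB written to ZK transaction logs%n", cumulativeBytes / 1e9);
  }
}
```

If this assumption is roughly right, both the quadratic wall-clock growth and the multi-GB ZooKeeper transaction logs fall out of the same per-push rewrite cost.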
## Observed Behavior
### 1. ZooKeeper Metadata Update Failures (17% Error Rate)
Segment uploads frequently fail with optimistic locking errors. In the last
10,000 log lines, we observed **111 ZK version conflicts out of 661 upload
attempts** (16.8% failure rate):
```
2025/11/09 15:47:07.055 ERROR [ZKOperator] [jersey-server-managed-async-executor-128] Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, table: post_metrics_OFFLINE, expected version: 3428
2025/11/09 15:47:07.055 ERROR [PinotSegmentUploadDownloadRestletResource] [jersey-server-managed-async-executor-128] Exception while uploading segment: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, table: post_metrics_OFFLINE, expected version: 3428
java.lang.RuntimeException: Failed to update ZK metadata for segment: post_metrics_OFFLINE_20315_20344_000000021514_0, table: post_metrics_OFFLINE, expected version: 3428
    at org.apache.pinot.controller.api.upload.ZKOperator.processExistingSegment(ZKOperator.java:341)
    at org.apache.pinot.controller.api.upload.ZKOperator.completeSegmentOperations(ZKOperator.java:120)
    at org.apache.pinot.controller.api.resources.PinotSegmentUploadDownloadRestletResource.uploadSegment(PinotSegmentUploadDownloadRestletResource.java:433)
```
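For context on the conflicts above, the pattern appears to be classic ZooKeeper optimistic concurrency: each writer reads a znode along with its version, writes back with that expected version, and the write is rejected if another writer committed in between. The sketch below uses the plain ZooKeeper client API to illustrate the pattern; it is not Pinot's actual `ZKOperator` code, and the znode path and helper are hypothetical:

```java
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

// Minimal sketch of the optimistic-locking (compare-and-set) pattern behind
// "expected version" errors. Not Pinot's actual code; the path is made up.
public class CasUpdateSketch {
  static void updateWithRetry(ZooKeeper zk, String path, int maxAttempts) throws Exception {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      Stat stat = new Stat();
      byte[] current = zk.getData(path, false, stat);   // read payload + current version
      byte[] updated = modify(current);                  // apply this writer's change
      try {
        // Succeeds only if nobody else wrote the znode since our read.
        zk.setData(path, updated, stat.getVersion());
        return;
      } catch (KeeperException.BadVersionException e) {
        // Another concurrent writer won the race; re-read and try again.
        // With ~100 workers racing on the same metadata, a sizeable fraction
        // of attempts land here, consistent with the ~17% conflict rate above.
      }
    }
    throw new RuntimeException("Failed to update " + path + " after " + maxAttempts + " attempts");
  }

  static byte[] modify(byte[] current) {
    return current;  // placeholder for the real read-modify-write step
  }
}
```

Under this pattern, every conflict forces a re-read of a payload that keeps growing with the segment count, so retries get more expensive as ingestion progresses.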
### 2. Large ZooKeeper Transaction Logs
ZooKeeper transaction logs have grown to **multi-GB sizes**:
```bash
$ kubectl exec pinot-zookeeper-0 -- ls -lh /bitnami/zookeeper/data/version-2/
-rw-r--r-- 1 1001 1001 129M Nov 10 16:42 log.1c4167f
-rw-r--r-- 1 1001 1001 65M Nov 10 17:00 log.1c58844
-rw-r--r-- 1 1001 1001 1.0G Nov 10 17:28 log.1c78fc6
-rw-r--r-- 1 1001 1001 1.9G Nov 10 17:44 log.1c86b58
-rw-r--r-- 1 1001 1001 2.2G Nov 10 18:32 log.1cc3e55 ← Largest log
Total ZK data directory: 5.4GB
```
Despite autopurge being enabled (`ZOO_AUTOPURGE_INTERVAL=1`, `ZOO_SNAPCOUNT=100000`), individual transaction logs grow to 2.2GB before snapshots are taken. We are not sure whether this contributes to the performance degradation.
## Configuration
**Ingestion Job Spec:**
```yaml
jobType: SegmentCreationAndMetadataPush
pushJobSpec:
  pushAttempts: 10
  pushRetryIntervalMillis: 2000
  pushFileNamePattern: 'glob:**post_metrics_OFFLINE_*_*_$FILE_PADDED_*.tar.gz'
```
**Kubernetes Job:**
- `completions: 100`
- `parallelism: 100`
- `completionMode: Indexed`
Each worker processes 500 files sequentially, creating one segment per file.
## Questions
1. **Is this a known limitation** of single-ZK deployments with
high-parallelism ingestion?
2. **What is the recommended parallelism architecture** for metadata push
operations to avoid ZK contention?
Thank you!