tarun11Mavani opened a new pull request, #17126:
URL: https://github.com/apache/pinot/pull/17126
## Problem
`UpsertCompactionTask` consistently selects the same top segments with the
highest invalid record counts for compaction. When some of these top segments
encounter issues (corrupted data, processing failures, etc.), they can
block the entire compaction queue, preventing other segments from being
compacted and reducing overall system throughput.
We ran into a similar issue when some of the deep store copies were corrupted:
the generator kept selecting the same set of corrupted segments, which
stopped compaction entirely for a table.
## Solution
This PR introduces configurable randomization in segment selection to reduce
contention and improve compaction resilience:
- New configuration parameter: `segmentSelectionRandomizationFactor`
  (default: 2.0)
  - When set to 2.0, selects from the top (maxTasks × 2.0) segments as
    candidates, then randomly picks maxTasks from them
  - Values ≤ 1.0 disable randomization (deterministic behavior)
- Uses a reservoir sampling algorithm for O(n) time complexity instead of
  expensive shuffling operations
- Implements the randomization logic in `BaseTaskGenerator` as a reusable
  method that other task generators can call
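
The selection described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the class `SegmentSelectionSketch`, the `Candidate` type, and the `selectSegments` signature are all hypothetical names, and rounding the candidate pool size up with `ceil` is an assumption. It sorts by invalid record count, takes the top (maxTasks × factor) segments as the pool, then applies reservoir sampling (Algorithm R) to pick maxTasks of them in a single pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; names and signatures are assumptions, not Pinot APIs.
final class SegmentSelectionSketch {

  /** Hypothetical stand-in for a segment and its invalid record count. */
  static final class Candidate {
    public final String segmentName;
    public final long invalidRecordCount;

    Candidate(String segmentName, long invalidRecordCount) {
      this.segmentName = segmentName;
      this.invalidRecordCount = invalidRecordCount;
    }
  }

  static List<Candidate> selectSegments(List<Candidate> segments, int maxTasks,
      double randomizationFactor, Random random) {
    if (maxTasks <= 0) {
      return new ArrayList<>();
    }
    // Rank segments by invalid record count, highest first.
    List<Candidate> sorted = new ArrayList<>(segments);
    sorted.sort(Comparator.comparingLong((Candidate c) -> c.invalidRecordCount).reversed());

    // Factor <= 1.0 (or too few segments) falls back to deterministic top-N.
    if (randomizationFactor <= 1.0 || sorted.size() <= maxTasks) {
      return new ArrayList<>(sorted.subList(0, Math.min(maxTasks, sorted.size())));
    }

    // Candidate pool: top (maxTasks x factor) segments, capped at the list size.
    // Rounding up is an assumption made for this sketch.
    int poolSize = (int) Math.min(sorted.size(), Math.ceil(maxTasks * randomizationFactor));

    // Reservoir sampling (Algorithm R): fill the reservoir with the first
    // maxTasks candidates, then replace entries with decreasing probability.
    List<Candidate> reservoir = new ArrayList<>(sorted.subList(0, maxTasks));
    for (int i = maxTasks; i < poolSize; i++) {
      int j = random.nextInt(i + 1); // uniform in [0, i]
      if (j < maxTasks) {
        reservoir.set(j, sorted.get(i));
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Candidate> segments = new ArrayList<>();
    for (int i = 0; i < 10; i++) {
      segments.add(new Candidate("segment_" + i, i * 100L));
    }
    // With maxTasks = 3 and factor 2.0, picks 3 of the top 6 segments.
    for (Candidate c : selectSegments(segments, 3, 2.0, new Random())) {
      System.out.println(c.segmentName + " invalidRecords=" + c.invalidRecordCount);
    }
  }
}
```

With this shape, a segment that repeatedly fails is no longer guaranteed a slot in every run, so other segments in the candidate pool still get compacted.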