tarun11Mavani opened a new pull request, #17126:
URL: https://github.com/apache/pinot/pull/17126

   ## Problem
   `UpsertCompactionTask` consistently selects the same top segments with the 
highest invalid record counts for compaction. When some of these top segments 
encounter issues (corrupted data, processing failures, etc.), they can 
block the entire compaction queue, preventing other segments from being 
compacted and reducing overall system throughput.
   We ran into a similar issue when some of the deepstore copies were 
corrupted: the generator kept selecting the same set of corrupted segments, 
which stopped compaction entirely for a table.
   
   ## Solution
   This PR introduces configurable randomization in segment selection to reduce 
contention and improve compaction resilience:
    - New configuration parameter: `segmentSelectionRandomizationFactor` 
(default: 2.0)
    - When set to 2.0, the generator takes the top (`maxTasks` × 2.0) segments 
as candidates, then randomly picks `maxTasks` of them
    - Values ≤ 1.0 disable randomization (deterministic behavior)
    - Uses reservoir sampling for O(n) time complexity instead of 
expensive shuffling operations
    - Implements the randomization logic in `BaseTaskGenerator` as a reusable 
method for other task generators
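
   The selection described above can be sketched roughly as follows. This is 
an illustrative sketch only, not the code in this PR: the class name 
`SegmentSelectionSketch`, the method name `selectWithRandomization`, and its 
signature are hypothetical, and the input list is assumed to be already sorted 
by descending invalid record count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; not the actual BaseTaskGenerator method.
final class SegmentSelectionSketch {

  /**
   * Picks up to maxTasks items from the top (maxTasks * factor) entries of a
   * list that is assumed to be sorted by descending invalid record count.
   */
  static <T> List<T> selectWithRandomization(List<T> sortedCandidates, int maxTasks,
      double randomizationFactor, Random random) {
    if (maxTasks <= 0 || sortedCandidates.isEmpty()) {
      return new ArrayList<>();
    }
    int limit = Math.min(maxTasks, sortedCandidates.size());

    // A factor <= 1.0 disables randomization: keep the deterministic top-N.
    if (randomizationFactor <= 1.0) {
      return new ArrayList<>(sortedCandidates.subList(0, limit));
    }

    // Candidate pool: the top (maxTasks * factor) segments, capped at the list size.
    int poolSize = (int) Math.min(sortedCandidates.size(), Math.ceil(maxTasks * randomizationFactor));

    // Reservoir sampling (Algorithm R): seed the reservoir with the first
    // `limit` candidates, then let each later candidate replace a random slot
    // with probability limit / (i + 1). One pass over the pool.
    List<T> reservoir = new ArrayList<>(sortedCandidates.subList(0, limit));
    for (int i = limit; i < poolSize; i++) {
      int j = random.nextInt(i + 1);
      if (j < limit) {
        reservoir.set(j, sortedCandidates.get(i));
      }
    }
    return reservoir;
  }
}
```

   With `maxTasks = 3` and a factor of 2.0, each of the top 6 segments has the 
same chance of being picked, so a persistently failing segment in the top 3 no 
longer occupies a task slot on every run.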
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

