itschrispeck commented on code in PR #14217: URL: https://github.com/apache/pinot/pull/14217#discussion_r1800411604
##########
pinot-controller/src/main/java/org/apache/pinot/controller/validation/RealtimeSegmentValidationManager.java:
##########
@@ -169,6 +171,10 @@ private void runSegmentLevelValidation(TableConfig tableConfig, StreamConfig str
     if (_llcRealtimeSegmentManager.isDeepStoreLLCSegmentUploadRetryEnabled()) {
       _llcRealtimeSegmentManager.uploadToDeepStoreIfMissing(tableConfig, segmentsZKMetadata);
     }
+
+    if (_segmentAutoResetOnErrorAtValidation) {
+      _pinotHelixResourceManager.resetSegments(realtimeTableName, null, true);
+    }

Review Comment:
Adding a bit more background: in some of our largest clusters (>1M segments per ZK) we sporadically find segments in error state. Many of these can be fixed with a simple reset, and we would like to avoid operator intervention in these cases. One example case:

1. server 1 completes the build but fails to upload to deep store
2. server 2 is being restarted/upgraded/replaced; when it starts up, peer download fails
3. server 1 backfills deep store via the async upload task
4. server 2's segment needs to be reset to trigger a deep-store download and load the segment

We have tried increasing the deep store upload and peer download timeouts/retries, but this isn't a great solution for us since it introduces more delays into the ingestion path.
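
To make the example case concrete, here is a minimal toy model of the four steps. It is only a sketch: `DeepStore`, `Server`, `resetIfError` and the segment name are hypothetical and are not Pinot classes or APIs. It also assumes the reset only retries segments that are currently in error state.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model of the scenario; none of these types exist in Pinot.
class SegmentResetSketch {
  enum State { ONLINE, ERROR }

  /** Stand-in for deep store: the set of segment names uploaded so far. */
  static final class DeepStore {
    private final Set<String> _segments = new HashSet<>();

    void upload(String segment) {
      _segments.add(segment);
    }

    boolean contains(String segment) {
      return _segments.contains(segment);
    }
  }

  /** Stand-in for a server that loads from deep store, falling back to a peer. */
  static final class Server {
    private final Map<String, State> _states = new HashMap<>();

    // The load succeeds if either source can serve the segment; otherwise
    // the segment is left in ERROR until something retries it.
    State load(String segment, DeepStore deepStore, boolean peerDownloadWorks) {
      State state = (deepStore.contains(segment) || peerDownloadWorks) ? State.ONLINE : State.ERROR;
      _states.put(segment, state);
      return state;
    }

    // Assumption: a reset retries the load only for segments in ERROR.
    State resetIfError(String segment, DeepStore deepStore) {
      if (_states.get(segment) == State.ERROR) {
        return load(segment, deepStore, false);
      }
      return _states.get(segment);
    }

    State state(String segment) {
      return _states.get(segment);
    }
  }

  public static void main(String[] args) {
    String segment = "exampleSegment"; // hypothetical segment name
    DeepStore deepStore = new DeepStore();
    Server server2 = new Server();

    // Step 1: server 1 builds the segment, but its deep store upload fails,
    // so deep store stays empty.
    // Step 2: server 2 restarts; deep store has nothing and peer download fails.
    server2.load(segment, deepStore, false);
    // Step 3: server 1's async upload task backfills deep store.
    deepStore.upload(segment);
    // Nothing retries the load on server 2, so the segment is still in ERROR.
    System.out.println("before reset: " + server2.state(segment)); // before reset: ERROR
    // Step 4: the reset retries the load, which now succeeds from deep store.
    System.out.println("after reset: " + server2.resetIfError(segment, deepStore)); // after reset: ONLINE
  }
}
```

The point of the model is that between steps 3 and 4 the segment is recoverable but stays in error state until something retries the load, which is the gap the periodic reset is meant to close.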