itschrispeck commented on code in PR #14217:
URL: https://github.com/apache/pinot/pull/14217#discussion_r1800411604


##########
pinot-controller/src/main/java/org/apache/pinot/controller/validation/RealtimeSegmentValidationManager.java:
##########
@@ -169,6 +171,10 @@ private void runSegmentLevelValidation(TableConfig tableConfig, StreamConfig str
     if (_llcRealtimeSegmentManager.isDeepStoreLLCSegmentUploadRetryEnabled()) {
       _llcRealtimeSegmentManager.uploadToDeepStoreIfMissing(tableConfig, segmentsZKMetadata);
     }
+
+    if (_segmentAutoResetOnErrorAtValidation) {
+      _pinotHelixResourceManager.resetSegments(realtimeTableName, null, true);
+    }

Review Comment:
   Adding a bit more background: in some of our largest clusters (>1M segments 
per ZK), we sporadically find segments in ERROR state. Many of these can be 
fixed with a simple reset, and we would like to avoid operator intervention in 
these cases.
   
   One example case:
   1. Server 1 completes the segment build but fails to upload it to deep store.
   2. Server 2 is being restarted/upgraded/replaced; when it starts up, its peer 
download fails.
   3. Server 1 backfills deep store via the async upload task.
   4. Server 2's segment needs to be reset to trigger a deep store download and 
load the segment.
   
   We have tried increasing the deep store upload and peer download timeouts and 
retries, but this isn't a great solution for us because it introduces more 
delay into the ingestion path.
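
To make the example case concrete, here is a minimal, self-contained sketch of the kind of pass the validation task could run: scan a per-segment view of replica states and reset only the replicas in ERROR state. It is a toy model, not Pinot code. `ErrorSegmentResetSketch`, `SegmentState`, `SegmentResetter`, and the sample table, segment, and server names are all hypothetical, and the real `resetSegments` call in the diff may behave differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model only: every name in this class is hypothetical, not a Pinot API.
class ErrorSegmentResetSketch {
  enum SegmentState { ONLINE, CONSUMING, ERROR }

  /** Hypothetical stand-in for whatever component actually performs a reset. */
  interface SegmentResetter {
    void reset(String tableName, String segmentName, String server);
  }

  /**
   * Walks a segment -> (server -> state) view and resets every replica that is
   * in ERROR state. Returns the "segment@server" pairs that were reset.
   */
  static List<String> resetErrorSegments(String tableName,
      Map<String, Map<String, SegmentState>> view, SegmentResetter resetter) {
    List<String> resetReplicas = new ArrayList<>();
    for (Map.Entry<String, Map<String, SegmentState>> segment : view.entrySet()) {
      for (Map.Entry<String, SegmentState> replica : segment.getValue().entrySet()) {
        if (replica.getValue() == SegmentState.ERROR) {
          // Healthy replicas are left untouched; only ERROR replicas are reset.
          resetter.reset(tableName, segment.getKey(), replica.getKey());
          resetReplicas.add(segment.getKey() + "@" + replica.getKey());
        }
      }
    }
    return resetReplicas;
  }

  public static void main(String[] args) {
    // State after step 3 of the example: server1 is serving the segment,
    // server2 is stuck in ERROR because its peer download failed on startup.
    Map<String, SegmentState> replicas = new LinkedHashMap<>();
    replicas.put("server1", SegmentState.ONLINE);
    replicas.put("server2", SegmentState.ERROR);
    Map<String, Map<String, SegmentState>> view = new LinkedHashMap<>();
    view.put("exampleSegment", replicas);

    List<String> reset = resetErrorSegments("exampleTable_REALTIME", view,
        (table, segment, server) ->
            System.out.println("reset " + segment + " on " + server + " for " + table));
    System.out.println("replicas reset: " + reset);
  }
}
```

In this sketch the reset is scoped to replicas that are already in ERROR state, so a periodic run does not disturb healthy replicas. Whether the PR's call is scoped the same way depends on the semantics of its arguments, which this example does not assume.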



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

