ankitsultana opened a new issue, #14276:
URL: https://github.com/apache/pinot/issues/14276

   We often get error segments because one of the replicas is unable to download an online segment from a peer.
   
   Most of the time, we don't see a clear error message for this. The stack trace looks something like the following:
   
   ```
   org.apache.pinot.spi.utils.retry.AttemptsExceededException: Operation failed after 5 attempts
        at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:65)
        at org.apache.pinot.core.util.PeerServerSegmentFinder.getPeerServerURIs(PeerServerSegmentFinder.java:81)
        at org.apache.pinot.core.util.PeerServerSegmentFinder.getPeerServerURIs(PeerServerSegmentFinder.java:67)
        at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.lambda$downloadSegmentFromPeer$4(RealtimeTableDataManager.java:666)
        at org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.lambda$fetchSegmentToLocal$2(BaseSegmentFetcher.java:127)
        at org.apache.pinot.spi.utils.retry.BaseRetryPolicy.attempt(BaseRetryPolicy.java:50)
        at org.apache.pinot.common.utils.fetcher.BaseSegmentFetcher.fetchSegmentToLocal(BaseSegmentFetcher.java:126)
        at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadSegmentFromPeer(RealtimeTableDataManager.java:663)
        at org.apache.pinot.core.data.manager.realtime.RealtimeTableDataManager.downloadAndReplaceSegment(RealtimeTableDataManager.java:606)
        at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.downloadSegmentAndReplace(RealtimeSegmentDataManager.java:1294)
        at org.apache.pinot.core.data.manager.realtime.RealtimeSegmentDataManager.goOnlineFromConsuming(RealtimeSegmentDataManager.java:1233)
   ```
   
   The corresponding code, at the commit from which this build was made, is shown in the screenshot below. From an operational point of view, I think we need the following improvements here:
   
   1. If `onlineServers.isEmpty()` is still true after all attempts, the logs should clearly indicate that this was the reason the attempts were exhausted.
   2. There should be some way to log the last instance state map seen for this segment during the retries. This would help in knowing the exact external view (EV) the servers were seeing at the time of the failure.
   3. If an exception is thrown in `getOnlineServersFromExternalView`, that exception should be clearly logged. Maybe this is already happening?
   
   <img width="795" alt="image" src="https://github.com/user-attachments/assets/2476b3e6-8ade-4a33-b608-40d7898089b0">
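
To make the three improvements above concrete, here is a minimal, self-contained sketch of a retry loop that remembers the last instance state map and the last exception, and reports both when the attempts run out. It does not use the Pinot classes from the stack trace: `PeerLookupSketch`, `NoOnlinePeerException`, `findOnlineServers`, and the `Supplier` standing in for the external view lookup are all hypothetical names chosen for illustration, and backoff between attempts is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Supplier;

// Hypothetical sketch, not Pinot code: shows one way to surface why a peer lookup failed.
final class PeerLookupSketch {

  /** Hypothetical exception whose message carries the diagnostic context. */
  static final class NoOnlinePeerException extends Exception {
    NoOnlinePeerException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * Tries up to maxAttempts times to find servers whose state is ONLINE.
   * The supplier is an assumed stand-in for reading the segment's
   * instance state map (server name to state) from the external view.
   */
  static List<String> findOnlineServers(String segmentName, int maxAttempts,
      Supplier<Map<String, String>> instanceStateSupplier) throws NoOnlinePeerException {
    Map<String, String> lastSeen = null; // improvement 2: last state map observed
    RuntimeException lastError = null;   // improvement 3: last exception observed
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        Map<String, String> states = instanceStateSupplier.get();
        // Copy into a sorted map so the logged output is stable and readable.
        lastSeen = states == null ? null : new TreeMap<>(states);
        lastError = null;
        List<String> onlineServers = new ArrayList<>();
        if (lastSeen != null) {
          for (Map.Entry<String, String> entry : lastSeen.entrySet()) {
            if ("ONLINE".equals(entry.getValue())) {
              onlineServers.add(entry.getKey());
            }
          }
        }
        if (!onlineServers.isEmpty()) {
          return onlineServers;
        }
      } catch (RuntimeException e) {
        lastError = e;
      }
    }
    // Improvement 1: state explicitly why the attempts were exhausted.
    String reason = lastError != null
        ? "last attempt threw " + lastError
        : "no ONLINE replica in the instance state map";
    throw new NoOnlinePeerException("Could not find a peer for segment " + segmentName
        + " after " + maxAttempts + " attempts: " + reason
        + "; last instance state map seen: " + lastSeen, lastError);
  }
}
```

With something along these lines, the failure message itself would distinguish "no replica was ONLINE" from "the lookup threw", and would include the last state map seen, rather than only reporting that the operation failed after 5 attempts.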
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

