jadami10 opened a new issue, #10460:
URL: https://github.com/apache/pinot/issues/10460

   We were working through an incident today where we lost a number of realtime 
servers. When new servers came up, they were waiting up to 45 minutes to catch 
up before `/health` would return `OK`. Because they were taking up to 30 
minutes to catch up to realtime through consumption, we decided to force commit 
the consuming segments so they could just download the data from S3.
   
   For the most part, this worked great. But in one case, the server that was 
behind remained behind, and the other 2 healthy servers went from 10 seconds 
of lag to reporting that they were just as far behind as the lagging server.
   
   My hypothesis is that the segment commit protocol picked the 
unhealthy/lagging server to be the committer. I'm confident we saw the lag 
spike after the force commit, so there is definitely an issue here. But I will 
spend some time tomorrow validating the actual mechanism by which this happened.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

