jadami10 opened a new issue, #10460: URL: https://github.com/apache/pinot/issues/10460
We were working through an incident today in which we lost a number of realtime servers. When new servers came up, they waited up to 45 minutes to catch up before `/health` would return `OK`. Because they were taking up to 30 minutes to catch up to real time by consuming, we decided to force commit the consuming segments so the new servers could just download the data from S3.

For the most part, this worked great. But in one case, the server that was behind remained behind, and the other 2 healthy servers went from 10 seconds of lag to reporting that they were equally behind.

My hypothesis is that the segment commit protocol picked the unhealthy/lagging server to be the committer. I'm confident we saw the lag spike after the force commit, so there is definitely an issue here. But I will spend some time tomorrow validating the actual mechanism by which this happened.