jadami10 opened a new issue, #16565: URL: https://github.com/apache/pinot/issues/16565
We've been hit by a bug related to #14529 and external orchestration of Pinot restarts. Let's assume you have a table with 2 replica groups:

- External system sends SIGTERM to pinot-server-1
- pinot-server-1 sets `IS_SHUTDOWN_IN_PROGRESS`
- Pinot broker stops routing to pinot-server-1
- pinot-server-1 starts back up with `/health` not returning `OK`
- pinot-server-1 [startupServiceStatusCheck](https://github.com/apache/pinot/blob/642bf00501ef0cc0ddb79ade00b2eff695590ea0/pinot-server/src/main/java/org/apache/pinot/server/starter/helix/BaseServerStarter.java#L150) completes.
- *start of problem*: External system sees `/health` return `OK`
- *problem*: External system restarts pinot-server-2
- *problem*: Queries fail because `pinot-server-1` and `pinot-server-2` are both not serving queries
- `pinot-server-1` sets `IS_SHUTDOWN_IN_PROGRESS` to false
- Broker adds `pinot-server-1` back to the routing table, and queries succeed again

In our case, this caused ~17 seconds of downtime.

It's not clear how to orchestrate this correctly in Pinot. It seems you have to check the broker routing table for every table to ensure your server is found in there. But there's no clear API for "Is X server available for all necessary segments" or "is Y server going to cause downtime if I take it down". So if you're performing a rolling restart, you're kind of crossing your fingers that you wait long enough between replica group restarts.
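To make the "check the broker routing table for every table" workaround concrete, here is a minimal sketch of a gate an external orchestrator could run before restarting the next replica group. It is not an existing Pinot API. The names `tables_missing_server` and `wait_until_routable`, the `fetch_routing` callback, and the snapshot shape `{table: {server_instance: [segments]}}` are all assumptions made for illustration. How the snapshot is actually obtained from the brokers is left to the caller, and it should only cover tables the server is expected to serve.

```python
import time
from typing import Callable, Dict, List

# Assumed snapshot shape (hypothetical, not a documented Pinot response):
# {table_name_with_type: {server_instance: [segment_name, ...]}}
RoutingTable = Dict[str, Dict[str, List[str]]]


def tables_missing_server(routing: RoutingTable, server: str) -> List[str]:
    """Return the tables whose routing entries route no segments to `server`."""
    return sorted(
        table for table, servers in routing.items()
        if not servers.get(server)
    )


def wait_until_routable(
    fetch_routing: Callable[[], RoutingTable],
    server: str,
    timeout_s: float = 300.0,
    poll_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until `server` appears in routing for every table, or time out.

    `fetch_routing` is supplied by the caller and must return a snapshot for
    the tables this server is expected to serve. An empty snapshot is treated
    as "not ready" so a failed or empty fetch never unblocks the next restart.
    """
    deadline = clock() + timeout_s
    while True:
        routing = fetch_routing()
        if routing and not tables_missing_server(routing, server):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

An orchestrator would call `wait_until_routable(...)` for pinot-server-1 after `/health` returns `OK`, and only restart pinot-server-2 once it returns `True`. This replaces the fixed wait between replica group restarts with an explicit check.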
