rhodo opened a new pull request, #15722: URL: https://github.com/apache/pinot/pull/15722
During large table rebalances, a massive number of state transitions may be triggered. If a server cannot keep up, its Helix message queue can grow significantly. This PR adds visibility into the server-side Helix message queue size.

Some rationale:

- This PR delegates responsibility to each server instance to monitor and log its own message queue size metric, instead of relying on the controller.
- It decouples the getHelixServerMessageCount() method from the metrics scraping thread. This ensures that:
  - The frequency of metrics scraping does not introduce additional I/O pressure on ZooKeeper.
  - ZooKeeper I/O latency does not interfere with the metrics scraping process.

## Test

In quickstart, I triggered a segment reload while intentionally blocking the segment reload handler on the server, and observed the queue size go from 0 to 1. After letting the segment reload go through, the metric went back to 0.
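
To illustrate the decoupling described in the rationale, here is a minimal sketch and not the code in this PR. It assumes a hypothetical `CachedQueueSizeGauge` helper: a background task periodically calls a supplier (standing in for getHelixServerMessageCount(), which reads from ZooKeeper) and stores the result, while the metrics scraping thread only reads the cached value. The class name, thread name, and refresh period are all assumptions made for illustration.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

/** Hypothetical helper for illustration; not a class from the Pinot codebase. */
final class CachedQueueSizeGauge {
  // Stand-in for the ZooKeeper-backed call, e.g. getHelixServerMessageCount().
  private final IntSupplier _fetcher;
  // Last value successfully fetched; this is all the scraping thread ever reads.
  private final AtomicInteger _cached = new AtomicInteger(0);

  CachedQueueSizeGauge(IntSupplier fetcher) {
    _fetcher = fetcher;
  }

  /** Performs the (potentially slow) fetch and updates the cached value. */
  void refresh() {
    try {
      _cached.set(_fetcher.getAsInt());
    } catch (RuntimeException e) {
      // Keep the last known value if the fetch fails; a real implementation
      // would likely log the failure.
    }
  }

  /** Cheap, non-blocking read intended for the metrics scraping thread. */
  int getCachedValue() {
    return _cached.get();
  }

  /** Starts a daemon thread that refreshes the cached value at a fixed delay. */
  ScheduledExecutorService start(long periodSeconds) {
    ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor(r -> {
      Thread t = new Thread(r, "helix-message-count-refresher");
      t.setDaemon(true);
      return t;
    });
    executor.scheduleWithFixedDelay(this::refresh, 0, periodSeconds, TimeUnit.SECONDS);
    return executor;
  }
}
```

With this shape, how often the gauge is scraped has no effect on how often ZooKeeper is read, and a slow ZooKeeper read delays only the refresher thread. The trade-off is that the reported value can be stale by up to one refresh period.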