dang-stripe opened a new issue, #10787:
URL: https://github.com/apache/pinot/issues/10787

   We've observed query failures with error code 425 during rolling restarts
on a relatively low-QPS cluster. Looking at the logs, we noticed that the
server shut down before the broker finished processing the routing table
update. The server does not appear to wait the full
`pinot.server.shutdown.noQueryThresholdMs` before shutting down the process.
   
   ```
   # server begins shutdown
   [2023-05-18 05:44:52.728337] INFO [BaseServerStarter] [Thread-41:17] Shutting down Pinot server
   [2023-05-18 05:44:52.747490] INFO [BaseServerStarter] [Thread-41:17] Sleep for 4608ms as there are still incoming queries (no query time: 10392ms is smaller than the threshold: 15000ms)

   # broker receives signal to remove server from routing table
   [2023-05-18 05:44:52.817685] INFO [BrokerRoutingManager] [ClusterChangeHandlingThread:25] Removing entry for server=Server1, table=Table1

   # server stops quiescing after 4.6s
   [2023-05-18 05:44:57.355546] INFO [BaseServerStarter] [Thread-41:17] No query received within 15000ms (larger than the threshold: 15000ms), mark it as no incoming queries
   [2023-05-18 05:44:57.355592] INFO [BaseServerStarter] [Thread-41:17] Finished draining queries after 4608ms

   # roughly the time when broker starts query
   [2023-05-18 05:45:00.671645] Caused by: java.net.ConnectException: Connection refused
   [2023-05-18 05:45:00.671634] org.apache.pinot.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: Server1/10.20.30.40:8098
   [2023-05-18 05:45:00.671597] ERROR [QueryRouter] [jersey-server-managed-async-executor-788:25] Caught exception while sending request 55024 to server: Server1, marking query failed
   [2023-05-18 05:45:00.723279] INFO [QueryLogger] [jersey-server-managed-async-executor-788:25] requestId=55024,table=Table1,timeMs=490

   # broker finishes processing routing table change
   [2023-05-18 05:45:00.944494] INFO [BrokerRoutingManager] [ClusterChangeHandlingThread:25] Processed instance config change in 191ms (fetch 1040 instance configs: 68ms, calculate changed servers: 2ms, update 4 routing entries: 121ms), new enabled servers: [], new disabled servers: [Server1], excluded servers: [Server1]
   ```
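
To make the timing in the log above easier to follow, here is a small Python sketch that only redoes the arithmetic on the logged timestamps. It is not Pinot code. The variable names are made up for illustration, and the reading that the "no query time" is counted from the last query rather than from the start of shutdown is an inference from the log message (10392ms + 4608ms = 15000ms), not something confirmed from the source.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S.%f"


def ts(text):
    """Parse a timestamp in the format used by the log excerpt."""
    return datetime.strptime(text, FMT)


def to_ms(delta):
    """Convert a timedelta to milliseconds as a float."""
    return delta / timedelta(milliseconds=1)


# Values copied from the log excerpt (names are illustrative only).
threshold_ms = 15000             # pinot.server.shutdown.noQueryThresholdMs
no_query_ms_at_shutdown = 10392  # "no query time" reported at shutdown start

shutdown_start = ts("2023-05-18 05:44:52.728337")
drain_finished = ts("2023-05-18 05:44:57.355592")
query_failed = ts("2023-05-18 05:45:00.671597")
routing_updated = ts("2023-05-18 05:45:00.944494")

# Assumption: the sleep is the remainder of the threshold after subtracting
# the idle time that had already elapsed before shutdown began.
sleep_ms = threshold_ms - no_query_ms_at_shutdown

# How long the server actually waited after shutdown started (about 4.6s).
drain_wait_ms = to_ms(drain_finished - shutdown_start)

# How long the broker took to finish the routing update (about 8.2s).
broker_lag_ms = to_ms(routing_updated - shutdown_start)

# Window in which the broker could still route to the stopped server.
exposure_ms = to_ms(routing_updated - drain_finished)

# The failed request falls inside that window.
failed_in_window = drain_finished < query_failed < routing_updated

print(f"sleep computed from log message: {sleep_ms} ms")
print(f"server drain wait:               {drain_wait_ms:.0f} ms")
print(f"broker routing update lag:       {broker_lag_ms:.0f} ms")
print(f"exposure window:                 {exposure_ms:.0f} ms")
print(f"failed query inside window:      {failed_in_window}")
```

Under that reading, on a low-QPS cluster most of the 15000ms threshold can already be used up by idle time before shutdown starts, so the wait after shutdown (about 4.6s here) can be shorter than the time the broker needs to apply the routing change (about 8.2s here).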


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

