ankitsultana opened a new issue, #15997:
URL: https://github.com/apache/pinot/issues/15997

   **Note:** Creating this ticket for tracking. Some existing solutions might 
already suffice, but I want to ideally find a path forward where we don't have 
to turn any config or tune any knob to get resilience from basic server 
operations like restarts.
   
   While benchmarking MSE Lite Mode's resilience to server restarts, I found 
that even with Lite Mode, where only the leaf stages are run in the servers and 
everything else is run in the brokers, we were seeing high query failure rate 
during restart. The load used to test this was around 50 QPS with query 
latencies of ~50ms.
   
   You can see the attached graph below. The first wave of queries is with the 
regular MSE, and the second wave of queries is using the Lite Mode. While Lite 
Mode has a much lower failure rate than regular MSE, the failure rate is still 
quite high, and ideally we want to aim for zero failures due to server restarts.
   
   This experiment was run without any failure detector configured. See the 
section at the bottom.
   
   <img width="1097" alt="Image" 
src="https://github.com/user-attachments/assets/eb33e930-8c89-4ce3-97e6-ae601fa5cf1b";
 />
   
   # Errors Seen
   
   All the errors I saw were connection refused errors coming from the 
`QueryDispatcher`.
   
   ```
   Query processing exceptions: {200=QueryExecutionError: Error dispatching 
query: 1048745684000196581 to server: some-host@{port1,port2} 
org.apache.pinot.query.service.dispatch.QueryDispatcher.execute(QueryDispatcher.java:312)
 
org.apache.pinot.query.service.dispatch.QueryDispatcher.submit(QueryDispatcher.java:212)
 
org.apache.pinot.query.service.dispatch.QueryDispatcher.submitAndReduce(QueryDispatcher.java:149)
 
org.apache.pinot.broker.requesthandler.MultiStageBrokerRequestHandler.query(MultiStageBrokerRequestHandler.java:427)
 UNAVAILABLE: io exception io.grpc.Status.asRuntimeException(Status.java:535) 
io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:487)
 io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:563) 
io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70) 
finishConnect(..) failed: Connection refused: some-host/111111:12345 
finishConnect(..) failed: Connection refused 
io.grpc.netty.shaded.io.netty.channel.unix.Err
 ors.newConnectException0(Errors.java:155) 
io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
 
io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
 
io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
 finishConnect(..) failed: Connection refused 
io.grpc.netty.shaded.io.netty.channel.unix.Errors.newConnectException0(Errors.java:155)
 
io.grpc.netty.shaded.io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128)
 
io.grpc.netty.shaded.io.netty.channel.unix.Socket.finishConnect(Socket.java:321)
 
io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710)
 }
   ```
   
   # On Failure Detector
   
   Recent work by @yashmayya to add Failure Detector is specifically to tackle 
this scenario (I suppose), where exceptions from failed query dispatch are 
caught and fed into the routing manager logic, to exclude servers that are 
failing query dispatch temporarily with custom backoff and retry strategies.
   
   We'll be retrying the experiment with the Connection Based failure detector 
soon, but given this feature seems crucial to support basic reliability 
guarantees from the MSE, I think we should make this the default as soon as 
possible.
   
   # Related
   
   I had assumed that I wouldn't need anything like Failure Detector to get 
restart resilience. I'll look into the server start-up/shutdown process again 
to see if we have misconfigured something on our end.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org
For additional commands, e-mail: commits-h...@pinot.apache.org

Reply via email to