siddharthteotia commented on issue #8618:
URL: https://github.com/apache/pinot/issues/8618#issuecomment-1127815429

   We have seen the following repeatedly in production:
   
   - A server slows down:
   -- because the queries already running there are consuming a lot of CPU / GC,
   -- or because the server happens to be running a lot of expensive queries,
   -- or because something else on the server is hogging CPU.
   -- In one case, queries were simply waiting in schedQueue while the BPF 
profiler was taking up CPU.
   -- In short, the latencies observed from that server spike.
   
   We mitigated this by temporarily disabling and then re-enabling the 
instance, which forced the broker to route queries to other replicas. This 
helped us recover from the latency spikes coming from the slow server. This 
has happened many times during oncall, and we almost always circle back to the 
same conclusion: "if there were single node/server resiliency, some of this 
could have been avoided, or there would have been fewer alerts and the latency 
spikes would have been less severe."
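
   As a rough illustration of what automating that manual mitigation could 
look like, here is a minimal sketch in which a broker keeps a smoothed latency 
per server and prefers the faster replica. The class name, the smoothing 
factor, and the server names are all hypothetical; this is not Pinot's API and 
not necessarily the approach taken in the design doc.

```python
from dataclasses import dataclass, field


@dataclass
class ReplicaStats:
    """Hypothetical per-server latency tracker kept by the broker."""
    alpha: float = 0.5  # EWMA smoothing factor (assumed value)
    ewma_ms: dict = field(default_factory=dict)

    def record(self, server: str, latency_ms: float) -> None:
        # Blend the new observation into the running average for this server.
        prev = self.ewma_ms.get(server, latency_ms)
        self.ewma_ms[server] = self.alpha * latency_ms + (1 - self.alpha) * prev

    def pick(self, replicas: list) -> str:
        # Prefer the replica with the lowest smoothed latency; servers with
        # no observations yet count as 0 ms so they still receive traffic.
        return min(replicas, key=lambda s: self.ewma_ms.get(s, 0.0))


stats = ReplicaStats()
for latency in (20, 25, 900, 1200):  # server-1 starts to slow down
    stats.record("server-1", latency)
for latency in (22, 24, 23, 21):     # server-2 stays healthy
    stats.record("server-2", latency)

chosen = stats.pick(["server-1", "server-2"])
print(chosen)
```

   With these made-up latencies the sketch routes to server-2, which is the 
same outcome we got by disabling the slow instance by hand.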
   
   We have also seen the following (this is the 2nd sub-problem outlined in the 
issue):
   
   - A slow query on the server takes up CPU and delays the other queued 
queries that could have finished faster.
   - Eventually all queries (whether expensive or inexpensive) take a long 
time.
   - There is no work stealing or scheduling: there is no way to forestall a 
query based on priority / weight, "make slow queries run slower by 
scheduling them late / giving them fewer cores", and so reduce the cascading 
impact of expensive queries on other queries.
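
   To make the "schedule expensive queries late" idea concrete, here is a 
minimal sketch of a weight-ordered queue. The class, the query names, and the 
`estimated_cost` values are hypothetical and only illustrate the concept; this 
is not an existing Pinot scheduler.

```python
import heapq
import itertools


class WeightedQueryQueue:
    """Hypothetical scheduler queue: cheaper queries are dequeued first."""

    def __init__(self):
        self._heap = []
        # Sequence number breaks ties so equal-cost queries stay in FIFO order.
        self._seq = itertools.count()

    def submit(self, query_id: str, estimated_cost: float) -> None:
        heapq.heappush(self._heap, (estimated_cost, next(self._seq), query_id))

    def next_query(self) -> str:
        # Pop the query with the lowest estimated cost.
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = WeightedQueryQueue()
queue.submit("full-scan-groupby", estimated_cost=500.0)
queue.submit("point-lookup-a", estimated_cost=1.0)
queue.submit("point-lookup-b", estimated_cost=1.0)

order = [queue.next_query() for _ in range(len(queue))]
print(order)
```

   Here the two cheap lookups run before the expensive scan even though the 
scan was submitted first. A real scheduler would also need some form of aging 
so that expensive queries are delayed rather than starved.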
   
   @vvivekiyer - please add any other notes on the background of the problem 
from the concrete production examples.
   
   The current design doc addresses only the first problem for now. We have an 
implementation in progress that we plan to test this week, and we will also 
share the PR and discuss feedback on the approach.
   
   PS - @vvivekiyer and I had synced up with @Jackie-Jiang some time back to 
say that we are working on it.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

