siddharthteotia commented on issue #8618: URL: https://github.com/apache/pinot/issues/8618#issuecomment-1127815429
We have seen the following repeatedly in production:

- A server slows down:
  -- because the queries already running there are consuming a lot of CPU / GC,
  -- or because the server really is running a lot of expensive queries,
  -- or because something else on the server is hogging CPU.
  -- In one case, queries were simply waiting in schedQueue while the BPF profiler was taking up CPU.
  -- Basically, the latencies observed from that server spike.

We mitigated this by temporarily disabling and then re-enabling the instance, which forced the broker to route queries to other replicas. This helped us recover from the latency spikes that were coming from the slow server.

This has happened many times during oncall, and we almost always circle back to the same conclusion: "if there were single node/server resiliency, some of this could have been avoided, or there would have been fewer alerts and the latency spikes would have been less severe."

We have also seen the following (this is the 2nd sub-problem outlined in the issue):

- A slow query on the server takes up CPU and compromises the other queued-up queries that could have finished faster.
- Eventually all queries (whether expensive or inexpensive) take a long time.
- There is no work stealing or scheduling: there is no way to hold back a query based on priority / weight, i.e. to "make slow queries run slower by scheduling them late / giving them fewer cores", and so reduce the cascading impact of expensive queries on other queries.

@vvivekiyer - please add other notes on the background of the problem from the concrete production examples.

The current design doc addresses only the first problem as of now. We have an implementation in progress that we are planning to test this week, after which we will share the PR and discuss feedback on the approach.

PS - @vvivekiyer and I had synced up with @Jackie-Jiang some time back to say that we are working on it.
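To make the first problem more concrete, here is a minimal sketch of what automating the manual "disable the slow instance" step could look like on the broker side. It is only an illustration under stated assumptions: `ReplicaHealth`, `record`, `healthy_replicas`, the window size and the `slow_factor` threshold are all hypothetical names and values, not Pinot APIs and not necessarily the approach taken in the design doc.

```python
from collections import defaultdict, deque
from statistics import median


class ReplicaHealth:
    """Hypothetical broker-side helper (not a Pinot API).

    Keeps a sliding window of observed latencies per server and filters
    out servers that are much slower than the median of their peers.
    """

    def __init__(self, window=20, slow_factor=3.0):
        # window: number of recent latency samples kept per server (assumed value)
        # slow_factor: a server is "slow" if its average latency exceeds
        #              slow_factor * median average of all replicas (assumed value)
        self.slow_factor = slow_factor
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, server, latency_ms):
        """Record the latency the broker observed for one response."""
        self.samples[server].append(latency_ms)

    def _average(self, server):
        window = self.samples[server]
        return sum(window) / len(window) if window else 0.0

    def healthy_replicas(self, replicas):
        """Return the replicas that should keep receiving queries."""
        averages = {s: self._average(s) for s in replicas}
        baseline = median(averages.values())
        if baseline == 0:
            # No latency data yet: route to everyone.
            return list(replicas)
        healthy = [s for s in replicas
                   if averages[s] <= self.slow_factor * baseline]
        # Never return an empty list; fall back to all replicas.
        return healthy or list(replicas)


health = ReplicaHealth()
for _ in range(5):
    health.record("server-1", 10)
    health.record("server-2", 12)
    health.record("server-3", 200)  # the slow server

routable = health.healthy_replicas(["server-1", "server-2", "server-3"])
```

In this example `server-3` is dropped from routing while its recent latencies stay far above its peers, and it becomes routable again once newer, faster samples push the old ones out of the window. A median baseline is weak with only two replicas, so a real implementation would need a more careful signal.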