Shawn/Emir - it's the Java heap space issue. I can see in GCViewer a sudden jump in heap utilization, then Full GC lines, and finally the OOM killer script killing Solr.
What I wonder is: if it is a retry from the coordinating node that causes this OOM query to spread to the next set of replicas, then how can we tune or change this behavior? Otherwise, even though we have a replication factor > 1, HA is still not guaranteed in this situation, which defeats the purpose... If we can't control this retry by the coordinating node, then I would say we have something fundamentally wrong. I know "timeAllowed" may save us in some of these scenarios, but if the OOM happens before "timeAllowed" plus the extra time it takes to really kill the query, we still have the issue. Any thoughts on how one can provide HA in these situations?
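For concreteness, this is roughly what I mean by relying on "timeAllowed" - a minimal SolrJ sketch, not our actual code. The ZooKeeper hosts, collection name, field, and deep offset are made-up placeholders, and it assumes the SolrJ 6.x CloudSolrClient.Builder API:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class TimeAllowedSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder ZK ensemble and collection name - adjust to the real cluster.
        try (CloudSolrClient client = new CloudSolrClient.Builder()
                .withZkHost("zk1:2181,zk2:2181,zk3:2181")
                .build()) {
            client.setDefaultCollection("testcollection");

            SolrQuery q = new SolrQuery("title_s:foo*"); // wildcard query (placeholder field)
            q.setStart(1_000_000);                       // deep offset like the test query
            q.setRows(100);
            // Tell Solr to stop searching after 5 seconds and return whatever it has.
            // This bounds the search phase only; it cannot cancel a request that has
            // already allocated enough memory to push the heap over the edge.
            q.setTimeAllowed(5000);

            QueryResponse rsp = client.query(q);
            System.out.println("partialResults=" + rsp.getHeader().get("partialResults")
                    + ", numFound=" + rsp.getResults().getNumFound());
        }
    }
}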
Thanks,
Susheel

On Mon, Dec 18, 2017 at 9:53 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:

> Ah, I misunderstood your usecase - it is not node that receives query that
> OOMs but nodes that are included in distributed queries are the one that
> OOMs. I would also say that it is expected because queries to particular
> shards fails and coordinating node retries using other replicas causing all
> replicas to fail. I did not check the code, but I would expect to have some
> sort of retry mechanism in place.
>
> HTH,
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
> > On 18 Dec 2017, at 15:36, Susheel Kumar <susheel2...@gmail.com> wrote:
> >
> > Yes, Emir. If I repeat the query, it will spread to other nodes but that's
> > not the case. This is my test env and i am deliberately executing the
> > query with very high offset and wildcard to cause OOM but executing only
> > one time.
> >
> > So it shouldn't spread to other replica sets and at the end of my test,
> > the first 6 shard/replica set's which gets hit should go down while other 6
> > should survive but that's not what I see at the end.
> >
> > Setup : 400+ million docs, JVM is 12GB. Yes, only one collection. Total
> > 12 machines with 6 shards and 6 replica's (replicationFactor = 2)
> >
> > On Mon, Dec 18, 2017 at 9:22 AM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
> >
> >> Hi Susheel,
> >> The fact that only node that received query OOM tells that it is about
> >> merging results from all shards and providing final result. It is expected
> >> that repeating the same query on some other node will result in a similar
> >> behaviour - it just mean that Solr does not have enough memory to execute
> >> this heavy query.
> >> Can you share more details on your test: size of collection, type of
> >> query, expected number of results, JVM settings, is that the only
> >> collection on cluster etc.
> >>
> >> Thanks,
> >> Emir
> >> --
> >> Monitoring - Log Management - Alerting - Anomaly Detection
> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>
> >>> On 18 Dec 2017, at 15:07, Susheel Kumar <susheel2...@gmail.com> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I was testing Solr to see if a query which would cause OOM and would limit
> >>> the OOM issue to only the replica set's which gets hit first.
> >>>
> >>> But the behavior I see that after all set of first replica's went down due
> >>> to OOM (gone on cloud view) other replica's starts also getting down. Total
> >>> 6 shards I have with each shard having 2 replica's and on separate machines
> >>>
> >>> The expected behavior is that all shards replica which gets hit first
> >>> should go down due to OOM and then other replica's should survive and
> >>> provide High Availability.
> >>>
> >>> The setup I am testing with is Solr 6.0 and wondering if this would
> >>> remain same with 6.6 or there has been some known improvements made to
> >>> avoid spreading OOM to second/third set of replica's and causing whole
> >>> cluster to down.
> >>>
> >>> Any info on this is appreciated.
> >>>
> >>> Thanks,
> >>> Susheel