Ah, I misunderstood your use case - it is not the node that receives the query that OOMs, but the nodes that take part in the distributed query. I would say that this is also expected: when the request to a particular shard fails, the coordinating node retries it on another replica of that shard, so all replicas end up failing. I did not check the code, but I would expect some sort of retry mechanism to be in place.
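To make the suspected effect concrete, here is a toy Python model. It is not Solr code: the function `simulate`, the `retry` flag and the replica names are all made up for illustration, and it simply assumes the coordinator tries the next replica of a shard whenever the previous one dies.

```python
# Toy model of the suspected retry behaviour; NOT Solr code.
# Assumption: when a shard request kills a replica, the coordinator
# retries the same request on the next replica of that shard.

def simulate(shards, query_kills_replica, retry=True):
    """Return the list of replicas that go down for a single query.

    shards: dict mapping shard name -> list of replica names
    query_kills_replica: True if the query is heavy enough to OOM a replica
    retry: whether the coordinator retries on the remaining replicas
    """
    down = []
    for shard, replicas in shards.items():
        # Without retries only the first replica of each shard is contacted.
        candidates = replicas if retry else replicas[:1]
        for replica in candidates:
            if query_kills_replica:
                down.append(replica)  # this replica OOMs, try the next one
            else:
                break  # replica answered, no need to contact another one
    return down


# 6 shards with 2 replicas each, as in the test setup described in this thread.
shards = {
    f"shard{i}": [f"shard{i}_replica1", f"shard{i}_replica2"]
    for i in range(1, 7)
}

with_retry = simulate(shards, query_kills_replica=True, retry=True)
without_retry = simulate(shards, query_kills_replica=True, retry=False)
print(len(with_retry), len(without_retry))
```

With retries the single query takes down all 12 replicas in this model, while without retries only the first replica of each of the 6 shards goes down, which is the behaviour you were expecting.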
HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 18 Dec 2017, at 15:36, Susheel Kumar <susheel2...@gmail.com> wrote:
>
> Yes, Emir. If I repeat the query, it will spread to other nodes but that's
> not the case. This is my test env and i am deliberately executing the
> query with very high offset and wildcard to cause OOM but executing only
> one time.
>
> So it shouldn't spread to other replica sets and at the end of my test,
> the first 6 shard/replica set's which gets hit should go down while other 6
> should survive but that's not what I see at the end.
>
> Setup : 400+ million docs, JVM is 12GB. Yes, only one collection. Total
> 12 machines with 6 shards and 6 replica's (replicationFactor = 2)
>
> On Mon, Dec 18, 2017 at 9:22 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
>
>> Hi Susheel,
>> The fact that only node that received query OOM tells that it is about
>> merging results from all shards and providing final result. It is expected
>> that repeating the same query on some other node will result in a similar
>> behaviour - it just mean that Solr does not have enough memory to execute
>> this heavy query.
>> Can you share more details on your test: size of collection, type of
>> query, expected number of results, JVM settings, is that the only
>> collection on cluster etc.
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>> On 18 Dec 2017, at 15:07, Susheel Kumar <susheel2...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I was testing Solr to see if a query which would cause OOM and would limit
>>> the OOM issue to only the replica set's which gets hit first.
>>>
>>> But the behavior I see that after all set of first replica's went down due
>>> to OOM (gone on cloud view) other replica's starts also getting down. Total
>>> 6 shards I have with each shard having 2 replica's and on separate machines
>>>
>>> The expected behavior is that all shards replica which gets hit first
>>> should go down due to OOM and then other replica's should survive and
>>> provide High Availability.
>>>
>>> The setup I am testing with is Solr 6.0 and wondering if this is would
>>> remain same with 6.6 or there has been some known improvements made to
>>> avoid spreading OOM to second/third set of replica's and causing whole
>>> cluster to down.
>>>
>>> Any info on this is appreciated.
>>>
>>> Thanks,
>>> Susheel