Ah, I misunderstood your use case - it is not the node that receives the
query that OOMs, but the nodes that take part in the distributed query. I
would say that this is also expected: the requests to particular shards fail,
and the coordinating node retries them on the other replicas, so all replicas
end up failing. I did not check the code, but I would expect some sort of
retry mechanism to be in place.
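
To illustrate the suspected cascade, here is a toy Python model. It is not Solr code: the retry-on-the-next-replica behaviour is the unverified assumption described above, and every name in it (Replica, query_shard, build_cluster, run_once, cost_mb) is hypothetical. The 6 shards, 2 replicas per shard and 12GB figures come from the setup described in this thread; the 20GB per-shard cost is made up purely to force a failure.

```python
# Toy model only: not Solr code. All names here are hypothetical.

class Replica:
    def __init__(self, name, heap_mb):
        self.name = name
        self.heap_mb = heap_mb
        self.alive = True

    def execute(self, cost_mb):
        # A replica that cannot hold the per-shard work dies, as with an OOM.
        if cost_mb > self.heap_mb:
            self.alive = False
            raise MemoryError(self.name)
        return "results from " + self.name


def query_shard(replicas, cost_mb):
    # Assumed coordinator behaviour: on failure, try the next live replica.
    for replica in replicas:
        if not replica.alive:
            continue
        try:
            return replica.execute(cost_mb)
        except MemoryError:
            continue
    return None  # every replica of this shard is down


def build_cluster(shards=6, replication_factor=2, heap_mb=12 * 1024):
    # One list of replicas per shard, mirroring the setup in this thread.
    return [
        [Replica(f"shard{s}_replica{r}", heap_mb)
         for r in range(1, replication_factor + 1)]
        for s in range(1, shards + 1)
    ]


def run_once(cluster, cost_mb):
    # A single distributed query fans out to every shard exactly once.
    results = [query_shard(replicas, cost_mb) for replicas in cluster]
    survivors = sum(r.alive for replicas in cluster for r in replicas)
    return results, survivors


if __name__ == "__main__":
    cluster = build_cluster()
    results, survivors = run_once(cluster, cost_mb=20 * 1024)
    print("replicas still alive:", survivors)
```

In this model a single execution of the query is enough to take down both replicas of every shard, because each retry carries the same per-shard cost to a replica with the same heap size.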

HTH,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/



> On 18 Dec 2017, at 15:36, Susheel Kumar <susheel2...@gmail.com> wrote:
> 
> Yes, Emir.  If I repeat the query, it will spread to other nodes, but that's
> not the case here.  This is my test env and I am deliberately executing the
> query with a very high offset and a wildcard to cause OOM, but executing it
> only once.
> 
> So it shouldn't spread to the other replica set, and at the end of my test
> the first 6 shard replicas which get hit should go down while the other 6
> should survive, but that's not what I see at the end.
> 
> Setup:  400+ million docs, JVM is 12GB.  Yes, only one collection. Total
> 12 machines with 6 shards and 6 replicas (replicationFactor = 2)
> 
> On Mon, Dec 18, 2017 at 9:22 AM, Emir Arnautović <
> emir.arnauto...@sematext.com> wrote:
> 
>> Hi Susheel,
>> The fact that only the node that received the query OOMs tells us that it
>> is about merging results from all shards and producing the final result.
>> It is expected that repeating the same query on some other node will
>> result in similar behaviour - it just means that Solr does not have enough
>> memory to execute this heavy query.
>> Can you share more details on your test: size of collection, type of
>> query, expected number of results, JVM settings, whether that is the only
>> collection on the cluster, etc.
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>>> On 18 Dec 2017, at 15:07, Susheel Kumar <susheel2...@gmail.com> wrote:
>>> 
>>> Hello,
>>> 
>>> I was testing Solr to see if a query which would cause OOM would limit
>>> the OOM issue to only the replica set which gets hit first.
>>> 
>>> But the behavior I see is that after the whole first set of replicas went
>>> down due to OOM (gone on cloud view), the other replicas also start going
>>> down. I have 6 shards in total, each shard having 2 replicas on separate
>>> machines.
>>> 
>>> The expected behavior is that all shard replicas which get hit first
>>> should go down due to OOM and then the other replicas should survive and
>>> provide High Availability.
>>> 
>>> The setup I am testing with is Solr 6.0, and I am wondering if this would
>>> remain the same with 6.6 or if there have been some known improvements
>>> made to avoid spreading OOM to the second/third set of replicas and
>>> bringing the whole cluster down.
>>> 
>>> Any info on this is appreciated.
>>> 
>>> Thanks,
>>> Susheel
>> 
>> 
