Hi Rick,
I took a quick look at the GC logs and didn’t see any obvious issues. You mentioned that a batch of 500 documents takes ~20s to process. With 5-7 indexing threads that is ~150 documents/s. Are those big documents? With 200 queries/min (~3-4 queries/s - what sort of queries?) and 5-7 indexing threads, you might be overloading 4 cores. Do you have dedicated ZK nodes? Do you see the same issues with fewer indexing threads?
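A rough sanity check of those numbers, as a small Python sketch. It assumes each indexing thread sends its 500-document batches back to back, which has not been confirmed in this thread; the constant and function names are made up for illustration.

```python
# Back-of-the-envelope load estimate using the figures quoted in this thread.
BATCH_SIZE = 500          # documents per batch
BATCH_SECONDS = 20.0      # approximate processing time per batch
QUERIES_PER_MINUTE = 200  # query load
CORES = 4                 # cores per node


def docs_per_second(threads: int) -> float:
    """Aggregate indexing rate, assuming every thread indexes continuously."""
    return threads * BATCH_SIZE / BATCH_SECONDS


low, high = docs_per_second(5), docs_per_second(7)
qps = QUERIES_PER_MINUTE / 60

print(f"indexing: {low:.0f}-{high:.0f} docs/s")
print(f"queries:  {qps:.1f}/s")
print(f"indexing threads per core: {5 / CORES:.2f}-{7 / CORES:.2f}")
```

With more indexing threads than cores, queries and searcher warming have to compete with indexing for CPU, which is why trying fewer threads is a cheap experiment.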
Regards,
Emir
--
Monitoring - Log Management - Alerting - Anomaly Detection
Solr & Elasticsearch Consulting Support Training - http://sematext.com/

> On 4 Nov 2017, at 14:25, Rick Dig <teram...@gmail.com> wrote:
>
> not committing after the batch. made sure we have that turned off.
> maxTime is set to 300000 (300 seconds), openSearcher is set to true.
>
>
> On Sat, Nov 4, 2017 at 6:50 PM, Amrit Sarkar <sarkaramr...@gmail.com> wrote:
>
>> Pretty much what Emir has stated. I want to know, when you saw;
>>
>>> all of this runs perfectly ok when indexing isn't happening. as soon as
>>> we start "nrt" indexing one of the follower nodes goes down within 10 to 20
>>> minutes.
>>
>>
>> When you say "NRT" indexing, what is the commit strategy in indexing. With
>> auto-commit so highly set, are you committing after batch, if yes, what's
>> the number.
>>
>> Amrit Sarkar
>> Search Engineer
>> Lucidworks, Inc.
>> 415-589-9269
>> www.lucidworks.com
>> Twitter http://twitter.com/lucidworks
>> LinkedIn: https://www.linkedin.com/in/sarkaramrit2
>>
>> On Sat, Nov 4, 2017 at 2:47 PM, Emir Arnautović <emir.arnauto...@sematext.com> wrote:
>>
>>> Hi Rick,
>>> Do you see any errors in logs? Do you have any monitoring tool? Maybe you
>>> can check heap and GC metrics around time when incident happened. It is not
>>> large heap but some major GC could cause pause large enough to trigger some
>>> snowball and end up with node in recovery state.
>>> What is indexing rate you observe? Why do you have max warming searchers 5
>>> (did you mean this with autowarmingsearchers?) when you commit every 5 min?
>>> Why did you increase it - you seen errors with default 2? Maybe you commit
>>> every bulk?
>>> Do you see similar behaviour when you just do indexing without queries?
>>>
>>> Thanks,
>>> Emir
>>> --
>>> Monitoring - Log Management - Alerting - Anomaly Detection
>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>>
>>>
>>>
>>>> On 4 Nov 2017, at 05:15, Rick Dig <teram...@gmail.com> wrote:
>>>>
>>>> hello all,
>>>> we are trying to run solrcloud 6.6 in a production setting.
>>>> here's our config and issue
>>>> 1) 3 nodes, 1 shard, replication factor 3
>>>> 2) all nodes are 16GB RAM, 4 core
>>>> 3) Our production load is about 2000 requests per minute
>>>> 4) index is fairly small, index size is around 400 MB with 300k documents
>>>> 5) autocommit is currently set to 5 minutes (even though ideally we would
>>>> like a smaller interval).
>>>> 6) the jvm runs with 8 gb Xms and Xmx with CMS gc.
>>>> 7) all of this runs perfectly ok when indexing isn't happening. as soon as
>>>> we start "nrt" indexing one of the follower nodes goes down within 10 to 20
>>>> minutes. from this point on the nodes never recover unless we stop
>>>> indexing. the master usually is the last one to fall.
>>>> 8) there are maybe 5 to 7 processes indexing at the same time with document
>>>> batch sizes of 500.
>>>> 9) maxRambuffersizeMB is 100, autowarmingsearchers is 5,
>>>> 10) no cpu and / or oom issues that we can see.
>>>> 11) cpu load does go fairly high 15 to 20 at times.
>>>> any help or pointers appreciated
>>>>
>>>> thanks
>>>> rick