Check the leader and follower logs for anything like "leader initiated recovery" (LIR). One thing I have seen push followers into recovery is when, for some reason, the time it takes the follower to respond to an update exceeds the timeout. The scenario is this:
> leader sends an update
> follower fails to respond for _any_ reason within the timeout
> leader says "sick follower, make it recover"

In the particular case I'm thinking of, indexing the packet took minutes. I strongly doubt that your documents are pathological enough to hit this, but there's at least a chance that the updates are queueing up on the follower and the updates are timing out.
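For illustration, here is a minimal sketch of scanning a node's solr.log for that kind of activity. The log path and the exact phrases are assumptions (the wording of LIR messages varies by Solr version), so treat the patterns as placeholders:

    # scan_lir.py -- look for leader-initiated recovery activity in a Solr log.
    # The default path and the phrases below are assumptions; adjust for your install.
    import re
    import sys

    LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "server/logs/solr.log"

    # Phrases that tend to show up when a leader puts a replica into recovery.
    PATTERNS = re.compile(
        r"leader.initiated recovery"
        r"|LeaderInitiatedRecovery"
        r"|requestrecovery",
        re.IGNORECASE,
    )

    with open(LOG_PATH, errors="replace") as log:
        for line in log:
            if PATTERNS.search(line):
                print(line.rstrip())

Run it against both the leader's and the followers' logs around the time a node drops out; matching lines (and their timestamps) tell you whether the leader is the one forcing the recovery.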
Best,
Erick

On Sun, Nov 5, 2017 at 7:14 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 11/3/2017 10:15 PM, Rick Dig wrote:
>>
>> we are trying to run solrcloud 6.6 in a production setting.
>> here's our config and issue
>> 1) 3 nodes, 1 shard, replication factor 3
>> 2) all nodes are 16GB RAM, 4 core
>> 3) Our production load is about 2000 requests per minute
>> 4) index is fairly small, index size is around 400 MB with 300k documents
>> 5) autocommit is currently set to 5 minutes (even though ideally we would like a smaller interval).
>> 6) the jvm runs with 8 gb Xms and Xmx with CMS gc.
>> 7) all of this runs perfectly ok when indexing isn't happening. as soon as we start "nrt" indexing one of the follower nodes goes down within 10 to 20 minutes. from this point on the nodes never recover unless we stop indexing. the master usually is the last one to fall.
>> 8) there are maybe 5 to 7 processes indexing at the same time with document batch sizes of 500.
>> 9) maxRambuffersizeMB is 100, autowarmingsearchers is 5,
>> 10) no cpu and / or oom issues that we can see.
>> 11) cpu load does go fairly high 15 to 20 at times.
>
> My two cents to add to what you've already seen:
>
> With 300K documents and 400MB of index size, an 8GB heap seems very excessive, even with complex queries. What evidence do you have that you need a heap that size? Are you just following a best practice recommendation you saw somewhere to give half your memory to Java?
>
> This is a *tiny* index by both document count and size. Each document cannot be very big.
>
> Your GC log doesn't show any issues that concern me. There are a few slow GCs, but when you index, that's probably to be expected, especially with an 8GB heap.
>
> What exactly do you mean by "one of the follower nodes goes down"? When this happens, are there error messages at the time of the event? What symptoms are there pertaining to that specific node?
>
> A query load of 2000 per minute is about 33 per second. Are these queries steady for the full minute, or is it bursty? 33 qps is high, but not insane, and with such a tiny index, is probably well within Solr's capabilities.
>
> There should be no reason to *ever* increase maxWarmingSearchers. If you see the warning about this, the fix is to reduce your commit frequency, not increase the value. Increasing the value can lead to memory and performance problems. The fact that this value is even being discussed, and that the value has been changed on your setup, has me thinking that there may be more commits happening than the every-five-minute autocommit.
>
> For automatic commits, I have some recommendations for everyone to start with, and then adjust if necessary: autoCommit: maxTime of 60000, openSearcher false. autoSoftCommit: maxTime of 120000. Neither one should have maxDocs configured.
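As a minimal sketch of what that looks like from the indexing side, the batch below is sent with no commit from the client at all, leaving visibility to the server-side autoSoftCommit interval. The collection name, host, and field names are made up for the example:

    # index_batch.py -- add one batch of documents over HTTP without an explicit
    # commit, leaving durability and visibility to autoCommit / autoSoftCommit.
    # The URL, collection name, and fields are illustrative only.
    import json
    import requests

    SOLR_UPDATE_URL = "http://localhost:8983/solr/mycollection/update"

    # A batch of 500 small documents; real documents would come from your source.
    docs = [{"id": str(i), "title_s": "example doc %d" % i} for i in range(500)]

    resp = requests.post(
        SOLR_UPDATE_URL,
        params={"wt": "json"},           # ask for a JSON response
        data=json.dumps(docs),           # a bare JSON array of documents is an "add"
        headers={"Content-Type": "application/json"},
        timeout=60,
    )
    resp.raise_for_status()
    qtime = resp.json().get("responseHeader", {}).get("QTime")
    print("added %d docs, QTime=%s ms" % (len(docs), qtime))

The only point of the sketch is that the client never sends commit=true; if several indexing processes each commit on every batch, searcher churn like the warning above is the expected result.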
> It should take far less than 20 seconds to index a 500 document batch, especially when they are small enough for 300K of them to produce a 400MB index. There are only a few problems I can imagine right now that could cause such slow indexing, having no real information to go on:
> 1) The analysis chains in your schema are exceptionally heavy and take a long time to run.
> 2) There is a performance issue happening that we have not yet figured out.
> 3) Your indexing request includes a commit, and the commit is happening very slowly.
>
> Here is a log entry on one of my indexes showing 1000 documents being added in 777 milliseconds. The index that this is happening on is about 40GB in size, with about 30 million documents. I have redacted part of the uniqueKey values in this log, to hide the sources of our data:
>
> 2017-11-04 09:30:14.325 INFO (qtp1394336709-42397) [ x:spark6live] o.a.s.u.p.LogUpdateProcessorFactory [spark6live] webapp=/solr path=/update params={wt=javabin&version=2}{add=[REDACTEDsix557224 (1583127266377859072), REDACTEDsix557228 (1583127266381004800), REDACTEDtwo979483 (1583127266381004801), REDACTEDtwo979488 (1583127266382053376), REDACTEDtwo979490 (1583127266383101952), REDACTEDsix557260 (1583127266383101953), REDACTEDsix557242 (1583127266384150528), REDACTEDsix557258 (1583127266385199104), REDACTEDsix557247 (1583127266385199105), REDACTEDsix557276 (1583127266394636288), ... (1000 adds)]} 0 777
>
> The rate I'm getting here of 1000 docs in 777 milliseconds is a rate that I consider to be pretty slow, especially because my indexing is single-threaded. But it works for us. What you're seeing, where 500 documents take 20 seconds, is slower than I've EVER seen, except in situations where there's a serious problem. On a system in good health, with multiple threads indexing, Solr should be able to index several thousand documents every second.
>
> Is the indexing program running on the same machine as Solr, or on another machine? For best results, it should be on a different machine, accessing Solr via HTTP. This is so that whatever load the indexing program creates does not take CPU, memory, and I/O resources away from Solr.
>
> What OS is Solr running on? If more information is needed, it will be a good idea to know precisely how to gather that information.
>
> Overall, based on the information currently available, you should not be having the problems you are. So there must be something about your setup that's not configured correctly, beyond the information we've already got. It could be directly Solr-related, or something else indirectly causing problems. I do not yet know exactly what information we might need to help.
>
> Can you share an entire solr.log file that covers enough time so that there is both indexing and querying happening? If it also covers that node going down, that would be even better. You'll probably need to use a file-sharing website to share the log -- I'm surprised your GC log made it to the list.
>
> Thanks,
> Shawn