Hi Erick, Shawn,

Thanks for following this up.

1, For some reason, ramBufferSizeMB in our solrconfig.xml is set to 32MB rather than 100MB. Given that we have 10G for the JVM, my understanding is that we should not run out of memory just because a large number of documents are being added to Solr. To make sure I understand it correctly: the documents added to Solr are stored in an internal queue in Solr, and Solr will only use that 32MB (or 99% of 32MB plus the memory for one extra document) for indexing documents. The documents in the queue are indexed one by one.

2, Based on our Tomcat (Solr) access_log and our website peak hours, the cluster failure is not likely to have been caused by _searching_ traffic. E.g., we can see many more Solr requests with the 'update' keyword, but only the usual number of requests with the 'select' keyword.

3, This leads me to the only reason I can think of (you mentioned this earlier as well): since each shard has 4 replicas in our setup, when a large number of documents are being added, the Leader creates a lot of threads to send the documents to the other replica servers. These threads are what consumed all the memory on the Leader server and led to the OOM.

If my assumption is right, the options to fix this issue are:
a): still limit the number of documents being added to Solr
b): change to 2 replicas for each shard (loss of data reliability, but..)
c): bump up server memory.

Am I going the right way? Any advice and suggestions are much appreciated!!

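To check my understanding of point 1 against Erick's note quoted below (index the doc into the in-memory structures first, then check the limit and flush), here is a toy Python model. It is not Solr or Lucene code: the class, its fields and the document sizes are all made up for illustration, and it leaves out the queueing I described. It only shows why the buffer can briefly grow well past ramBufferSizeMB when one large document arrives.

```python
# Toy model of "add the doc, then check the limit and flush if exceeded".
# Hypothetical names and sizes; this does not call or mirror any Solr API.

class ToyIndexBuffer:
    def __init__(self, ram_buffer_mb):
        self.ram_buffer_mb = ram_buffer_mb  # stands in for ramBufferSizeMB
        self.used_mb = 0                    # memory currently held in the buffer
        self.peak_mb = 0                    # largest size the buffer ever reached
        self.flushes = 0                    # how many times the buffer was flushed

    def add(self, doc_mb):
        # Step 1: the document goes into the in-memory structures first.
        self.used_mb += doc_mb
        self.peak_mb = max(self.peak_mb, self.used_mb)
        # Step 2: only afterwards is the limit checked, and the buffer flushed.
        if self.used_mb > self.ram_buffer_mb:
            self.flushes += 1
            self.used_mb = 0


buf = ToyIndexBuffer(ram_buffer_mb=32)
for _ in range(31):      # 31 small docs of 1 MB each: still under the 32 MB limit
    buf.add(1)
buf.add(50)              # one very large doc arrives while the buffer is nearly full

print(buf.peak_mb, buf.flushes, buf.used_mb)
```

In this model the peak is the nearly full buffer plus the one large document, so the overshoot is bounded by the size of a single document rather than by the number of documents sent.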
Also attached is part of the catalina.out OOM log for reference:

Exception in thread "http-bio-8983-exec-6571" java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)
Exception in thread "http-bio-8983-exec-6861" java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)
Exception in thread "http-bio-8983-exec-6671" java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:714)
        at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
        at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:745)

Many thanks,
Tim

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Saturday, 6 August 2016 2:31 AM
To: solr-user
Subject: Re: Solr Cloud with 5 servers cluster failed due to Leader out of memory

You don't really have to worry that much about memory consumed during indexing. The ramBufferSizeMB setting in solrconfig.xml pretty much limits the amount of RAM consumed: when adding a doc, if that limit is exceeded then the buffer is flushed.

So you can reduce that number, but its default is 100M and if you're running that close to your limits I suspect you'd get, at best, a bit more runway before you hit the problem again.

NOTE: that number isn't an absolute limit, IIUC the algorithm is
> index a doc to the in-memory structures, check if the limit is exceeded
> and flush if so.

So say you were at 99% of your ramBufferSizeMB setting and then indexed a ginormous doc: your in-memory stuff might be significantly bigger.

Searching usually is the bigger RAM consumer, so when I say "a bit more runway" what I'm thinking about is that when you start _searching_ the data your memory requirements will continue to grow and you'll be back where you started.

And just as a sanity check: You didn't perchance increase the maxWarmingSearchers parameter in solrconfig.xml, did you? If so, that's really a red flag.

Best,
Erick

On Fri, Aug 5, 2016 at 12:41 AM, Tim Chen <tim.c...@sbs.com.au> wrote:
> Thanks Guys. Very very helpful.
>
> I will probably look at consolidating 4 Solr servers into 2 bigger/better
> servers - it gives more memory, and it cuts down the replicas the Leader
> needs to manage.
>
> Also, I may look into writing a script to monitor the tomcat log and if
> there is OOM, kill tomcat, then restart it. A bit dirty, but may work for a
> short term.
>
> I don't know too much about how documents are indexed, and how to save
> memory from that. Will probably work with a developer on this as well.
>
> Many Thanks guys.
>
> Cheers,
> Tim
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Friday, 5 August 2016 4:55 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr Cloud with 5 servers cluster failed due to Leader out of memory
>
> On 8/4/2016 8:14 PM, Tim Chen wrote:
>> Couple of thoughts: 1, If Leader goes down, it should just go down,
>> like dead down, so other servers can do the election and choose the
>> new leader. This at least avoids bringing down the whole cluster. Am
>> I right?
>
> Supplementing what Erick told you:
>
> When a typical Java program throws OutOfMemoryError, program behavior is
> completely unpredictable. There are programming techniques that can be used
> so that behavior IS predictable, but writing that code can be challenging.
>
> Solr 5.x and 6.x, when they are started on a UNIX/Linux system, use a Java
> option to execute a script when OutOfMemoryError happens. This script kills
> Solr completely. We are working on adding this capability when running on
> Windows.
>
>> 2, Apparently we should not be pushing too many documents to Solr, how
>> do you guys handle this? Set a limit somewhere?
>
> There are exactly two ways to deal with OOME problems: Increase the heap or
> reduce Solr's memory requirements. The number of documents you push to Solr
> is unlikely to have a large effect on the amount of memory that Solr
> requires. Here's some information on this topic:
>
> https://wiki.apache.org/solr/SolrPerformanceProblems#Java_Heap
>
> Thanks,
> Shawn