If it's a thread and you have plenty of RAM and the heap is fine, have you checked raising OS thread limits?
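If the OutOfMemoryError is the "unable to create new native thread"
flavor, it is usually the per-user process limit (ulimit -u) or
kernel.threads-max that is exhausted rather than the heap. Here is a
rough sketch of how you could compare the JVM's view against the OS
ceiling (assumes Linux; the class name and the /proc path are only for
illustration, this is not Solr code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadLimitCheck {
  public static void main(String[] args) throws IOException {
    // Live and peak thread counts as the JVM sees them (JRockit also
    // implements the standard java.lang.management beans).
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    System.out.println("Live JVM threads: " + threads.getThreadCount());
    System.out.println("Peak JVM threads: " + threads.getPeakThreadCount());

    // System-wide thread ceiling (Linux only). The per-user limit from
    // ulimit -u / /etc/security/limits.conf is not visible here, so check
    // that separately for the user running Tomcat.
    BufferedReader r = new BufferedReader(
        new FileReader("/proc/sys/kernel/threads-max"));
    try {
      System.out.println("kernel.threads-max: " + r.readLine().trim());
    } finally {
      r.close();
    }
  }
}

If the peak thread count is anywhere near those limits, raising them (or
figuring out why so many threads are being created) is the right place
to start.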
- Mark

On Tue, Oct 6, 2015 at 4:54 PM Rallavagu <rallav...@gmail.com> wrote:
> GC logging shows normal. The "OutOfMemoryError" appears to be pertaining
> to a thread but not to JVM.
>
> On 10/6/15 1:07 PM, Mark Miller wrote:
> > That amount of RAM can easily be eaten up depending on your sorting,
> > faceting, data.
> >
> > Do you have gc logging enabled? That should describe what is happening
> > with the heap.
> >
> > - Mark
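(For reference: on a HotSpot JVM, GC logging is typically enabled with
flags along these lines; JRockit uses its own -Xverbose:gc style options
instead, and the log path below is made up, so treat this purely as a
sketch.)

-verbose:gc
-Xloggc:/var/log/solr/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps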
> >
> > On Tue, Oct 6, 2015 at 4:04 PM Rallavagu <rallav...@gmail.com> wrote:
> >> Mark - currently 5.3 is being evaluated for upgrade purposes and
> >> hopefully we will get there soon. Meanwhile, the following exception is
> >> noted in the logs during updates:
> >>
> >> ERROR org.apache.solr.update.CommitTracker – auto commit
> >> error...:java.lang.IllegalStateException: this writer hit an
> >> OutOfMemoryError; cannot commit
> >>         at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
> >>         at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
> >>         at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
> >>         at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
> >>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
> >>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
> >>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
> >>         at java.lang.Thread.run(Thread.java:682)
> >>
> >> Considering that the machine is configured with 48G (24G for the JVM,
> >> which will be reduced in the future), I am wondering how it would still
> >> go out of memory. For memory-mapped index files, the remaining 24G (or
> >> whatever is available of it) should be available. Looking at the lsof
> >> output, the memory-mapped files were around 10G.
> >>
> >> Thanks.
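On the question of where the memory goes: the Java heap and the
memory-mapped index files come out of different pools, so it helps to
look at them separately. A small sketch of what you could print from
inside the JVM (the buffer-pool part needs a Java 7+ JVM, so it will not
work on JRockit R28; the class name is only for illustration):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class MemoryHeadroom {
  public static void main(String[] args) {
    // Heap as the JVM sees it: the -Xmx ceiling vs. what is currently
    // committed and used.
    Runtime rt = Runtime.getRuntime();
    long mb = 1024 * 1024;
    System.out.println("max heap : " + rt.maxMemory() / mb + " MB");
    System.out.println("committed: " + rt.totalMemory() / mb + " MB");
    System.out.println("used     : "
        + (rt.totalMemory() - rt.freeMemory()) / mb + " MB");

    // Mapped index files (MMapDirectory) show up in the "mapped" buffer
    // pool, not in the heap. Java 7+ only.
    List<BufferPoolMXBean> pools =
        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
    for (BufferPoolMXBean pool : pools) {
      System.out.println(pool.getName() + ": " + pool.getCount()
          + " buffers, " + pool.getMemoryUsed() / mb + " MB");
    }
  }
}

That said, if the OutOfMemoryError really is thread-related, neither of
these pools is the one that ran out.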
> >>
> >> On 10/5/15 5:41 PM, Mark Miller wrote:
> >>> I'd make two guesses:
> >>>
> >>> Looks like you are using JRockit? I don't think that is common or well
> >>> tested at this point.
> >>>
> >>> There are a billion or so bug fixes from 4.6.1 to 5.3.2. Given the pace
> >>> of SolrCloud, you are dealing with something fairly ancient, and most
> >>> likely it will be harder to find help with older issues.
> >>>
> >>> - Mark
> >>>
> >>> On Mon, Oct 5, 2015 at 12:46 PM Rallavagu <rallav...@gmail.com> wrote:
> >>>> Any takers on this? Any kind of clue would help. Thanks.
> >>>>
> >>>> On 10/4/15 10:14 AM, Rallavagu wrote:
> >>>>> As there were no responses so far, I assume that this is not a very
> >>>>> common issue that folks come across. So, I went into the source
> >>>>> (4.6.1) to see if I can figure out what could be the cause.
> >>>>>
> >>>>> The thread that is holding the lock is in this block of code:
> >>>>>
> >>>>>   synchronized (recoveryLock) {
> >>>>>     // to be air tight we must also check after lock
> >>>>>     if (cc.isShutDown()) {
> >>>>>       log.warn("Skipping recovery because Solr is shutdown");
> >>>>>       return;
> >>>>>     }
> >>>>>     log.info("Running recovery - first canceling any ongoing recovery");
> >>>>>     cancelRecovery();
> >>>>>
> >>>>>     while (recoveryRunning) {
> >>>>>       try {
> >>>>>         recoveryLock.wait(1000);
> >>>>>       } catch (InterruptedException e) {
> >>>>>
> >>>>>       }
> >>>>>       // check again for those that were waiting
> >>>>>       if (cc.isShutDown()) {
> >>>>>         log.warn("Skipping recovery because Solr is shutdown");
> >>>>>         return;
> >>>>>       }
> >>>>>       if (closed) return;
> >>>>>     }
> >>>>>     // ... (method continues)
> >>>>>
> >>>>> Subsequently, the thread gets into the cancelRecovery method, as
> >>>>> below:
> >>>>>
> >>>>>   public void cancelRecovery() {
> >>>>>     synchronized (recoveryLock) {
> >>>>>       if (recoveryStrat != null && recoveryRunning) {
> >>>>>         recoveryStrat.close();
> >>>>>         while (true) {
> >>>>>           try {
> >>>>>             recoveryStrat.join();
> >>>>>           } catch (InterruptedException e) {
> >>>>>             // not interruptible - keep waiting
> >>>>>             continue;
> >>>>>           }
> >>>>>           break;
> >>>>>         }
> >>>>>
> >>>>>         recoveryRunning = false;
> >>>>>         recoveryLock.notifyAll();
> >>>>>       }
> >>>>>     }
> >>>>>   }
> >>>>>
> >>>>> As per the stack trace, "recoveryStrat.join()" is where things are
> >>>>> holding up.
> >>>>>
> >>>>> I wonder why/how cancelRecovery would take so long that around 870
> >>>>> threads end up waiting on it. Is it possible that ZK is not
> >>>>> responding, or could something else, like operating system resources,
> >>>>> cause this? Thanks.
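To make that failure mode concrete, here is a tiny self-contained model
of the pattern (this is not Solr code; all class names, thread names and
timings are made up): one thread calls join() on a slow "recovery"
thread while holding the lock, and every other caller queues up blocked
on the monitor, which is exactly the shape of the dumps quoted below.

public class RecoveryLockConvoy {
  private static final Object recoveryLock = new Object();

  public static void main(String[] args) throws InterruptedException {
    // Stands in for RecoveryStrategy: a recovery that takes a long time
    // to finish or be cancelled.
    final Thread slowRecovery = new Thread(new Runnable() {
      public void run() {
        try { Thread.sleep(30000); } catch (InterruptedException e) { }
      }
    }, "slow-recovery");
    slowRecovery.start();

    // Stands in for the first doRecovery() caller: it joins the recovery
    // thread while still holding recoveryLock (like cancelRecovery()).
    new Thread(new Runnable() {
      public void run() {
        synchronized (recoveryLock) {
          try { slowRecovery.join(); } catch (InterruptedException e) { }
        }
      }
    }, "holder").start();

    Thread.sleep(100); // let the holder acquire the lock first

    // Stands in for the other doRecovery() callers: all of them block on
    // monitorEnter, just like the 870 threads in the dump.
    for (int i = 0; i < 10; i++) {
      new Thread(new Runnable() {
        public void run() {
          synchronized (recoveryLock) {
            // nothing to do; we only get here once the holder lets go
          }
        }
      }, "blocked-" + i).start();
    }
    // A thread dump taken now shows one thread inside Thread.join()
    // holding the lock and ten threads "Blocked trying to get lock".
  }
}

So the interesting question is not the blocked threads themselves but
why the recovery thread being joined takes so long to exit, which is
where things like ZooKeeper connectivity (as you suspected) would come
into play.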
> >>>>>
> >>>>> On 10/2/15 4:17 PM, Rallavagu wrote:
> >>>>>> Here is the stack trace of the thread that is holding the lock.
> >>>>>>
> >>>>>> "Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
> >>>>>> native_blocked, daemon
> >>>>>>     -- Waiting for notification on:
> >>>>>>        org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>>>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>>>     at RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
> >>>>>>     at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
> >>>>>>     at java/lang/Object.wait(J)V(Native Method)
> >>>>>>     at java/lang/Thread.join(Thread.java:1206)
> >>>>>>     ^-- Lock released while waiting:
> >>>>>>         org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>>>>>     at java/lang/Thread.join(Thread.java:1259)
> >>>>>>     at org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
> >>>>>>     ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
> >>>>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
> >>>>>>     ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>>>>>
> >>>>>> Here is the stack trace of one of the 870 threads that are waiting
> >>>>>> for the lock to be released.
> >>>>>>
> >>>>>> "Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
> >>>>>> native_blocked, daemon
> >>>>>>     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>>>     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>>>>>     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>>>>>     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>>>>>     at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>>>>>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>>>>>     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>>>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>>>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>>>>>
> >>>>>> On 10/2/15 4:12 PM, Rallavagu wrote:
> >>>>>>> Solr 4.6.1 on Tomcat 7, single shard, 4 node cloud with 3 node
> >>>>>>> ZooKeeper.
> >>>>>>>
> >>>>>>> During updates, some nodes go to very high CPU and become
> >>>>>>> unavailable. The thread dump shows the following thread is blocked;
> >>>>>>> there are 870 threads like this, which explains the high CPU. Any
> >>>>>>> clues on where to look?
> >>>>>>>
> >>>>>>> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
> >>>>>>> native_blocked, daemon
> >>>>>>>     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>>>>     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>>>>>>     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>>>>>>     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>>>>>>     at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>>>>>>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>>>>>>     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>>>>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>>>>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--
- Mark
about.me/markrmiller