If it's a thread and you have plenty of RAM and the heap is fine, have you checked raising OS thread limits?
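If the OutOfMemoryError is the "unable to create new native thread"
flavor, it is usually the per-user process limit (ulimit -u) or
kernel.threads-max that is exhausted rather than the heap. Here is a
rough sketch of how you could compare the JVM's view against the OS
ceiling (assumes Linux; the class name and the /proc path are only for
illustration, this is not Solr code):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class ThreadLimitCheck {
  public static void main(String[] args) throws IOException {
    // Live and peak thread counts as the JVM sees them (JRockit also
    // implements the standard java.lang.management beans).
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    System.out.println("Live JVM threads: " + threads.getThreadCount());
    System.out.println("Peak JVM threads: " + threads.getPeakThreadCount());

    // System-wide thread ceiling (Linux only). The per-user limit from
    // ulimit -u / /etc/security/limits.conf is not visible here, so check
    // that separately for the user running Tomcat.
    BufferedReader r = new BufferedReader(
        new FileReader("/proc/sys/kernel/threads-max"));
    try {
      System.out.println("kernel.threads-max: " + r.readLine().trim());
    } finally {
      r.close();
    }
  }
}

If the peak thread count is anywhere near those limits, raising them (or
figuring out why so many threads are being created) is the right place
to start.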
- Mark

On Tue, Oct 6, 2015 at 4:54 PM Rallavagu <rallav...@gmail.com> wrote:
> GC logging shows normal. The "OutOfMemoryError" appears to be pertaining
> to a thread but not to JVM.
>
> On 10/6/15 1:07 PM, Mark Miller wrote:
> > That amount of RAM can easily be eaten up depending on your sorting,
> > faceting, data.
> >
> > Do you have gc logging enabled? That should describe what is happening
> > with the heap.
> >
> > - Mark
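(For reference: on a HotSpot JVM, GC logging is typically enabled with
flags along these lines; JRockit uses its own -Xverbose:gc style options
instead, and the log path below is made up, so treat this purely as a
sketch.)

-verbose:gc
-Xloggc:/var/log/solr/gc.log
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps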
> >
> > On Tue, Oct 6, 2015 at 4:04 PM Rallavagu <rallav...@gmail.com> wrote:
> >> Mark - currently 5.3 is being evaluated for upgrade purposes and
> >> hopefully we will get there soon. Meanwhile, the following exception is
> >> noted in the logs during updates:
> >>
> >> ERROR org.apache.solr.update.CommitTracker – auto commit
> >> error...:java.lang.IllegalStateException: this writer hit an
> >> OutOfMemoryError; cannot commit
> >>         at org.apache.lucene.index.IndexWriter.prepareCommitInternal(IndexWriter.java:2807)
> >>         at org.apache.lucene.index.IndexWriter.commitInternal(IndexWriter.java:2984)
> >>         at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:559)
> >>         at org.apache.solr.update.CommitTracker.run(CommitTracker.java:216)
> >>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:440)
> >>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
> >>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:206)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:896)
> >>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:919)
> >>         at java.lang.Thread.run(Thread.java:682)
> >>
> >> Considering that the machine is configured with 48G (24G for the JVM,
> >> which will be reduced in the future), I am wondering how it would still
> >> go out of memory. For memory-mapped index files, the remaining 24G (or
> >> whatever is available of it) should be available. Looking at the lsof
> >> output, the memory-mapped files were around 10G.
> >>
> >> Thanks.
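On the question of where the memory goes: the Java heap and the
memory-mapped index files come out of different pools, so it helps to
look at them separately. A small sketch of what you could print from
inside the JVM (the buffer-pool part needs a Java 7+ JVM, so it will not
work on JRockit R28; the class name is only for illustration):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

public class MemoryHeadroom {
  public static void main(String[] args) {
    // Heap as the JVM sees it: the -Xmx ceiling vs. what is currently
    // committed and used.
    Runtime rt = Runtime.getRuntime();
    long mb = 1024 * 1024;
    System.out.println("max heap : " + rt.maxMemory() / mb + " MB");
    System.out.println("committed: " + rt.totalMemory() / mb + " MB");
    System.out.println("used     : "
        + (rt.totalMemory() - rt.freeMemory()) / mb + " MB");

    // Mapped index files (MMapDirectory) show up in the "mapped" buffer
    // pool, not in the heap. Java 7+ only.
    List<BufferPoolMXBean> pools =
        ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
    for (BufferPoolMXBean pool : pools) {
      System.out.println(pool.getName() + ": " + pool.getCount()
          + " buffers, " + pool.getMemoryUsed() / mb + " MB");
    }
  }
}

That said, if the OutOfMemoryError really is thread-related, neither of
these pools is the one that ran out.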
> >>
> >> On 10/5/15 5:41 PM, Mark Miller wrote:
> >>> I'd make two guesses:
> >>>
> >>> Looks like you are using JRockit? I don't think that is common or well
> >>> tested at this point.
> >>>
> >>> There are a billion or so bug fixes from 4.6.1 to 5.3.2. Given the pace
> >>> of SolrCloud, you are dealing with something fairly ancient, and most
> >>> likely it will be harder to find help with older issues.
> >>>
> >>> - Mark
> >>>
> >>> On Mon, Oct 5, 2015 at 12:46 PM Rallavagu <rallav...@gmail.com> wrote:
> >>>> Any takers on this? Any kind of clue would help. Thanks.
> >>>>
> >>>> On 10/4/15 10:14 AM, Rallavagu wrote:
> >>>>> As there were no responses so far, I assume that this is not a very
> >>>>> common issue that folks come across. So, I went into the source
> >>>>> (4.6.1) to see if I can figure out what could be the cause.
> >>>>>
> >>>>> The thread that is holding the lock is in this block of code:
> >>>>>
> >>>>>   synchronized (recoveryLock) {
> >>>>>     // to be air tight we must also check after lock
> >>>>>     if (cc.isShutDown()) {
> >>>>>       log.warn("Skipping recovery because Solr is shutdown");
> >>>>>       return;
> >>>>>     }
> >>>>>     log.info("Running recovery - first canceling any ongoing recovery");
> >>>>>     cancelRecovery();
> >>>>>
> >>>>>     while (recoveryRunning) {
> >>>>>       try {
> >>>>>         recoveryLock.wait(1000);
> >>>>>       } catch (InterruptedException e) {
> >>>>>
> >>>>>       }
> >>>>>       // check again for those that were waiting
> >>>>>       if (cc.isShutDown()) {
> >>>>>         log.warn("Skipping recovery because Solr is shutdown");
> >>>>>         return;
> >>>>>       }
> >>>>>       if (closed) return;
> >>>>>     }
> >>>>>     // ... (method continues)
> >>>>>
> >>>>> Subsequently, the thread gets into the cancelRecovery method, as
> >>>>> below:
> >>>>>
> >>>>>   public void cancelRecovery() {
> >>>>>     synchronized (recoveryLock) {
> >>>>>       if (recoveryStrat != null && recoveryRunning) {
> >>>>>         recoveryStrat.close();
> >>>>>         while (true) {
> >>>>>           try {
> >>>>>             recoveryStrat.join();
> >>>>>           } catch (InterruptedException e) {
> >>>>>             // not interruptible - keep waiting
> >>>>>             continue;
> >>>>>           }
> >>>>>           break;
> >>>>>         }
> >>>>>
> >>>>>         recoveryRunning = false;
> >>>>>         recoveryLock.notifyAll();
> >>>>>       }
> >>>>>     }
> >>>>>   }
> >>>>>
> >>>>> As per the stack trace, "recoveryStrat.join()" is where things are
> >>>>> holding up.
> >>>>>
> >>>>> I wonder why/how cancelRecovery would take so long that around 870
> >>>>> threads end up waiting on it. Is it possible that ZK is not
> >>>>> responding, or could something else, like operating system resources,
> >>>>> cause this? Thanks.
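To make that failure mode concrete, here is a tiny self-contained model
of the pattern (this is not Solr code; all class names, thread names and
timings are made up): one thread calls join() on a slow "recovery"
thread while holding the lock, and every other caller queues up blocked
on the monitor, which is exactly the shape of the dumps quoted below.

public class RecoveryLockConvoy {
  private static final Object recoveryLock = new Object();

  public static void main(String[] args) throws InterruptedException {
    // Stands in for RecoveryStrategy: a recovery that takes a long time
    // to finish or be cancelled.
    final Thread slowRecovery = new Thread(new Runnable() {
      public void run() {
        try { Thread.sleep(30000); } catch (InterruptedException e) { }
      }
    }, "slow-recovery");
    slowRecovery.start();

    // Stands in for the first doRecovery() caller: it joins the recovery
    // thread while still holding recoveryLock (like cancelRecovery()).
    new Thread(new Runnable() {
      public void run() {
        synchronized (recoveryLock) {
          try { slowRecovery.join(); } catch (InterruptedException e) { }
        }
      }
    }, "holder").start();

    Thread.sleep(100); // let the holder acquire the lock first

    // Stands in for the other doRecovery() callers: all of them block on
    // monitorEnter, just like the 870 threads in the dump.
    for (int i = 0; i < 10; i++) {
      new Thread(new Runnable() {
        public void run() {
          synchronized (recoveryLock) {
            // nothing to do; we only get here once the holder lets go
          }
        }
      }, "blocked-" + i).start();
    }
    // A thread dump taken now shows one thread inside Thread.join()
    // holding the lock and ten threads "Blocked trying to get lock".
  }
}

So the interesting question is not the blocked threads themselves but
why the recovery thread being joined takes so long to exit, which is
where things like ZooKeeper connectivity (as you suspected) would come
into play.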
> >>>>>
> >>>>> On 10/2/15 4:17 PM, Rallavagu wrote:
> >>>>>> Here is the stack trace of the thread that is holding the lock.
> >>>>>>
> >>>>>> "Thread-55266" id=77142 idx=0xc18 tid=992 prio=5 alive, waiting,
> >>>>>> native_blocked, daemon
> >>>>>>     -- Waiting for notification on:
> >>>>>>        org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>>>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>>>     at RJNI_jrockit_vm_Threads_waitForNotifySignal+73(rnithreads.c:72)@0x7ff31351939a
> >>>>>>     at jrockit/vm/Threads.waitForNotifySignal(JLjava/lang/Object;)Z(Native Method)
> >>>>>>     at java/lang/Object.wait(J)V(Native Method)
> >>>>>>     at java/lang/Thread.join(Thread.java:1206)
> >>>>>>     ^-- Lock released while waiting:
> >>>>>>         org/apache/solr/cloud/RecoveryStrategy@0x3f34e8480[fat lock]
> >>>>>>     at java/lang/Thread.join(Thread.java:1259)
> >>>>>>     at org/apache/solr/update/DefaultSolrCoreState.cancelRecovery(DefaultSolrCoreState.java:331)
> >>>>>>     ^-- Holding lock: java/lang/Object@0x114d8dd00[recursive]
> >>>>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:297)
> >>>>>>     ^-- Holding lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>>>>>
> >>>>>> Here is the stack trace of one of the 870 threads that are waiting
> >>>>>> for the lock to be released.
> >>>>>>
> >>>>>> "Thread-55489" id=77520 idx=0xebc tid=1494 prio=5 alive, blocked,
> >>>>>> native_blocked, daemon
> >>>>>>     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>>>     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>>>>>     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>>>>>     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>>>>>     at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>>>>>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>>>>>     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>>>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>>>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
> >>>>>>
> >>>>>> On 10/2/15 4:12 PM, Rallavagu wrote:
> >>>>>>> Solr 4.6.1 on Tomcat 7, single shard, 4 node cloud with 3 node
> >>>>>>> ZooKeeper.
> >>>>>>>
> >>>>>>> During updates, some nodes go to very high CPU and become
> >>>>>>> unavailable. The thread dump shows the following thread is blocked;
> >>>>>>> there are 870 threads like this, which explains the high CPU. Any
> >>>>>>> clues on where to look?
> >>>>>>>
> >>>>>>> "Thread-56848" id=79207 idx=0x38 tid=3169 prio=5 alive, blocked,
> >>>>>>> native_blocked, daemon
> >>>>>>>     -- Blocked trying to get lock: java/lang/Object@0x114d8dd00[fat lock]
> >>>>>>>     at pthread_cond_wait@@GLIBC_2.3.2+202(:0)@0x3d4180b5ba
> >>>>>>>     at eventTimedWaitNoTransitionImpl+71(event.c:90)@0x7ff3133b6ba8
> >>>>>>>     at syncWaitForSignalNoTransition+65(synchronization.c:28)@0x7ff31354a0b2
> >>>>>>>     at syncWaitForSignal+189(synchronization.c:85)@0x7ff31354a20e
> >>>>>>>     at syncWaitForJavaSignal+38(synchronization.c:93)@0x7ff31354a327
> >>>>>>>     at jrockit/vm/Threads.waitForUnblockSignal()V(Native Method)
> >>>>>>>     at jrockit/vm/Locks.fatLockBlockOrSpin(Locks.java:1411)[optimized]
> >>>>>>>     at jrockit/vm/Locks.lockFat(Locks.java:1512)[optimized]
> >>>>>>>     at jrockit/vm/Locks.monitorEnterSecondStageHard(Locks.java:1054)[optimized]
> >>>>>>>     at jrockit/vm/Locks.monitorEnterSecondStage(Locks.java:1005)[optimized]
> >>>>>>>     at jrockit/vm/Locks.monitorEnter(Locks.java:2179)[optimized]
> >>>>>>>     at org/apache/solr/update/DefaultSolrCoreState.doRecovery(DefaultSolrCoreState.java:290)
> >>>>>>>     at org/apache/solr/handler/admin/CoreAdminHandler$2.run(CoreAdminHandler.java:770)
> >>>>>>>     at jrockit/vm/RNI.c2java(JJJJJ)V(Native Method)
--
- Mark
about.me/markrmiller