Hey guys,

(copy of my post to SOLR-5216)

We tested this patch and unfortunately encountered some serious issues after
a few hours of 500 update-batches/sec. Our update batches are 10 docs each,
so we are writing about 5000 docs/sec total, relying on autoCommit to commit
the updates (no explicit commits).
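
In case it helps to reproduce, the shape of our updates is roughly the
following (a minimal SolrJ 4.x sketch, not our actual multi-threaded load
generator; the ZooKeeper hosts, collection name, ids and doc count are
placeholders):

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchUpdateLoad {
    public static void main(String[] args) throws Exception {
        // Placeholder ZooKeeper ensemble and collection name.
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("collection1");
        List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(10);
        for (long i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Long.toString(i));
            batch.add(doc);
            if (batch.size() == 10) {
                solr.add(batch);  // 1 batch = 1 /update call; no explicit commit,
                batch.clear();    // soft/hard autoCommit does the committing
            }
        }
        solr.shutdown();
    }
}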

Our environment:

    Solr 4.3.1 w/SOLR-5216 patch.
    Jetty 9, Java 1.7.
    3 solr instances, 1 per physical server.
    1 collection.
    3 shards.
    2 replicas (each instance is a leader and a replica).
    Soft autoCommit is 1000ms.
    Hard autoCommit is 15000ms (config sketch below).
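
Those two commit values come from a solrconfig.xml block along these lines
(a sketch; openSearcher=false on the hard commit is an assumption of the
sketch, not a confirmed detail of our config):

<updateHandler class="solr.DirectUpdateHandler2">
  <autoCommit>
    <maxTime>15000</maxTime>          <!-- hard commit every 15s -->
    <openSearcher>false</openSearcher>
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>           <!-- soft commit every 1s, for visibility -->
  </autoSoftCommit>
</updateHandler>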

After about 6 hours of stress-testing this patch, we see many stalled
transactions (linked below), and the Solr instances start to see each other
as down, flooding our Solr logs with "Connection Refused" exceptions and
otherwise nothing obviously useful that I could see.

I did, however, notice stalled transactions on both /select and /update;
this never occurred without the patch.

Stack trace /select seems stalled on: http://pastebin.com/Y1NCrXGC
Stack trace /update seems stalled on: http://pastebin.com/cFLbC8Y9

Lastly, I have a summary of the ERROR-severity logs from this 24-hour soak.
My script "normalizes" the ERROR-severity stack traces and returns them in
order of occurrence.

Summary of my solr.log: http://pastebin.com/pBdMAWeb
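
(The "normalization" is nothing fancy; conceptually it works like the sketch
below, which is illustrative rather than the actual script. It collapses
ERROR lines that differ only in volatile details and prints them in order of
first occurrence:)

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.LinkedHashMap;
import java.util.Map;

public class ErrorSummary {
    public static void main(String[] args) throws Exception {
        // args[0] is the path to solr.log. Strip timestamps, hex object ids
        // and source line numbers so identical failures collapse into one key;
        // LinkedHashMap keeps first-seen order, i.e. order of occurrence.
        Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            if (!line.contains("ERROR")) continue;
            String key = line.replaceAll("^.*?ERROR", "ERROR")
                             .replaceAll("0x[0-9a-fA-F]+", "0xN")
                             .replaceAll(":\\d+\\)", ":N)");
            Integer n = counts.get(key);
            counts.put(key, n == null ? 1 : n + 1);
        }
        in.close();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}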

Thanks!

Tim Vaillancourt


On 6 September 2013 07:27, Markus Jelsma <markus.jel...@openindex.io> wrote:

> Thanks!
>
> -----Original message-----
> > From:Erick Erickson <erickerick...@gmail.com>
> > Sent: Friday 6th September 2013 16:20
> > To: solr-user@lucene.apache.org
> > Subject: Re: SolrCloud 4.x hangs under high update volume
> >
> > Markus:
> >
> > See: https://issues.apache.org/jira/browse/SOLR-5216
> >
> >
> > On Wed, Sep 4, 2013 at 11:04 AM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >
> > > Hi Mark,
> > >
> > > Got an issue to watch?
> > >
> > > Thanks,
> > > Markus
> > >
> > > -----Original message-----
> > > > From:Mark Miller <markrmil...@gmail.com>
> > > > Sent: Wednesday 4th September 2013 16:55
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > > >
> > > > I'm going to try and fix the root cause for 4.5 - I've suspected what it
> > > > is since early this year, but it's never personally been an issue, so
> > > > it's rolled along for a long time.
> > > >
> > > > Mark
> > > >
> > > > Sent from my iPhone
> > > >
> > > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com>
> > > > wrote:
> > > >
> > > > > Hey guys,
> > > > >
> > > > > I am looking into an issue we've been having with SolrCloud since the
> > > > > beginning of our testing, all the way from 4.1 to 4.3 (haven't tested
> > > > > 4.4.0 yet). I've noticed other users with this same issue, so I'd
> > > > > really like to get to the bottom of it.
> > > > >
> > > > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours
> > > > > we see stalled transactions that snowball to consume all Jetty threads
> > > > > in the JVM. This eventually causes the JVM to hang with most threads
> > > > > waiting on the condition/stack provided at the bottom of this message.
> > > > > At this point the SolrCloud instances start to see their neighbors
> > > > > (who also have all threads hung) as down w/"Connection Refused", and
> > > > > the shards become "down" in state. Sometimes a node or two survives
> > > > > and just returns 503 "no server hosting shard" errors.
> > > > >
> > > > > As a workaround/experiment, we have tuned the number of threads
> > > > > sending updates to Solr, as well as the batch size (we batch updates
> > > > > from client -> Solr), and the soft/hard autoCommits, all to no avail.
> > > > > We also tried turning off client-to-Solr batching (1 update = 1 call
> > > > > to Solr), which did not help. Certain combinations of update threads
> > > > > and batch sizes seem to mask/help the problem, but not resolve it
> > > > > entirely.
> > > > >
> > > > > Our current environment is the following:
> > > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard
> > > > > and a replica of 1 shard).
> > > > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a
> > > > > good day.
> > > > > - 5000 max Jetty threads (well above what we use when we are healthy);
> > > > > Linux-user threads ulimit is 6000.
> > > > > - Occurs under Jetty 8 or 9 (many versions).
> > > > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > > > - Occurs under several JVM tunings.
> > > > > - Everything seems to point to Solr itself, and not a Jetty or Java
> > > > > version (I hope I'm wrong).
> > > > >
> > > > > The stack trace that is holding up all my Jetty QTP threads is the
> > > > > following, which seems to be waiting on a lock that I would very much
> > > > > like to understand further:
> > > > >
> > > > > "java.lang.Thread.State: WAITING (parking)
> > > > >    at sun.misc.Unsafe.park(Native Method)
> > > > >    - parking to wait for  <0x00000007216e68d8> (a
> > > > > java.util.concurrent.Semaphore$NonfairSync)
> > > > >    at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > > >    at
> > > > >
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > > > >    at
> > > > >
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > > > >    at
> > > > >
> > >
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > > > >    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > > > >    at
> > > > >
> > >
> org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> > > > >    at
> > > > >
> > >
> org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> > > > >    at
> > > > >
> > >
> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> > > > >    at
> > > > >
> > >
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > >    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> > > > >    at
> > > > >
> > >
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> > > > >    at
> > > > >
> > >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> > > > >    at
> > > > >
> > >
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > > > >    at org.eclipse.jetty.server.Server.handle(Server.java:445)
> > > > >    at
> org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> > > > >    at
> > > > >
> > >
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> > > > >    at java.lang.Thread.run(Thread.java:724)"
> > > > >
> > > > > Some questions I had were:
> > > > > 1) What exclusive locks does SolrCloud "make" when performing an
> > > > > update?
> > > > > 2) Keeping in mind I do not read or write Java (sorry :D), could
> > > > > someone help me understand "what" Solr is locking in this case at
> > > > > "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
> > > > > when performing an update? That will help me understand where to look
> > > > > next.
> > > > > 3) It seems all threads in this state are waiting for
> > > > > "0x00000007216e68d8"; is there a way to tell what "0x00000007216e68d8"
> > > > > is?
> > > > > 4) Is there a limit to how many updates you can do in SolrCloud?
> > > > > 5) Wild-ass theory: would more shards provide more locks (whatever
> > > > > they are) on update, and thus more update throughput?
> > > > >
> > > > > To those interested, I've provided a stack trace of 1 of 3 nodes at
> > > > > this URL in gzipped form:
> > > > > https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> > > > >
> > > > > Any help/suggestions/ideas on this issue, big or small, would be much
> > > > > appreciated.
> > > > >
> > > > > Thanks so much all!
> > > > >
> > > > > Tim Vaillancourt
> > > >
> > >
> >
>
