The 'lock' is a semaphore that was added to cap the number of threads in use. Previously, the thread count could spike to many, many thousands under heavy updates, so a limit was put on the number of outstanding update requests to keep that from happening - something like 16 * the number of hosts in the cluster.
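To make that concrete, here is a minimal sketch of that kind of cap. It is an illustration only - the real limiter in Solr 4.x sits in SolrCmdDistributor and uses org.apache.solr.util.AdjustableSemaphore, as you can see in the stack trace below - and the class and method names here are made up:

import java.util.concurrent.Semaphore;

// Illustration only (hypothetical names): cap the number of outstanding
// distributed update requests with a semaphore sized relative to the
// cluster, so threads block here instead of piling up without bound.
public class UpdateRequestLimiter {

    private final Semaphore outstanding;

    public UpdateRequestLimiter(int numHosts) {
        // "something like 16 * the number of hosts in the cluster"
        this.outstanding = new Semaphore(16 * numHosts);
    }

    public void submit(Runnable sendRequest) throws InterruptedException {
        outstanding.acquire();     // blocks when too many requests are in flight
        try {
            sendRequest.run();     // forward/distribute the update
        } finally {
            outstanding.release();
        }
    }
}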
I assume the deadlock comes from the fact that requests are of two kinds - forwards to the leader, and distrib updates from the leader to its replicas. A forward to the leader actually waits for the leader to distrib the update to the replicas before returning, and I believe that is what can lead to deadlock. It's also likely why the CloudSolrServer patch can help the situation - it removes the need to forward to the leader because it sends each doc to the correct leader to begin with. That only helps if you are adding docs with CloudSolrServer though, and it's more of a workaround than a fix. The patch uses a separate 'limiting' semaphore for each of the two cases (there's a rough sketch of that idea at the bottom of this message, below the quoted thread).

- Mark

On Sep 4, 2013, at 10:22 AM, Tim Vaillancourt <t...@elementspace.com> wrote:

> Thanks guys! :)
>
> Mark: this patch is much appreciated, I will try to test this shortly, hopefully today.
>
> For my curiosity/understanding, could someone explain to me quickly what locks SolrCloud takes on updates? Was I on to something that more shards decrease the chance for locking?
>
> Secondly, I was wondering if someone could summarize what this patch 'fixes'? I'm not too familiar with Java and the solr codebase (working on that though :D).
>
> Cheers,
>
> Tim
>
>
>
> On 4 September 2013 09:52, Mark Miller <markrmil...@gmail.com> wrote:
> There is an issue if I remember right, but I can't find it right now.
>
> If anyone that has the problem could try this patch, that would be very helpful: http://pastebin.com/raw.php?i=aaRWwSGP
>
> - Mark
>
>
> On Wed, Sep 4, 2013 at 8:04 AM, Markus Jelsma <markus.jel...@openindex.io> wrote:
>
> > Hi Mark,
> >
> > Got an issue to watch?
> >
> > Thanks,
> > Markus
> >
> > -----Original message-----
> > > From: Mark Miller <markrmil...@gmail.com>
> > > Sent: Wednesday 4th September 2013 16:55
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: SolrCloud 4.x hangs under high update volume
> > >
> > > I'm going to try and fix the root cause for 4.5 - I've suspected what it is since early this year, but it's never personally been an issue, so it's rolled along for a long time.
> > >
> > > Mark
> > >
> > > Sent from my iPhone
> > >
> > > On Sep 3, 2013, at 4:30 PM, Tim Vaillancourt <t...@elementspace.com> wrote:
> > >
> > > > Hey guys,
> > > >
> > > > I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it.
> > > >
> > > > Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503s "no server hosting shard" errors.
> > > >
> > > > As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the Soft/Hard autoCommits, all to no avail. Turning off Client-to-Solr batching (1 update = 1 call to Solr) also did not help.
> > > > Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely.
> > > >
> > > > Our current environment is the following:
> > > > - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
> > > > - 3 x Zookeeper instances, external Java 7 JVM.
> > > > - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard).
> > > > - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
> > > > - 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000.
> > > > - Occurs under Jetty 8 or 9 (many versions).
> > > > - Occurs under Java 1.6 or 1.7 (several minor versions).
> > > > - Occurs under several JVM tunings.
> > > > - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong).
> > > >
> > > > The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further:
> > > >
> > > > "java.lang.Thread.State: WAITING (parking)
> > > > at sun.misc.Unsafe.park(Native Method)
> > > > - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync)
> > > > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> > > > at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> > > > at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> > > > at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> > > > at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
> > > > at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
> > > > at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
> > > > at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
> > > > at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
> > > > at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
> > > > at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
> > > > at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
> > > > at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
> > > > at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> > > > at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
> > > > at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
> > > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
> > > > at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
> > > > at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
> > > > at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
> > > > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
> > > > at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
> > > > at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
> > > > at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
> > > > at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
> > > > at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
> > > > at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
> > > > at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
> > > > at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
> > > > at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
> > > > at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
> > > > at org.eclipse.jetty.server.Server.handle(Server.java:445)
> > > > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
> > > > at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
> > > > at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
> > > > at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
> > > > at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
> > > > at java.lang.Thread.run(Thread.java:724)"
> > > >
> > > > Some questions I had were:
> > > > 1) What exclusive locks does SolrCloud "make" when performing an update?
> > > > 2) Keeping in mind I do not read or write java (sorry :D), could someone help me understand "what" solr is locking in this case at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That will help me understand where to look next.
> > > > 3) It seems all threads in this state are waiting for "0x00000007216e68d8", is there a way to tell what "0x00000007216e68d8" is?
> > > > 4) Is there a limit to how many updates you can do in SolrCloud?
> > > > 5) Wild-ass-theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput?
> > > >
> > > > To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL in gzipped form:
> > > > https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
> > > >
> > > > Any help/suggestions/ideas on this issue, big or small, would be much appreciated.
> > > >
> > > > Thanks so much all!
> > > >
> > > > Tim Vaillancourt
> > > >
>
> --
> - Mark
>
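P.S. Here is the rough sketch of the split-semaphore idea I mentioned at the top. This is not the actual patch - the class and method names are made up for illustration - it just shows why giving each request type its own pool of permits keeps blocked forwards from starving the leader-to-replica updates:

import java.util.concurrent.Semaphore;

// Illustration only (hypothetical names, not Solr's real classes):
// one limiting semaphore for "forward to leader" requests and a
// separate one for "leader distributes to replicas" requests, so the
// forwards that are blocked waiting on the leader can never use up
// the permits the leader needs to push those same updates to replicas.
public class SplitUpdateLimiter {

    private final Semaphore forwardToLeaderPermits;
    private final Semaphore distribToReplicaPermits;

    public SplitUpdateLimiter(int numHosts) {
        // same rough sizing as before, but per request type
        this.forwardToLeaderPermits = new Semaphore(16 * numHosts);
        this.distribToReplicaPermits = new Semaphore(16 * numHosts);
    }

    public void submitForwardToLeader(Runnable send) throws InterruptedException {
        limit(forwardToLeaderPermits, send);
    }

    public void submitDistribToReplica(Runnable send) throws InterruptedException {
        limit(distribToReplicaPermits, send);
    }

    private static void limit(Semaphore permits, Runnable send) throws InterruptedException {
        permits.acquire();      // block if too many requests of this type are outstanding
        try {
            send.run();
        } finally {
            permits.release();
        }
    }
}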