Tim,

Take a look at 
http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html
and https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue 
you're reporting for a while; then I applied the patch from SOLR-4816 to my 
clients and the problems went away. If you don't feel like applying the patch, 
it looks like the fix will be included in the 4.5 release. Also note that the 
problem happens more frequently when the replication factor is greater than 1.
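
For what it's worth, the patch only changes the SolrJ client side; my indexing 
code is basically the stock CloudSolrServer pattern. A minimal sketch (the 
ZooKeeper addresses and collection name are placeholders, not from your setup):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class IndexerSketch {
    public static void main(String[] args) throws Exception {
      // Placeholder ZK ensemble and collection name.
      CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
      server.setDefaultCollection("collection1");

      List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
      for (int i = 0; i < 100; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-" + i);
        batch.add(doc);
      }
      // With the SOLR-4816 patch applied, CloudSolrServer hashes each
      // document id and sends it straight to the correct shard leader,
      // instead of relying on the nodes to forward updates to each other.
      server.add(batch);
      server.shutdown();
    }
  }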

Thanks,
Greg

-----Original Message-----
From: Tim Vaillancourt [mailto:t...@elementspace.com] 
Sent: Tuesday, September 03, 2013 6:31 PM
To: solr-user@lucene.apache.org
Subject: SolrCloud 4.x hangs under high update volume

Hey guys,

I am looking into an issue we've been having with SolrCloud since the beginning 
of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've 
noticed other users with this same issue, so I'd really like to get to the 
bottom of it.

Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see 
stalled transactions that snowball to consume all Jetty threads in the JVM. 
This eventually causes the JVM to hang with most threads waiting on the 
condition/stack provided at the bottom of this message. At this point the 
SolrCloud instances start to see their neighbors (whose threads are also all 
hung) as down with "Connection Refused", and the shards are marked "down" in 
the cluster state. Sometimes a node or two survives and just returns 503 
"no server hosting shard" errors.

As a workaround/experiment, we have tuned the number of threads sending updates 
to Solr, as well as the batch size (we batch updates from client -> solr), and 
the soft/hard autoCommits, all to no avail. We also tried turning off 
client-to-Solr batching (1 update = 1 call to Solr), which did not help either. 
Certain combinations of update threads and batch sizes seem to mask/lessen the 
problem, but do not resolve it entirely.
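
For context, these are the kinds of solrconfig.xml commit settings we cycled 
through (illustrative values, not our exact production numbers):

  <autoCommit>
    <maxTime>15000</maxTime>          <!-- hard commit every 15s -->
    <openSearcher>false</openSearcher> <!-- flush to disk without reopening a searcher -->
  </autoCommit>
  <autoSoftCommit>
    <maxTime>1000</maxTime>           <!-- soft commit every 1s for visibility -->
  </autoSoftCommit>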

Our current environment is the following:
- 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7.
- 3 x Zookeeper instances, external Java 7 JVM.
- 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a 
replica of 1 shard).
- Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day.
- 5000 max Jetty threads (well above what we use when healthy); the Linux 
user-threads ulimit is 6000.
- Occurs under Jetty 8 or 9 (many versions).
- Occurs under Java 1.6 or 1.7 (several minor versions).
- Occurs under several JVM tunings.
- Everything seems to point to Solr itself, and not a Jetty or Java version (I 
hope I'm wrong).

The stack trace common to all of my stuck Jetty QTP threads is below; they all 
seem to be waiting on a lock that I would very much like to understand further:

"java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000007216e68d8> (a
java.util.concurrent.Semaphore$NonfairSync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
    at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
    at java.util.concurrent.Semaphore.acquire(Semaphore.java:317)
    at
org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)
    at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418)
    at
org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368)
    at
org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300)
    at
org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96)
    at
org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462)
    at
org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178)
    at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83)
    at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820)
    at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
    at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
    at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
    at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486)
    at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503)
    at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138)
    at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564)
    at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213)
    at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096)
    at
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432)
    at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175)
    at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030)
    at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136)
    at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201)
    at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109)
    at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:445)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268)
    at
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229)
    at
org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358)
    at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601)
    at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532)
    at java.lang.Thread.run(Thread.java:724)"

Some questions I had were:
1) What exclusive locks does SolrCloud "make" when performing an update?
2) Keeping in mind that I do not read or write Java (sorry :D), could someone 
help me understand "what" Solr is locking in this case at 
"org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)"
when performing an update? That would help me understand where to look next 
(see the toy sketch after this list).
3) It seems all threads in this state are waiting for "0x00000007216e68d8"; is 
there a way to tell what "0x00000007216e68d8" is?
4) Is there a limit to how many updates you can do in SolrCloud?
5) Wild-ass theory: would more shards provide more locks (whatever they are) 
on update, and thus more update throughput?
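
To make the pattern in the stack trace concrete for anyone following along, 
here is a toy sketch -- purely illustrative, NOT Solr's actual code -- of many 
threads parking on one shared Semaphore whose permits are never released:

  import java.util.concurrent.Semaphore;

  public class SemaphoreParkDemo {
    public static void main(String[] args) throws InterruptedException {
      final Semaphore permits = new Semaphore(2); // small fixed capacity
      permits.acquire(2); // hold all permits so the workers below block

      for (int i = 0; i < 4; i++) {
        new Thread(new Runnable() {
          public void run() {
            try {
              // Each worker parks here, just like the Jetty threads
              // parked in AdjustableSemaphore.acquire() in the trace.
              permits.acquire();
              permits.release();
            } catch (InterruptedException ignored) {
            }
          }
        }, "worker-" + i).start();
      }

      Thread.sleep(1000);
      // A thread dump taken now ("jstack <pid>") shows every worker as
      // WAITING (parking) on the same Semaphore$NonfairSync address.
      permits.release(2); // free the permits so the demo can exit
    }
  }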

For those interested, I've provided a stack trace from 1 of the 3 nodes, in 
gzipped form, at this URL:
https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz
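
If you want to confirm that the parked threads all share the same monitor, 
counting occurrences of the lock address in the dump is a quick check 
(assuming the filename above):

  gunzip -c solr-jstack-2013-08-23.gz | grep -c '0x00000007216e68d8'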

Any help/suggestions/ideas on this issue, big or small, would be much 
appreciated.

Thanks so much all!

Tim Vaillancourt
