Tim, Take a look at http://lucene.472066.n3.nabble.com/updating-docs-in-solr-cloud-hangs-td4067388.html and https://issues.apache.org/jira/browse/SOLR-4816. I had the same issue that you're reporting for a while then I applied the patch from SOLR-4816 to my clients and the problems went away. If you don't feel like applying the patch it looks like it should be included in the release of version 4.5. Also note that the problem happens more frequently when the replication factor is greater than 1.
Thanks, Greg -----Original Message----- From: Tim Vaillancourt [mailto:t...@elementspace.com] Sent: Tuesday, September 03, 2013 6:31 PM To: solr-user@lucene.apache.org Subject: SolrCloud 4.x hangs under high update volume Hey guys, I am looking into an issue we've been having with SolrCloud since the beginning of our testing, all the way from 4.1 to 4.3 (haven't tested 4.4.0 yet). I've noticed other users with this same issue, so I'd really like to get to the bottom of it. Under a very, very high rate of updates (2000+/sec), after 1-12 hours we see stalled transactions that snowball to consume all Jetty threads in the JVM. This eventually causes the JVM to hang with most threads waiting on the condition/stack provided at the bottom of this message. At this point SolrCloud instances then start to see their neighbors (who also have all threads hung) as down w/"Connection Refused", and the shards become "down" in state. Sometimes a node or two survives and just returns 503s "no server hosting shard" errors. As a workaround/experiment, we have tuned the number of threads sending updates to Solr, as well as the batch size (we batch updates from client -> solr), and the Soft/Hard autoCommits, all to no avail. Turning off Client-to-Solr batching (1 update = 1 call to Solr), which also did not help. Certain combinations of update threads and batch sizes seem to mask/help the problem, but not resolve it entirely. Our current environment is the following: - 3 x Solr 4.3.1 instances in Jetty 9 w/Java 7. - 3 x Zookeeper instances, external Java 7 JVM. - 1 collection, 3 shards, 2 replicas (each node is a leader of 1 shard and a replica of 1 shard). - Log4j 1.2 for Solr logs, set to WARN. This log has no movement on a good day. - 5000 max jetty threads (well above what we use when we are healthy), Linux-user threads ulimit is 6000. - Occurs under Jetty 8 or 9 (many versions). - Occurs under Java 1.6 or 1.7 (several minor versions). - Occurs under several JVM tunings. - Everything seems to point to Solr itself, and not a Jetty or Java version (I hope I'm wrong). The stack trace that is holding up all my Jetty QTP threads is the following, which seems to be waiting on a lock that I would very much like to understand further: "java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007216e68d8> (a java.util.concurrent.Semaphore$NonfairSync) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303) at java.util.concurrent.Semaphore.acquire(Semaphore.java:317) at org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61) at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:418) at org.apache.solr.update.SolrCmdDistributor.submit(SolrCmdDistributor.java:368) at org.apache.solr.update.SolrCmdDistributor.flushAdds(SolrCmdDistributor.java:300) at org.apache.solr.update.SolrCmdDistributor.finish(SolrCmdDistributor.java:96) at org.apache.solr.update.processor.DistributedUpdateProcessor.doFinish(DistributedUpdateProcessor.java:462) at org.apache.solr.update.processor.DistributedUpdateProcessor.finish(DistributedUpdateProcessor.java:1178) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:83) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1820) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1486) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:503) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:138) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:564) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:213) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1096) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:432) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:175) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1030) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:136) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:201) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:109) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:445) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:268) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:229) at org.eclipse.jetty.io.AbstractConnection$ReadCallback.run(AbstractConnection.java:358) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:601) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:532) at java.lang.Thread.run(Thread.java:724)" Some questions I had were: 1) What exclusive locks does SolrCloud "make" when performing an update? 2) Keeping in mind I do not read or write java (sorry :D), could someone help me understand "what" solr is locking in this case at "org.apache.solr.util.AdjustableSemaphore.acquire(AdjustableSemaphore.java:61)" when performing an update? That will help me understand where to look next. 3) It seems all threads in this state are waiting for "0x00000007216e68d8", is there a way to tell what "0x00000007216e68d8" is? 4) Is there a limit to how many updates you can do in SolrCloud? 5) Wild-ass-theory: would more shards provide more locks (whatever they are) on update, and thus more update throughput? To those interested, I've provided a stacktrace of 1 of 3 nodes at this URL in gzipped form: https://s3.amazonaws.com/timvaillancourt.com/tmp/solr-jstack-2013-08-23.gz Any help/suggestions/ideas on this issue, big or small, would be much appreciated. Thanks so much all! Tim Vaillancourt