This is what I see: We currently limit the number of outstanding update requests at one time to avoid a crazy number of threads being used.
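In outline, that limit behaves like a fixed pool of permits. A minimal sketch of the idea only, not Solr's actual implementation, and with a made-up limit of 16:

    import java.util.concurrent.Semaphore;

    public class BoundedUpdateSender {
        // Fixed number of update requests allowed in flight at once (16 is made up).
        private final Semaphore permits = new Semaphore(16);

        public void send(Runnable updateRequest) throws InterruptedException {
            permits.acquire();          // blocks once every permit is already held
            try {
                updateRequest.run();    // if this hangs on a socket read, the permit stays held
            } finally {
                permits.release();
            }
        }
    }

If enough in-flight requests hang inside run(), every later call, including the deletes, blocks in acquire().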
It looks like a bunch of update requests are stuck in socket reads and are taking up the available threads. It looks like the deletes are hanging out waiting for a free thread. The question, then, is why the requests are stuck in socket reads. I don't have an answer at the moment. We should probably get this into a JIRA issue though.

- Mark

On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> It does not look like a deadlock, though it could be a distributed one. Or it could be a livelock, though that's less likely.
>
> Here is what we used to recommend in similar situations for large Java systems (BEA WebLogic):
> 1) Do a thread dump of both systems before anything, as simultaneous as you can make it.
> 2) Do the first delete. Do a thread dump every 2 minutes on both servers (so, say, 3 dumps in that 5-minute wait).
> 3) Do the second delete and do thread dumps every 30 seconds on both servers, from just before and then during, preferably all the way until the problem shows itself. Every 5 seconds if the problem shows itself really quickly.
>
> That gives you a LOT of thread dumps. But it also gives you something that allows you to compare thread state before and after the problem starts showing itself, and to identify moving (or unnaturally still) threads. I even wrote a tool a long time ago that parsed those thread dumps automatically and generated pretty deadlock graphs of them.
>
> Regards,
>    Alex.
>
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
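For the periodic dumps in steps 1-3 above, a small standalone helper along these lines could be used. This is a hypothetical sketch, not something from the thread: it just shells out to the JDK's jstack against the Solr JVM's pid, and the arguments and file naming are made up.

    import java.io.File;
    import java.io.IOException;

    // Capture a jstack dump of a given JVM every N milliseconds.
    // Assumed usage: java PeriodicJstack <solr-pid> <interval-ms>
    public class PeriodicJstack {
        public static void main(String[] args) throws IOException, InterruptedException {
            String pid = args[0];                        // pid of the Solr JVM
            long intervalMs = Long.parseLong(args[1]);   // e.g. 30000 for step 3
            for (int n = 0; ; n++) {
                File out = new File("threads-" + pid + "-" + n + ".txt");
                new ProcessBuilder("jstack", pid)        // assumes jstack is on the PATH
                        .redirectErrorStream(true)
                        .redirectOutput(out)             // one numbered file per dump
                        .start()
                        .waitFor();
                Thread.sleep(intervalMs);
            }
        }
    }

Running one copy per server, pointed at each Solr pid with the interval adjusted per step, gives the set of dumps to compare.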
> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Thanks, Brett, good stuff (though not a good problem).
>>
>> We def need to look into this.
>>
>> - Mark
>>
>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>
>>> Here is a dump after the delete; indexing has been stopped:
>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>>>
>>> An interesting hint that I forgot to mention: it doesn't always happen on the first delete. I manually ran the delete cron, and the server continued to work. I waited about 5 minutes and ran it again, and it stalled the indexer (as seen from the indexer process): http://i.imgur.com/1Tt35u0.png
>>>
>>> Another thing I forgot to mention. To bring the cluster back to life I:
>>>
>>> 1) stop my indexer
>>> 2) stop server1, start server1
>>> 3) stop server2, start server2
>>> 4) manually rebalance half of the shards to be mastered on server2 (unload/create on server1)
>>> 5) restart the indexer
>>>
>>> And it works again until a delete eventually kills it.
>>>
>>> To be clear again, select queries continue to work indefinitely.
>>>
>>> Thanks,
>>> Brett
>>>
>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>
>>>> Which version of Solr?
>>>>
>>>> Can you use jconsole, visualvm, or jstack to get some stack traces and see where things are halting?
>>>>
>>>> - Mark
>>>>
>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>>
>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards, replication factor of 2) that I've been using for over a month now in production.
>>>>>
>>>>> Suddenly, the hourly cron I run that dispatches a delete by query completely halts all indexing. Select queries still run (and quickly), there is no CPU or disk I/O happening, but suddenly my indexer (which runs at ~400 doc/sec steady) pauses, and everything blocks indefinitely.
>>>>>
>>>>> To clarify the schema a bit: this is a moving window of data (imagine messages that don't matter after a 24-hour period) which is regularly "chopped" off by my hourly cron (deleting messages over 24 hours old) to keep the index size reasonable.
>>>>>
>>>>> There are no errors (log level warn) in the logs. I'm not sure what to look into. As I've said, this has been running (delete included) for about a month.
>>>>>
>>>>> I'll also note that I have another cluster much like this one where I do the very same thing... it has 4 machines and indexes 10x the documents per second, with more indexes... and yet I delete on a cron without issue...
>>>>>
>>>>> Any ideas on where to start, or other information I could provide?
>>>>>
>>>>> Thanks much.
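For reference, the kind of hourly "chop" Brett describes maps to a single delete-by-query plus commit. A rough SolrJ sketch is below; the URL, core name, and date field name are assumptions, not details from his setup.

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    public class HourlyPrune {
        public static void main(String[] args) throws Exception {
            // Assumed endpoint and core name; point this at the real cluster.
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            try {
                // Assumed date field; Solr date math expresses "older than 24 hours".
                solr.deleteByQuery("timestamp:[* TO NOW-24HOURS]");
                solr.commit();
            } finally {
                solr.shutdown();
            }
        }
    }

Scheduled from cron, that one request is what stalls indexing in the scenario above, so it is the natural point to bracket with thread dumps.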