It does not look like a deadlock, though it could be a distributed one. Or it could be a livelock, though that's less likely.
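For what it's worth, a plain single-JVM deadlock the JVM can report directly; a minimal sketch is below (the class name is mine, and it checks the JVM it runs in, though the same ThreadMXBean is reachable remotely over JMX). A distributed deadlock between the two nodes won't show up this way, because each JVM just looks blocked on network I/O, which is why the dump comparison described next matters. jstack also prints the same "Found one Java-level deadlock" report at the bottom of its output when it finds one.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Minimal check for a deadlock *within one JVM*: threads stuck on
// monitors or java.util.concurrent locks. A distributed deadlock
// (server1 waiting on server2 and vice versa) is invisible here.
public class DeadlockCheck {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock
        if (ids == null) {
            System.out.println("No JVM-level deadlock detected.");
            return;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids, Integer.MAX_VALUE)) {
            System.out.println(info); // name, state, lock owner, stack
        }
    }
}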
Here is what we used to recommend in similar situations for large Java
systems (BEA WebLogic):

1) Take a thread dump of both systems before doing anything else, as
simultaneously as you can manage.
2) Run the first delete, taking a thread dump every 2 minutes on both
servers (so, say, 3 dumps during that 5-minute wait).
3) Run the second delete and take thread dumps every 30 seconds on both
servers, starting just before the delete and continuing until the problem
shows itself. Drop to every 5 seconds if the problem shows up really
quickly.

That gives you a LOT of thread dumps, but it also gives you something that
lets you compare thread state from before and after the problem started
showing itself, and to identify moving (or unnaturally still) threads. A
long time ago I even wrote a tool that parsed those thread dumps
automatically and generated pretty deadlock graphs from them.
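If you'd rather not sit there with jstack and a stopwatch, the capture loop
is easy to script. A rough sketch (class name, file naming, and arguments
are mine, not any standard tool) that shells out to jstack, which Mark
already suggested, against a Solr PID:

import java.io.File;

// Rough sketch: run "jstack <pid>" on a fixed interval and write each
// dump to a timestamped file, so dumps from the two servers can be
// lined up and compared afterwards. Run one copy per Solr JVM.
public class DumpCollector {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java DumpCollector <solr-pid> <interval-ms>");
            System.exit(1);
        }
        String pid = args[0];
        // 120000 for step 2 above; 30000 (or 5000) for step 3
        long intervalMs = Long.parseLong(args[1]);
        while (true) {
            File out = new File("dump-" + pid + "-" + System.currentTimeMillis() + ".txt");
            new ProcessBuilder("jstack", pid)
                    .redirectErrorStream(true) // fold stderr into the same file
                    .redirectOutput(out)
                    .start()
                    .waitFor();
            Thread.sleep(intervalMs);
        }
    }
}

Kill it once the indexer stalls, and you have a before/during/after series
from both machines to diff.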
Regards,
   Alex.
Personal blog: http://blog.outerthoughts.com/
LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
- Time is the quality of nature that keeps events from happening all at
once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)

On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
> Thanks Brett, good stuff (though not a good problem).
>
> We definitely need to look into this.
>
> - Mark
>
> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>
>> Here is a dump after the delete, with indexing stopped:
>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>>
>> An interesting hint that I forgot to mention: it doesn't always happen
>> on the first delete. I manually ran the delete cron, and the server
>> continued to work. I waited about 5 minutes and ran it again, and it
>> stalled the indexer (as seen from the indexer process):
>> http://i.imgur.com/1Tt35u0.png
>>
>> Another thing I forgot to mention: to bring the cluster back to life, I
>>
>> 1) stop my indexer
>> 2) stop server1, start server1
>> 3) stop server2, start server2
>> 4) manually rebalance half of the shards to be mastered on server2
>>    (unload/create on server1)
>> 5) restart the indexer
>>
>> And it works again until a delete eventually kills it.
>>
>> To be clear again, select queries continue to work indefinitely.
>>
>> Thanks,
>> Brett
>>
>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>
>>> Which version of Solr?
>>>
>>> Can you use jconsole, visualvm, or jstack to get some stack traces and
>>> see where things are halting?
>>>
>>> - Mark
>>>
>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>
>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>>>> replication factor of 2) that I've been using for over a month now in
>>>> production.
>>>>
>>>> Suddenly, the hourly cron I run that dispatches a delete-by-query
>>>> completely halts all indexing. Select queries still run (and quickly),
>>>> and there is no CPU or disk I/O happening, but suddenly my indexer
>>>> (which runs at a steady ~400 docs/sec) pauses, and everything blocks
>>>> indefinitely.
>>>>
>>>> To clarify the schema a bit: this is a moving window of data (imagine
>>>> messages that don't matter after a 24-hour period) that is regularly
>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old)
>>>> to keep the index size reasonable.
>>>>
>>>> There are no errors (log level warn) in the logs, and I'm not sure
>>>> what to look into. As I've said, this has been running (delete
>>>> included) for about a month.
>>>>
>>>> I'll also note that I have another cluster much like this one where I
>>>> do the very same thing... it has 4 machines, and indexes 10x the
>>>> documents per second, with more indexes... and yet I delete on a cron
>>>> without issue...
>>>>
>>>> Any ideas on where to start, or other information I could provide?
>>>>
>>>> Thanks much.
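For concreteness, the hourly "chop" Brett describes would look roughly like
this in SolrJ 4.x; the ZooKeeper hosts, collection name, and timestamp
field below are placeholders of mine, not details from his setup:

import org.apache.solr.client.solrj.impl.CloudSolrServer;

// Sketch of an hourly delete-by-query cron against SolrCloud (SolrJ 4.x).
// The zkHost string, collection name, and "timestamp" field are assumed.
public class HourlyChop {
    public static void main(String[] args) throws Exception {
        CloudSolrServer server = new CloudSolrServer("zk1:2181,zk2:2181");
        server.setDefaultCollection("messages");
        // Drop everything that fell out of the 24-hour window, then commit.
        server.deleteByQuery("timestamp:[* TO NOW-24HOURS]");
        server.commit();
        server.shutdown();
    }
}

If the stall is somewhere in the distributed fan-out of that deleteByQuery,
the paired dumps from both servers should show where each side is waiting.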