Here is a dump taken after the delete, once indexing had stopped: https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
An interesting hint that I forgot to mention: it doesn't always happen on the first delete. I manually ran the delete cron, and the server continued to work. I waited about 5 minutes and ran it again, and it stalled the indexer (as seen from the indexer process): http://i.imgur.com/1Tt35u0.png

Another thing I forgot to mention. To bring the cluster back to life I:

1) stop my indexer
2) stop server1, start server1
3) stop server2, start server2
4) manually rebalance half of the shards to be mastered on server2 (unload/create on server1)
5) restart indexer

And it works again until a delete eventually kills it.

To be clear again, select queries continue to work indefinitely.

Thanks,
Brett

On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Which version of Solr?
>
> Can you use jconsole, visualvm, or jstack to get some stack traces and see
> where things are halting?
>
> - Mark
>
> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>
> > I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
> > replication factor of 2) that I've been using for over a month now in
> > production.
> >
> > Suddenly, the hourly cron I run that dispatches a delete by query
> > completely halts all indexing. Select queries still run (and quickly),
> > there is no CPU or disk I/O happening, but suddenly my indexer (which
> > runs at ~400 doc/sec steady) pauses, and everything blocks indefinitely.
> >
> > To clarify some on the schema, this is a moving window of data (imagine
> > messages that don't matter after a 24 hour period) which are regularly
> > "chopped" off by my hourly cron (deleting messages over 24 hours old) to
> > keep the index size reasonable.
> >
> > There are no errors (log level warn) in the logs. I'm not sure what to
> > look into. As I've said this has been running (delete included) for about
> > a month.
> >
> > I'll also note that I have another cluster much like this one where I do
> > the very same thing... it has 4 machines, and indexes 10x the documents
> > per second, with more indexes... and yet I delete on a cron without
> > issue...
> >
> > Any ideas on where to start, or other information I could provide?
> >
> > Thanks much.
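
For reference, the hourly "moving window" delete described in the quoted message might look roughly like the sketch below. This is not the actual cron from the thread: the collection name `messages`, the date field `timestamp`, and the base URL are all assumptions made for illustration. It only builds the delete-by-query request for Solr's JSON update handler. Sending it is left as a commented-out line, and the delete would still need a commit (explicit or via autoCommit) to become visible.

```python
import json
import urllib.request


def build_delete_request(base_url, collection, field="timestamp", max_age="24HOURS"):
    """Build (but do not send) a Solr delete-by-query request.

    `collection` and `field` are hypothetical names; substitute the real
    collection and date field from your own schema.
    """
    # Solr date math: match everything older than NOW minus max_age.
    query = f"{field}:[* TO NOW-{max_age}]"
    body = json.dumps({"delete": {"query": query}}).encode("utf-8")
    url = f"{base_url}/{collection}/update"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed local Solr URL, for illustration only.
    req = build_delete_request("http://localhost:8983/solr", "messages")
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it:
    # urllib.request.urlopen(req, timeout=30)
```

Keeping the request construction separate from the send makes it easy to log the exact query the cron issued just before a stall, which can then be lined up against the stack traces Mark asked for.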