Thanks Brett, good stuff (though not a good problem). We def need to look into this.
- Mark

On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:

> Here is a dump after the delete, indexing has been stopped:
> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>
> An interesting hint that I forgot to mention: it doesn't always happen on
> the first delete. I manually ran the delete cron, and the server continued
> to work. I waited about 5 minutes and ran it again and it stalled the
> indexer (as seen from indexer process): http://i.imgur.com/1Tt35u0.png
>
> Another thing I forgot to mention. To bring the cluster back to life I:
>
> 1) stop my indexer
> 2) stop server1, start server1
> 3) stop server2, start server2
> 4) manually rebalance half of the shards to be mastered on server2
>    (unload/create on server1)
> 5) restart indexer
>
> And it works again until a delete eventually kills it.
>
> To be clear again, select queries continue to work indefinitely.
>
> Thanks,
> Brett
>
>
> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> Which version of Solr?
>>
>> Can you use jconsole, visualvm, or jstack to get some stack traces and see
>> where things are halting?
>>
>> - Mark
>>
>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>
>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>>> replication factor of 2) that I've been using for over a month now in
>>> production.
>>>
>>> Suddenly, the hourly cron I run that dispatches a delete by query
>>> completely halts all indexing. Select queries still run (and quickly),
>>> there is no CPU or disk I/O happening, but suddenly my indexer (which runs
>>> at ~400 doc/sec steady) pauses, and everything blocks indefinitely.
>>>
>>> To clarify some on the schema, this is a moving window of data (imagine
>>> messages that don't matter after a 24 hour period) which are regularly
>>> "chopped" off by my hourly cron (deleting messages over 24 hours old) to
>>> keep the index size reasonable.
>>>
>>> There are no errors (log level warn) in the logs. I'm not sure what to look
>>> into. As I've said this has been running (delete included) for about a
>>> month.
>>>
>>> I'll also note that I have another cluster much like this one where I do
>>> the very same thing... it has 4 machines, and indexes 10x the documents per
>>> second, with more indexes... and yet I delete on a cron without issue...
>>>
>>> Any ideas on where to start, or other information I could provide?
>>>
>>> Thanks much.
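
For anyone trying to reproduce this, the hourly "chop" described above might look roughly like the sketch below. It is only an illustration: the field name `timestamp`, the collection name `messages`, and the base URL are assumptions, not details from Brett's setup. It uses Solr's JSON delete-by-query form and date math (`NOW-24HOURS`), and it only builds the request without sending it.

```python
import json
from urllib.parse import urlencode


def build_delete_request(base_url, collection, field="timestamp", max_age_hours=24):
    """Build the URL and JSON body for a Solr delete-by-query.

    `field` is a hypothetical date field; substitute whatever the schema
    actually uses to record message time.
    """
    # Range query with Solr date math: everything older than the window.
    query = f"{field}:[* TO NOW-{max_age_hours}HOURS]"
    # Whether to commit as part of the request is a choice; shown here as an assumption.
    url = f"{base_url}/{collection}/update?" + urlencode({"commit": "true"})
    body = json.dumps({"delete": {"query": query}})
    return url, body


if __name__ == "__main__":
    # Hypothetical host and collection name, for illustration only.
    url, body = build_delete_request("http://localhost:8983/solr", "messages")
    print(url)
    print(body)
```

The body would be POSTed to the printed URL with `Content-Type: application/json`. Running it by hand a few minutes apart, as in the thread, while capturing jstack output on each node, is one way to see where indexing blocks.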