Here is a dump taken after the delete, once indexing had stopped:
https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e

An interesting hint that I forgot to mention: it doesn't always happen on
the first delete. I manually ran the delete cron, and the server continued
to work. I waited about 5 minutes and ran it again, and it stalled the
indexer (as seen from the indexer process): http://i.imgur.com/1Tt35u0.png

Another thing I forgot to mention. To bring the cluster back to life I:

1) stop my indexer
2) stop server1, start server1
3) stop server2, start server2
4) manually rebalance half of the shards to be mastered on server2
(unload/create on server1)
5) restart indexer

And it works again until a delete eventually kills it.
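
For anyone wanting to script step 4, here is a rough sketch of the
unload/create calls, assuming the standard CoreAdmin endpoint. The host,
core, collection and shard names are placeholders, not my real ones, and
the sketch only builds the URLs without sending anything:

```python
from urllib.parse import urlencode


def core_admin_url(base_url, action, **params):
    """Build a CoreAdmin URL like <base>/admin/cores?action=UNLOAD&core=..."""
    query = urlencode({"action": action, **params})
    return f"{base_url}/admin/cores?{query}"


def rebalance_urls(server1_base, core, collection, shard):
    """Return the two requests for one shard, both sent to server1.

    Unloading the core on server1 leaves the copy on server2 as leader;
    re-creating it on server1 brings it back as a plain replica.
    All names passed in are hypothetical placeholders.
    """
    unload = core_admin_url(server1_base, "UNLOAD", core=core)
    create = core_admin_url(
        server1_base, "CREATE", name=core, collection=collection, shard=shard
    )
    return [unload, create]


if __name__ == "__main__":
    for url in rebalance_urls(
        "http://server1:8983/solr", "messages_shard1_replica1", "messages", "shard1"
    ):
        print(url)
```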

To be clear again, select queries continue to work indefinitely.
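
In case the shape of the delete matters, the hourly cron does something
roughly like the sketch below. The field name `created_at`, the collection
name and the host are stand-ins for illustration, and the sketch only
builds the request (no commit is issued here):

```python
from urllib.request import Request
from xml.sax.saxutils import escape


def build_delete_request(update_url, field, max_age_hours):
    """Build a delete-by-query POST for documents older than max_age_hours.

    `field` is assumed to be a date field; the name used below is a
    placeholder, not the real schema.
    """
    # Solr date math: everything up to (now - N hours).
    query = f"{field}:[* TO NOW-{max_age_hours}HOURS]"
    body = f"<delete><query>{escape(query)}</query></delete>"
    return Request(
        update_url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_delete_request(
        "http://server1:8983/solr/messages/update", "created_at", 24
    )
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
    # To actually send it: urllib.request.urlopen(req)
```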

Thanks,
Brett


On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Which version of Solr?
>
> Can you use jconsole, visualvm, or jstack to get some stack traces and see
> where things are halting?
>
> - Mark
>
> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>
> > I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
> > replication factor of 2) that I've been using for over a month now in
> > production.
> >
> > Suddenly, the hourly cron I run that dispatches a delete by query
> > completely halts all indexing. Select queries still run (and quickly),
> > there is no CPU or disk I/O happening, but suddenly my indexer (which runs
> > at ~400 doc/sec steady) pauses, and everything blocks indefinitely.
> >
> > To clarify some on the schema, this is a moving window of data (imagine
> > messages that don't matter after a 24 hour period) which are regularly
> > "chopped" off by my hourly cron (deleting messages over 24 hours old) to
> > keep the index size reasonable.
> >
> > There are no errors (log level warn) in the logs. I'm not sure what to look
> > into. As I've said this has been running (delete included) for about a
> > month.
> >
> > I'll also note that I have another cluster much like this one where I do
> > the very same thing... it has 4 machines, and indexes 10x the documents per
> > second, with more indexes... and yet I delete on a cron without issue...
> >
> > Any ideas on where to start, or other information I could provide?
> >
> > Thanks much.
>
>
