Thanks Brett, good stuff (though not a good problem).

We def need to look into this. 

- Mark
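
For anyone reading a dump like the one linked below, a quick first pass is to count threads per state and list the blocked ones. This is only a minimal sketch: it assumes the usual jstack text layout (a quoted thread name on the header line, followed by a "java.lang.Thread.State:" line), and the thread names in SAMPLE_DUMP are made up for illustration, not taken from Brett's gist.

```python
import re
from collections import Counter

# Matches the state line jstack prints under each thread header, e.g.
#    java.lang.Thread.State: WAITING (parking)
STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State: (\w+)", re.MULTILINE)

# Hypothetical sample; a real dump would come from `jstack <pid>`.
SAMPLE_DUMP = '''\
"qtp-1" #21 prio=5 tid=0x01 nid=0x11 waiting on condition [0x0a]
   java.lang.Thread.State: WAITING (parking)
\tat sun.misc.Unsafe.park(Native Method)

"qtp-2" #22 prio=5 tid=0x02 nid=0x12 runnable [0x0b]
   java.lang.Thread.State: RUNNABLE

"qtp-3" #23 prio=5 tid=0x03 nid=0x13 waiting for monitor entry [0x0c]
   java.lang.Thread.State: BLOCKED (on object monitor)
'''


def summarize_thread_states(dump_text):
    """Count threads per JVM thread state in a jstack-style dump."""
    return Counter(STATE_RE.findall(dump_text))


def threads_in_state(dump_text, state):
    """Return the names of threads whose state line equals `state`."""
    names = []
    current = None
    for line in dump_text.splitlines():
        if line.startswith('"'):
            # Thread header: the name is the first quoted string.
            current = line.split('"')[1]
        elif current is not None:
            match = STATE_RE.match(line)
            if match:
                if match.group(1) == state:
                    names.append(current)
                # Only the first state line after a header belongs to it.
                current = None
    return names


if __name__ == "__main__":
    print(dict(summarize_thread_states(SAMPLE_DUMP)))
    print(threads_in_state(SAMPLE_DUMP, "BLOCKED"))
```

If most update threads turn out to be WAITING or BLOCKED on the same lock, that is usually the place to start reading the full stacks.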

On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:

> Here is a dump after the delete, indexing has been stopped:
> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
> 
> An interesting hint that I forgot to mention: it doesn't always happen on
> the first delete. I manually ran the delete cron, and the server continued
> to work. I waited about 5 minutes and ran it again and it stalled the
> indexer (as seen from indexer process): http://i.imgur.com/1Tt35u0.png
> 
> Another thing I forgot to mention. To bring the cluster back to life I:
> 
> 1) stop my indexer
> 2) stop server1, start server1
> 3) stop server2, start server2
> 4) manually rebalance half of the shards to be mastered on server2
> (unload/create on server1)
> 5) restart indexer
> 
> And it works again until a delete eventually kills it.
> 
> To be clear again, select queries continue to work indefinitely.
> 
> Thanks,
> Brett
> 
> 
> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
> 
>> Which version of Solr?
>> 
>> Can you use jconsole, visualvm, or jstack to get some stack traces and see
>> where things are halting?
>> 
>> - Mark
>> 
>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>> 
>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>>> replication factor of 2) that I've been using for over a month now in
>>> production.
>>> 
>>> Suddenly, the hourly cron I run that dispatches a delete by query
>>> completely halts all indexing. Select queries still run (and quickly),
>>> there is no CPU or disk I/O happening, but suddenly my indexer (which runs
>>> at ~400 doc/sec steady) pauses, and everything blocks indefinitely.
>>> 
>>> To clarify some on the schema, this is a moving window of data (imagine
>>> messages that don't matter after a 24 hour period) which are regularly
>>> "chopped" off by my hourly cron (deleting messages over 24 hours old) to
>>> keep the index size reasonable.
>>> 
>>> There are no errors (log level warn) in the logs. I'm not sure what to look
>>> into. As I've said this has been running (delete included) for about a
>>> month.
>>> 
>>> I'll also note that I have another cluster much like this one where I do
>>> the very same thing... it has 4 machines, and indexes 10x the documents per
>>> second, with more indexes... and yet I delete on a cron without issue...
>>> 
>>> Any ideas on where to start, or other information I could provide?
>>> 
>>> Thanks much.
>> 
>> 
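
For context on the hourly cron described at the start of the thread, a moving-window delete by query might be built roughly like this. It is a sketch only: the thread never shows the actual query, so the field name "created_at" and the helper name are hypothetical. The range syntax with NOW-24HOURS is standard Solr date math, and the JSON body would typically be POSTed to the collection's /update handler with Content-Type application/json.

```python
import json


def build_delete_by_age(field, hours):
    """Build a Solr JSON update body deleting docs older than `hours`.

    `field` is assumed to be a date field in the schema (hypothetical here).
    """
    if hours <= 0:
        raise ValueError("hours must be positive")
    # Solr date math: NOW-24HOURS is 24 hours before the current time,
    # so [* TO NOW-24HOURS] selects everything at least that old.
    query = "%s:[* TO NOW-%dHOURS]" % (field, hours)
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    # "created_at" is an assumed field name, not taken from the thread.
    print(build_delete_by_age("created_at", 24))
```

Building the body separately from sending it makes it easy to log exactly what the cron dispatched when a delete coincides with a stall.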
