This is what I see:

We currently limit the number of outstanding update requests at one time to 
avoid a crazy number of threads being used.

It looks like a bunch of update requests are stuck in socket reads and are 
taking up all of the available threads, while the deletes are left waiting 
for a free thread.

It seems the question is: why are the requests stuck in socket reads? I don't 
have an answer at the moment.
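
To make that failure mode concrete, here is a small hypothetical sketch 
(plain JDK, not Solr's actual update executor) of how a bounded pool starves 
once every worker blocks the way those socket reads do:

    import java.util.concurrent.*;

    public class PoolStarvation {
        public static void main(String[] args) throws Exception {
            // Bounded pool, standing in for the capped update-request executor.
            ExecutorService pool = Executors.newFixedThreadPool(4);

            // Four "update" tasks that block forever, like a hung socket read.
            for (int i = 0; i < 4; i++) {
                pool.submit(() -> {
                    try {
                        new CountDownLatch(1).await(); // never counted down
                    } catch (InterruptedException ignored) {}
                });
            }

            // The "delete" is queued but never runs; no thread ever frees up.
            Future<?> delete = pool.submit(() -> System.out.println("delete ran"));
            Thread.sleep(1000);
            System.out.println("delete done? " + delete.isDone()); // prints false
            // The JVM now hangs here, much like the cluster does.
        }
    }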

We should probably get this into a JIRA issue though.

- Mark


On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> It does not look like a deadlock, though it could be a distributed one. Or
> it could be a livelock, though that's less likely.
> 
> Here is what we used to recommend in similar situations for large Java
> systems (BEA Weblogic):
> 1) Do a thread dump of both systems before anything, as simultaneously as
> you can make it.
> 2) Do the first delete. Do a thread dump every 2 minutes on both servers
> (so, say, 3 dumps in that 5-minute wait).
> 3) Do the second delete and take thread dumps every 30 seconds on both
> servers, from just before and then during. Preferably all the way until the
> problem shows itself; every 5 seconds if the problem shows itself really
> quickly.
> 
> That gives you a LOT of thread dumps. But it also gives you something that
> allows you to compare thread state before and after the problem starts
> showing itself, and to identify moving (or unnaturally still) threads. I
> even wrote a tool a long time ago that parsed those thread dumps
> automatically and generated pretty deadlock graphs from them.
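> 
> A minimal, purely illustrative sketch of that dump loop (the class name and
> interval are made up): this in-process version uses ThreadMXBean, while
> against a separate Solr JVM you would loop jstack <pid> on a schedule
> instead.
> 
>     import java.lang.management.ManagementFactory;
>     import java.lang.management.ThreadInfo;
>     import java.lang.management.ThreadMXBean;
> 
>     public class DumpLoop {
>         public static void main(String[] args) throws InterruptedException {
>             ThreadMXBean mx = ManagementFactory.getThreadMXBean();
>             while (true) {
>                 System.out.println("=== dump at " + System.currentTimeMillis() + " ===");
>                 // true/true also reports held monitors and synchronizers,
>                 // which is what you want for spotting deadlocks.
>                 for (ThreadInfo info : mx.dumpAllThreads(true, true)) {
>                     System.out.print(info);
>                 }
>                 Thread.sleep(30_000); // step 3's 30-second cadence
>             }
>         }
>     }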
> 
> 
> Regards,
>   Alex.
> 
> Personal blog: http://blog.outerthoughts.com/
> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> - Time is the quality of nature that keeps events from happening all at
> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
> 
> 
> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
> 
>> Thanks Brett, good stuff (though not a good problem).
>> 
>> We def need to look into this.
>> 
>> - Mark
>> 
>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>> 
>>> Here is a dump after the delete, indexing has been stopped:
>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>>> 
>>> An interesting hint that I forgot to mention: it doesn't always happen on
>>> the first delete. I manually ran the delete cron, and the server continued
>>> to work. I waited about 5 minutes and ran it again, and it stalled the
>>> indexer (as seen from the indexer process): http://i.imgur.com/1Tt35u0.png
>>> 
>>> Another thing I forgot to mention. To bring the cluster back to life I:
>>> 
>>> 1) stop my indexer
>>> 2) stop server1, start server1
>>> 3) stop server2, start server2
>>> 4) manually rebalance half of the shards to be mastered on server2
>>> (unload/create on server1; a sketch of this is below)
>>> 5) restart indexer
>>> 
>>> And it works again until a delete eventually kills it.
>>> 
>>> To be clear again, select queries continue to work indefinitely.
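>>> 
>>> For reference, step 4 boils down to CoreAdmin calls along these lines (the
>>> hostname, port, and core names below are placeholders, not my real setup):
>>> 
>>>     import java.io.InputStream;
>>>     import java.net.HttpURLConnection;
>>>     import java.net.URL;
>>> 
>>>     public class Rebalance {
>>>         // Fire one CoreAdmin command and echo the response.
>>>         static void coreAdmin(String host, String query) throws Exception {
>>>             URL url = new URL("http://" + host + ":8983/solr/admin/cores?" + query);
>>>             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>>             try (InputStream in = conn.getInputStream()) {
>>>                 in.transferTo(System.out);
>>>             }
>>>         }
>>> 
>>>         public static void main(String[] args) throws Exception {
>>>             // Unload the replica on server1 so server2's copy leads the shard...
>>>             coreAdmin("server1", "action=UNLOAD&core=mycoll_shard7_replica1");
>>>             // ...then re-create it on server1 as a follower of that shard.
>>>             coreAdmin("server1",
>>>                 "action=CREATE&name=mycoll_shard7_replica1&collection=mycoll&shard=shard7");
>>>         }
>>>     }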
>>> 
>>> Thanks,
>>> Brett
>>> 
>>> 
>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> 
>>>> Which version of Solr?
>>>> 
>>>> Can you use jconsole, visualvm, or jstack to get some stack traces and see
>>>> where things are halting?
>>>> 
>>>> - Mark
>>>> 
>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>> 
>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>>>>> replication factor of 2) that I've been using for over a month now in
>>>>> production.
>>>>> 
>>>>> Suddenly, the hourly cron I run that dispatches a delete-by-query
>>>>> completely halts all indexing. Select queries still run (and quickly),
>>>>> and there is no CPU or disk I/O happening, but my indexer (which runs at
>>>>> a steady ~400 docs/sec) pauses, and everything blocks indefinitely.
>>>>> 
>>>>> To clarify the schema a bit: this is a moving window of data (imagine
>>>>> messages that don't matter after a 24-hour period) that is regularly
>>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old) to
>>>>> keep the index size reasonable.
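>>>>> 
>>>>> The cron amounts to a delete-by-query like this (the collection name and
>>>>> timestamp field below are stand-ins for my real ones):
>>>>> 
>>>>>     import java.io.OutputStream;
>>>>>     import java.net.HttpURLConnection;
>>>>>     import java.net.URL;
>>>>>     import java.nio.charset.StandardCharsets;
>>>>> 
>>>>>     public class HourlyChop {
>>>>>         public static void main(String[] args) throws Exception {
>>>>>             URL url = new URL("http://server1:8983/solr/mycoll/update?commit=true");
>>>>>             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>>>>>             conn.setDoOutput(true);
>>>>>             conn.setRequestProperty("Content-Type", "text/xml");
>>>>>             // Solr date math: everything with a timestamp older than 24 hours.
>>>>>             String body = "<delete><query>timestamp:[* TO NOW-24HOURS]</query></delete>";
>>>>>             try (OutputStream out = conn.getOutputStream()) {
>>>>>                 out.write(body.getBytes(StandardCharsets.UTF_8));
>>>>>             }
>>>>>             System.out.println("HTTP " + conn.getResponseCode()); // expect 200
>>>>>         }
>>>>>     }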
>>>>> 
>>>>> There are no errors (log level warn) in the logs. I'm not sure what to
>>>>> look into. As I've said, this has been running (delete included) for
>>>>> about a month.
>>>>> 
>>>>> I'll also note that I have another cluster much like this one where I do
>>>>> the very same thing... it has 4 machines, and indexes 10x the documents
>>>>> per second, with more indexes... and yet I delete on a cron without
>>>>> issue...
>>>>> 
>>>>> Any ideas on where to start, or other information I could provide?
>>>>> 
>>>>> Thanks much.
>>>> 
>>>> 
>> 
>> 
