Any chance you can grab the stack trace of a replica as well? (also when it's 
locked up of course).

- Mark

On Mar 6, 2013, at 3:34 PM, Brett Hoerner <br...@bretthoerner.com> wrote:

> If there's anything I can try, let me know. Interestingly, I think I've
> noticed that if I stop my indexer, run my delete, and then restart the
> indexer, I'm fine, which fits the update thread contention theory.
> 
> 
> On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:
> 
>> This is what I see:
>> 
>> We currently limit the number of outstanding update requests at one time
>> to avoid a crazy number of threads being used.
>> 
>> It looks like a bunch of update requests are stuck in socket reads and are
>> taking up all the available threads, while the deletes are hanging out
>> waiting for a free thread.
>>
>> The question, then, is why the requests are stuck in socket reads. I don't
>> have an answer at the moment.
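>>
>> To make the theory concrete, here's a toy sketch (plain java.util.concurrent,
>> not Solr's actual code): a small fixed pool stands in for the capped update
>> executor, the latch stands in for a socket read that never returns, and a
>> queued "delete" task never gets a thread.
>>
>> import java.util.concurrent.*;
>>
>> public class PoolStarvationDemo {
>>     public static void main(String[] args) throws Exception {
>>         // cap on concurrent update requests
>>         ExecutorService updateExecutor = Executors.newFixedThreadPool(2);
>>         // stands in for a socket read that never completes
>>         CountDownLatch neverArrives = new CountDownLatch(1);
>>
>>         for (int i = 0; i < 2; i++) {
>>             updateExecutor.submit(() -> {
>>                 try {
>>                     neverArrives.await();   // "stuck in a socket read"
>>                 } catch (InterruptedException ignored) { }
>>             });
>>         }
>>
>>         // the delete waits in the queue for a free thread that never comes
>>         Future<?> delete = updateExecutor.submit(() -> System.out.println("delete ran"));
>>
>>         Thread.sleep(2000);
>>         System.out.println("delete finished? " + delete.isDone()); // prints false
>>         updateExecutor.shutdownNow(); // interrupt the blockers so the demo exits
>>     }
>> }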
>> 
>> We should probably get this into a JIRA issue though.
>> 
>> - Mark
>> 
>> 
>> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>> 
>>> It does not look like a deadlock, though it could be a distributed one. Or
>>> it could be a livelock, though that's less likely.
>>> 
>>> Here is what we used to recommend in similar situations for large Java
>>> systems (BEA WebLogic):
>>> 1) Take a thread dump of both systems before doing anything, as
>>> simultaneously as you can manage.
>>> 2) Do the first delete. Take a thread dump every 2 minutes on both servers
>>> (so, say, 3 dumps in that 5-minute wait).
>>> 3) Do the second delete and take thread dumps every 30 seconds on both
>>> servers, from just before it and then during, preferably all the way until
>>> the problem shows itself. Every 5 seconds if the problem shows itself
>>> really quickly.
>>> 
>>> That gives you a LOT of thread dumps, but it also gives you something that
>>> lets you compare thread state before and after the problem starts showing
>>> itself, and identify moving (or unnaturally still) threads. I even wrote a
>>> tool a long time ago that parsed those thread dumps automatically and
>>> generated pretty deadlock graphs from them.
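>>>
>>> If it helps, here is a rough sketch of how I'd script the periodic dumps
>>> (it assumes jstack is on the PATH; the pid, interval, and count are
>>> arguments you'd supply, e.g. the pid from jps):
>>>
>>> import java.io.File;
>>>
>>> public class DumpLoop {
>>>     public static void main(String[] args) throws Exception {
>>>         String pid = args[0];                               // Solr JVM pid
>>>         long intervalMs = Long.parseLong(args[1]) * 1000L;  // seconds between dumps
>>>         int count = Integer.parseInt(args[2]);              // how many dumps to take
>>>         for (int i = 0; i < count; i++) {
>>>             File out = new File("jstack-" + pid + "-" + System.currentTimeMillis() + ".txt");
>>>             new ProcessBuilder("jstack", "-l", pid)
>>>                     .redirectErrorStream(true)              // fold stderr into the dump file
>>>                     .redirectOutput(out)
>>>                     .start()
>>>                     .waitFor();
>>>             Thread.sleep(intervalMs);
>>>         }
>>>     }
>>> }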
>>> 
>>> 
>>> Regards,
>>>  Alex.
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Personal blog: http://blog.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all at
>>> once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
>>> 
>>> 
>>> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>> 
>>>> Thanks Brett, good stuff (though not a good problem).
>>>>
>>>> We definitely need to look into this.
>>>> 
>>>> - Mark
>>>> 
>>>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>> 
>>>>> Here is a dump after the delete, indexing has been stopped:
>>>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>>>>> 
>>>>> An interesting hint that I forgot to mention: it doesn't always happen on
>>>>> the first delete. I manually ran the delete cron, and the server continued
>>>>> to work. I waited about 5 minutes and ran it again, and it stalled the
>>>>> indexer (as seen from the indexer process): http://i.imgur.com/1Tt35u0.png
>>>>> 
>>>>> Another thing I forgot to mention. To bring the cluster back to life I:
>>>>> 
>>>>> 1) stop my indexer
>>>>> 2) stop server1, start server1
>>>>> 3) stop server2, start server2
>>>>> 4) manually rebalance half of the shards to be mastered on server2
>>>>> (unload/create on server1; rough sketch below)
>>>>> 5) restart indexer
>>>>> 
>>>>> And it works again until a delete eventually kills it.
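>>>>>
>>>>> Step 4 in SolrJ terms looks roughly like the sketch below (4.x CoreAdmin
>>>>> API assumed; the core name, shard id, and URL are placeholders rather
>>>>> than my exact setup). Unloading the leader core on server1 lets the
>>>>> server2 replica take over, then re-creating the core on server1 brings
>>>>> it back as a replica:
>>>>>
>>>>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>>>>> import org.apache.solr.client.solrj.request.CoreAdminRequest;
>>>>>
>>>>> public class Rebalance {
>>>>>     public static void main(String[] args) throws Exception {
>>>>>         HttpSolrServer server1 = new HttpSolrServer("http://server1:8983/solr");
>>>>>
>>>>>         // drop the current leader core from server1
>>>>>         CoreAdminRequest.unloadCore("collection1_shard5_replica1", server1);
>>>>>
>>>>>         // re-create it so it rejoins the shard as a replica
>>>>>         CoreAdminRequest.Create create = new CoreAdminRequest.Create();
>>>>>         create.setCoreName("collection1_shard5_replica1");
>>>>>         create.setInstanceDir("collection1_shard5_replica1");
>>>>>         create.setCollection("collection1");
>>>>>         create.setShardId("shard5");
>>>>>         create.process(server1);
>>>>>     }
>>>>> }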
>>>>> 
>>>>> To be clear again, select queries continue to work indefinitely.
>>>>> 
>>>>> Thanks,
>>>>> Brett
>>>>> 
>>>>> 
>>>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>> 
>>>>>> Which version of Solr?
>>>>>> 
>>>>>> Can you use jconsole, visualvm, or jstack to get some stack traces and
>>>>>> see where things are halting?
>>>>>> 
>>>>>> - Mark
>>>>>> 
>>>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>>>> 
>>>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>>>>>>> replication factor of 2) that I've been using for over a month now in
>>>>>>> production.
>>>>>>> 
>>>>>>> Suddenly, the hourly cron I run that dispatches a delete-by-query
>>>>>>> completely halts all indexing. Select queries still run (and quickly),
>>>>>>> and there is no CPU or disk I/O happening, but my indexer (which runs
>>>>>>> at ~400 docs/sec steady) pauses, and everything blocks indefinitely.
>>>>>>> 
>>>>>>> To clarify the schema a bit: this is a moving window of data (imagine
>>>>>>> messages that don't matter after a 24-hour period), the tail of which is
>>>>>>> regularly "chopped" off by my hourly cron (deleting messages over 24
>>>>>>> hours old) to keep the index size reasonable.
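>>>>>>>
>>>>>>> For reference, the hourly delete boils down to a single delete-by-query
>>>>>>> on a timestamp field, roughly like the SolrJ sketch below (the field
>>>>>>> name, URL, and core name are placeholders rather than my exact setup):
>>>>>>>
>>>>>>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
>>>>>>>
>>>>>>> public class HourlyTrim {
>>>>>>>     public static void main(String[] args) throws Exception {
>>>>>>>         HttpSolrServer server =
>>>>>>>                 new HttpSolrServer("http://server1:8983/solr/collection1");
>>>>>>>         // drop anything older than 24 hours
>>>>>>>         server.deleteByQuery("created_at:[* TO NOW-24HOURS]");
>>>>>>>         server.commit();
>>>>>>>         server.shutdown();
>>>>>>>     }
>>>>>>> }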
>>>>>>> 
>>>>>>> There are no errors (log level warn) in the logs. I'm not sure what to
>>>>>>> look into. As I've said, this has been running (deletes included) for
>>>>>>> about a month.
>>>>>>> 
>>>>>>> I'll also note that I have another cluster much like this one where I do
>>>>>>> the very same thing... it has 4 machines, indexes 10x the documents per
>>>>>>> second, and has more indexes... and yet I run the delete on a cron
>>>>>>> without issue...
>>>>>>> 
>>>>>>> Any ideas on where to start, or other information I could provide?
>>>>>>> 
>>>>>>> Thanks much.
>>>>>> 
>>>>>> 
>>>> 
>>>> 
>> 
>> 
