Any chance you can grab the stack trace of a replica as well? (also when it's locked up of course).
- Mark

On Mar 6, 2013, at 3:34 PM, Brett Hoerner <br...@bretthoerner.com> wrote:

> If there's anything I can try, let me know. Interestingly, I think I have
> noticed that if I stop my indexer, do my delete, and restart the indexer,
> then I'm fine. Which goes along with the update thread contention theory.
>
>
> On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:
>
>> This is what I see:
>>
>> We currently limit the number of outstanding update requests at one time
>> to avoid a crazy number of threads being used.
>>
>> It looks like a bunch of update requests are stuck in socket reads and are
>> taking up the available threads. It looks like the deletes are hanging out
>> waiting for a free thread.
>>
>> It seems the question is, why are the requests stuck in socket reads. I
>> don't have an answer at the moment.
>>
>> We should probably get this into a JIRA issue though.
>>
>> - Mark
>>
>>
>> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com> wrote:
>>
>>> It does not look like a deadlock, though it could be a distributed one.
>>> Or it could be a livelock, though that's less likely.
>>>
>>> Here is what we used to recommend in similar situations for large Java
>>> systems (BEA WebLogic):
>>> 1) Do a thread dump of both systems before anything. As simultaneous as
>>> you can make it.
>>> 2) Do the first delete. Do a thread dump every 2 minutes on both servers
>>> (so, say, 3 dumps in that 5 minute wait).
>>> 3) Do the second delete and do thread dumps every 30 seconds on both
>>> servers, from just before and then during. Preferably all the way until
>>> the problem shows itself. Every 5 seconds if the problem shows itself
>>> really quickly.
>>>
>>> That gives you a LOT of thread dumps. But it also gives you something
>>> that lets you compare thread state before and after the problem starts
>>> showing itself, and identify moving (or unnaturally still) threads. I
>>> even wrote a tool a long time ago that parsed those thread dumps
>>> automatically and generated pretty deadlock graphs from them.
>>>
>>> Regards,
>>>    Alex.
>>>
>>> Personal blog: http://blog.outerthoughts.com/
>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>> - Time is the quality of nature that keeps events from happening all at
>>> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
>>>
>>>
>>> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>
>>>> Thanks Brett, good stuff (though not a good problem).
>>>>
>>>> We definitely need to look into this.
>>>>
>>>> - Mark
>>>>
>>>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>>
>>>>> Here is a dump after the delete, indexing has been stopped:
>>>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>>>>>
>>>>> An interesting hint that I forgot to mention: it doesn't always happen
>>>>> on the first delete. I manually ran the delete cron, and the server
>>>>> continued to work. I waited about 5 minutes and ran it again, and it
>>>>> stalled the indexer (as seen from the indexer process):
>>>>> http://i.imgur.com/1Tt35u0.png
>>>>>
>>>>> Another thing I forgot to mention. To bring the cluster back to life I:
>>>>>
>>>>> 1) stop my indexer
>>>>> 2) stop server1, start server1
>>>>> 3) stop server2, start server2
>>>>> 4) manually rebalance half of the shards to be mastered on server2
>>>>>    (unload/create on server1)
>>>>> 5) restart indexer
>>>>>
>>>>> And it works again until a delete eventually kills it.
>>>>>
>>>>> To be clear again, select queries continue to work indefinitely.
>>>>>
>>>>> Thanks,
>>>>> Brett
>>>>>
>>>>>
>>>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
>>>>>
>>>>>> Which version of Solr?
>>>>>>
>>>>>> Can you use jconsole, visualvm, or jstack to get some stack traces and
>>>>>> see where things are halting?
>>>>>>
>>>>>> - Mark
>>>>>>
>>>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
>>>>>>
>>>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
>>>>>>> replication factor of 2) that I've been using for over a month now in
>>>>>>> production.
>>>>>>>
>>>>>>> Suddenly, the hourly cron I run that dispatches a delete by query
>>>>>>> completely halts all indexing. Select queries still run (and quickly),
>>>>>>> there is no CPU or disk I/O happening, but suddenly my indexer (which
>>>>>>> runs at ~400 docs/sec steady) pauses, and everything blocks
>>>>>>> indefinitely.
>>>>>>>
>>>>>>> To clarify the schema a bit: this is a moving window of data (imagine
>>>>>>> messages that don't matter after a 24-hour period) which is regularly
>>>>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old)
>>>>>>> to keep the index size reasonable.
>>>>>>>
>>>>>>> There are no errors (log level warn) in the logs. I'm not sure what to
>>>>>>> look into. As I've said, this has been running (delete included) for
>>>>>>> about a month.
>>>>>>>
>>>>>>> I'll also note that I have another cluster much like this one where I
>>>>>>> do the very same thing... it has 4 machines, and indexes 10x the
>>>>>>> documents per second, with more indexes... and yet I delete on a cron
>>>>>>> without issue...
>>>>>>>
>>>>>>> Any ideas on where to start, or other information I could provide?
>>>>>>>
>>>>>>> Thanks much.
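A minimal sketch of the dump-collection loop Alexandre describes (and of the leader/replica stack traces Mark asks for) could look like the following, assuming jstack is on the PATH; the PID is a placeholder, and you would run one copy of this on each machine so leader and replica dumps can be lined up afterwards:

    # rough sketch, not a tested script; SOLR_PID is hypothetical
    import subprocess
    import time
    from datetime import datetime

    SOLR_PID = 12345  # local Solr JVM pid on this machine

    def dump(tag):
        stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
        out = subprocess.check_output(["jstack", "-l", str(SOLR_PID)])
        with open("solr-{0}-{1}.txt".format(tag, stamp), "wb") as f:
            f.write(out)

    dump("before")                 # 1) baseline dump before touching anything
    # 2) run the first delete, then dump every 2 minutes during the wait
    for _ in range(3):
        time.sleep(120)
        dump("after-first-delete")
    # 3) for the second delete, tighten the interval to 30 (or 5) seconds
    #    and keep collecting until the hang shows itself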
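For reference, the hourly "chop" Brett describes would be a plain delete-by-query against the update handler. A sketch of that kind of cron job is below; the collection name, field name, and URL are made up here, since the real cron is not shown in the thread:

    # hypothetical hourly delete cron; adjust URL, collection, and field
    import requests

    resp = requests.post(
        "http://localhost:8983/solr/collection1/update",
        params={"commit": "true"},
        data="<delete><query>timestamp:[* TO NOW-24HOURS]</query></delete>",
        headers={"Content-Type": "text/xml"},
    )
    resp.raise_for_status()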