Cool, useful info. As soon as I can duplicate the issue I'll work out what we need to do differently for this case.
- Mark

On Mar 7, 2013, at 10:19 AM, Brett Hoerner <[email protected]> wrote:

> As an update to this, I did my SolrCloud dance and made it 2xJVMs per
> machine (2 machines still, the same ones) and spread the load around. Each
> Solr instance now has 16 total shards (master for 8, replica for 8).
>
> *drum roll* ... I can repeatedly run my delete script and nothing breaks. :)
>
> On Thu, Mar 7, 2013 at 11:03 AM, Brett Hoerner <[email protected]> wrote:
>
>> Here is the other server when it's locked:
>> https://gist.github.com/3529b7b6415756ead413
>>
>> To be clear, neither is really "the replica": I have 32 shards, and each
>> physical server is the leader for 16 and the replica for 16.
>>
>> Also, related to the max-threads hunch: my working cluster has many, many
>> fewer shards per Solr instance. I'm going to do some migration dancing on
>> this cluster today to have more Solr JVMs, each with fewer cores, and see
>> how it affects the deletes.
>>
>> On Wed, Mar 6, 2013 at 5:40 PM, Mark Miller <[email protected]> wrote:
>>
>>> Any chance you can grab the stack trace of a replica as well? (Also when
>>> it's locked up, of course.)
>>>
>>> - Mark
>>>
>>> On Mar 6, 2013, at 3:34 PM, Brett Hoerner <[email protected]> wrote:
>>>
>>>> If there's anything I can try, let me know. Interestingly, I think I
>>>> have noticed that if I stop my indexer, do my delete, and restart the
>>>> indexer, then I'm fine. Which goes along with the update-thread
>>>> contention theory.
>>>>
>>>> On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <[email protected]> wrote:
>>>>
>>>>> This is what I see:
>>>>>
>>>>> We currently limit the number of outstanding update requests at one
>>>>> time to avoid a crazy number of threads being used.
>>>>>
>>>>> It looks like a bunch of update requests are stuck in socket reads and
>>>>> are taking up the available threads, and the deletes are hanging out
>>>>> waiting for a free thread.
>>>>>
>>>>> The question, then, is why the requests are stuck in socket reads. I
>>>>> don't have an answer at the moment.
>>>>>
>>>>> We should probably get this into a JIRA issue, though.
>>>>>
>>>>> - Mark
>>>>>
>>>>> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> It does not look like a deadlock, though it could be a distributed
>>>>>> one. Or it could be a livelock, though that's less likely.
>>>>>>
>>>>>> Here is what we used to recommend in similar situations for large
>>>>>> Java systems (BEA WebLogic):
>>>>>> 1) Do a thread dump of both systems before anything, as simultaneously
>>>>>> as you can manage.
>>>>>> 2) Do the first delete, then do a thread dump every 2 minutes on both
>>>>>> servers (so, say, 3 dumps in that 5-minute wait).
>>>>>> 3) Do the second delete and take thread dumps every 30 seconds on both
>>>>>> servers, from just before and then during, preferably all the way
>>>>>> until the problem shows itself. Every 5 seconds if the problem shows
>>>>>> itself really quickly.
>>>>>>
>>>>>> That gives you a LOT of thread dumps, but it also gives you something
>>>>>> that lets you compare thread state before and after the problem starts
>>>>>> showing itself, and identify moving (or unnaturally still) threads. I
>>>>>> even wrote a tool a long time ago that parsed those thread dumps
>>>>>> automatically and generated pretty deadlock graphs from them.
>>>>>>
>>>>>> Regards,
>>>>>>    Alex.
>>>>>>
>>>>>> Personal blog: http://blog.outerthoughts.com/
>>>>>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
>>>>>> - Time is the quality of nature that keeps events from happening all
>>>>>> at once. Lately, it doesn't seem to be working. (Anonymous - via GTD
>>>>>> book)
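For anyone who wants to script the dump-collection routine Alex describes, a
minimal sketch follows, assuming the JDK's jstack tool is on the PATH and you
know the Solr JVM's pid. The class name and the interval/count arguments are
illustrative only, not anything from this thread:

    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Capture a jstack thread dump of the given pid every intervalMs
    // milliseconds, count times, writing each dump to its own file so
    // before/after states can be diffed later.
    public class DumpCollector {
        public static void main(String[] args) throws Exception {
            String pid = args[0];                      // Solr JVM pid
            long intervalMs = Long.parseLong(args[1]); // e.g. 120000 before the problem, 30000 during
            int count = Integer.parseInt(args[2]);
            for (int i = 0; i < count; i++) {
                Path out = Paths.get("jstack-" + pid + "-" + System.currentTimeMillis() + ".txt");
                new ProcessBuilder("jstack", pid)
                        .redirectErrorStream(true)      // fold stderr into the dump file
                        .redirectOutput(out.toFile())
                        .start()
                        .waitFor();
                Thread.sleep(intervalMs);
            }
        }
    }

Run it once per server, as close to simultaneously as you can, per step 1 of
the routine above.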
>>>>>> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks Brett, good stuff (though not a good problem).
>>>>>>>
>>>>>>> We definitely need to look into this.
>>>>>>>
>>>>>>> - Mark
>>>>>>>
>>>>>>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <[email protected]> wrote:
>>>>>>>
>>>>>>>> Here is a dump after the delete; indexing has been stopped:
>>>>>>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
>>>>>>>>
>>>>>>>> An interesting hint that I forgot to mention: it doesn't always
>>>>>>>> happen on the first delete. I manually ran the delete cron, and the
>>>>>>>> server continued to work. I waited about 5 minutes and ran it again,
>>>>>>>> and it stalled the indexer (as seen from the indexer process):
>>>>>>>> http://i.imgur.com/1Tt35u0.png
>>>>>>>>
>>>>>>>> Another thing I forgot to mention: to bring the cluster back to life, I:
>>>>>>>>
>>>>>>>> 1) stop my indexer
>>>>>>>> 2) stop server1, start server1
>>>>>>>> 3) stop server2, start server2
>>>>>>>> 4) manually rebalance half of the shards to be mastered on server2
>>>>>>>> (unload/create on server1)
>>>>>>>> 5) restart the indexer
>>>>>>>>
>>>>>>>> And it works again until a delete eventually kills it.
>>>>>>>>
>>>>>>>> To be clear again, select queries continue to work indefinitely.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Brett
>>>>>>>>
>>>>>>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Which version of Solr?
>>>>>>>>>
>>>>>>>>> Can you use jconsole, visualvm, or jstack to get some stack traces
>>>>>>>>> and see where things are halting?
>>>>>>>>>
>>>>>>>>> - Mark
>>>>>>>>>
>>>>>>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32
>>>>>>>>>> shards, replication factor of 2) that I've been using for over a
>>>>>>>>>> month now in production.
>>>>>>>>>>
>>>>>>>>>> Suddenly, the hourly cron I run that dispatches a delete-by-query
>>>>>>>>>> completely halts all indexing. Select queries still run (and
>>>>>>>>>> quickly), there is no CPU or disk I/O happening, but suddenly my
>>>>>>>>>> indexer (which runs at ~400 docs/sec steady) pauses, and everything
>>>>>>>>>> blocks indefinitely.
>>>>>>>>>>
>>>>>>>>>> To clarify the schema a bit: this is a moving window of data
>>>>>>>>>> (imagine messages that don't matter after a 24-hour period) that is
>>>>>>>>>> regularly "chopped" off by my hourly cron (deleting messages over
>>>>>>>>>> 24 hours old) to keep the index size reasonable.
>>>>>>>>>>
>>>>>>>>>> There are no errors (log level warn) in the logs, and I'm not sure
>>>>>>>>>> what to look into. As I've said, this has been running (delete
>>>>>>>>>> included) for about a month.
>>>>>>>>>>
>>>>>>>>>> I'll also note that I have another cluster much like this one where
>>>>>>>>>> I do the very same thing... it has 4 machines and indexes 10x the
>>>>>>>>>> documents per second, with more indexes... and yet I delete on a
>>>>>>>>>> cron without issue...
>>>>>>>>>> Any ideas on where to start, or other information I could provide?
>>>>>>>>>>
>>>>>>>>>> Thanks much.
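For reference, a minimal sketch of the kind of hourly "chop" described at the
bottom of this thread, using the SolrJ 4.x client. The collection URL and the
"timestamp" field are assumptions for illustration; the thread doesn't give
the actual schema or endpoints:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;

    // Delete every document older than 24 hours, then commit so the
    // deletes become visible. Intended to be run hourly from cron.
    public class HourlyChop {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            solr.deleteByQuery("timestamp:[* TO NOW-24HOURS]");
            solr.commit();
            solr.shutdown();
        }
    }

The same delete issued as a plain HTTP POST to the /update handler would
exercise the same distributed update path discussed above.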
