If there's anything I can try, let me know. Interestingly, I've noticed that if I stop my indexer, run the delete, and then restart the indexer, everything is fine. That fits the update-thread-contention theory.
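The thread-dump sampling that Alex recommends further down the thread can be scripted. Here is a minimal Python sketch, assuming `jstack` is on the PATH and the Solr JVM's PID is known; the PID, counts, and intervals shown are placeholders to adapt:

```python
import subprocess
import time

def collect_dumps(cmd, count, interval_secs):
    """Run `cmd` (e.g. ["jstack", "<pid>"]) `count` times, pausing
    `interval_secs` between runs, and return the captured outputs."""
    dumps = []
    for _ in range(count):
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        dumps.append(out)
        time.sleep(interval_secs)
    return dumps

# Step 2 of Alex's procedure (hypothetical PID): 3 dumps, 2 minutes apart.
# dumps = collect_dumps(["jstack", "12345"], count=3, interval_secs=120)
```

Running this on both servers at once (step 1's "as simultaneous as you can") gives comparable before/after snapshots to diff for stuck threads.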
On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:
> This is what I see:
>
> We currently limit the number of outstanding update requests at one time
> to avoid a crazy number of threads being used.
>
> It looks like a bunch of update requests are stuck in socket reads and are
> taking up the available threads. It looks like the deletes are hanging out
> waiting for a free thread.
>
> It seems the question is: why are the requests stuck in socket reads? I
> don't have an answer at the moment.
>
> We should probably get this into a JIRA issue, though.
>
> - Mark
>
>
> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > It does not look like a deadlock, though it could be a distributed one.
> > Or it could be a livelock, though that's less likely.
> >
> > Here is what we used to recommend in similar situations for large Java
> > systems (BEA WebLogic):
> > 1) Take a thread dump of both systems before anything else, as
> > simultaneously as you can manage.
> > 2) Do the first delete. Take a thread dump every 2 minutes on both
> > servers (so, say, 3 dumps during that 5-minute wait).
> > 3) Do the second delete and take thread dumps every 30 seconds on both
> > servers, starting just before it and continuing until the problem shows
> > itself. Every 5 seconds if the problem shows itself really quickly.
> >
> > That gives you a LOT of thread dumps. But it also gives you something
> > that lets you compare thread state before and after the problem starts
> > showing itself, and identify moving (or unnaturally still) threads. I
> > even wrote a tool a long time ago that parsed those thread dumps
> > automatically and generated pretty deadlock graphs from them.
> >
> > Regards,
> >    Alex.
> >
> > Personal blog: http://blog.outerthoughts.com/
> > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > - Time is the quality of nature that keeps events from happening all at
> > once.
> > Lately, it doesn't seem to be working. (Anonymous, via the GTD book)
> >
> >
> > On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> Thanks Brett, good stuff (though not a good problem).
> >>
> >> We definitely need to look into this.
> >>
> >> - Mark
> >>
> >> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
> >>
> >>> Here is a dump after the delete, with indexing stopped:
> >>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
> >>>
> >>> An interesting hint that I forgot to mention: it doesn't always happen
> >>> on the first delete. I manually ran the delete cron, and the server
> >>> continued to work. I waited about 5 minutes, ran it again, and it
> >>> stalled the indexer (as seen from the indexer process):
> >>> http://i.imgur.com/1Tt35u0.png
> >>>
> >>> Another thing I forgot to mention. To bring the cluster back to life I:
> >>>
> >>> 1) stop my indexer
> >>> 2) stop server1, start server1
> >>> 3) stop server2, start server2
> >>> 4) manually rebalance half of the shards to be mastered on server2
> >>>    (unload/create on server1)
> >>> 5) restart the indexer
> >>>
> >>> And it works again until a delete eventually kills it.
> >>>
> >>> To be clear again, select queries continue to work indefinitely.
> >>>
> >>> Thanks,
> >>> Brett
> >>>
> >>>
> >>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>
> >>>> Which version of Solr?
> >>>>
> >>>> Can you use jconsole, visualvm, or jstack to get some stack traces
> >>>> and see where things are halting?
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
> >>>>
> >>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
> >>>>> replication factor of 2) that I've been using for over a month now
> >>>>> in production.
> >>>>>
> >>>>> Suddenly, the hourly cron I run that dispatches a delete-by-query
> >>>>> completely halts all indexing.
> >>>>> Select queries still run (and quickly), and there is no CPU or disk
> >>>>> I/O happening, but suddenly my indexer (which runs at ~400 docs/sec
> >>>>> steady) pauses, and everything blocks indefinitely.
> >>>>>
> >>>>> To clarify the schema a bit: this is a moving window of data (imagine
> >>>>> messages that don't matter after a 24-hour period) which is regularly
> >>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old)
> >>>>> to keep the index size reasonable.
> >>>>>
> >>>>> There are no errors (log level warn) in the logs. I'm not sure what
> >>>>> to look into. As I've said, this has been running (delete included)
> >>>>> for about a month.
> >>>>>
> >>>>> I'll also note that I have another cluster much like this one where
> >>>>> I do the very same thing... it has 4 machines and indexes 10x the
> >>>>> documents per second, with more indexes... and yet I delete on a
> >>>>> cron without issue.
> >>>>>
> >>>>> Any ideas on where to start, or other information I could provide?
> >>>>>
> >>>>> Thanks much.
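For reference, the hourly "chop" described above is typically done with a Solr delete-by-query using date math, so the 24-hour cutoff is computed by Solr itself. A minimal Python sketch; the field name `created_at` and the update URL are hypothetical placeholders, not taken from the thread:

```python
import urllib.request

def build_delete_body(date_field, cutoff="NOW-24HOURS"):
    """Build a Solr XML delete-by-query body removing every document
    whose date_field value is older than the Solr date-math cutoff."""
    return "<delete><query>%s:[* TO %s]</query></delete>" % (date_field, cutoff)

def send_delete(update_url, body):
    """POST the delete to Solr's update handler and commit in one request."""
    req = urllib.request.Request(
        update_url + "?commit=true",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    return urllib.request.urlopen(req).read()

# An hourly cron could then call, e.g. (hypothetical host/collection):
# send_delete("http://localhost:8983/solr/collection1/update",
#             build_delete_body("created_at"))
```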