Here is the other server's stack trace when it's locked up:
https://gist.github.com/3529b7b6415756ead413

To be clear, neither server is really "the replica": I have 32 shards, and each
physical server is the leader for 16 and the replica for the other 16.

Also, related to the max-threads hunch: my working cluster has far fewer
shards per Solr instance. I'm going to do some migration dancing on this
cluster today so that I have more Solr JVMs, each with fewer cores, and see
how that affects the deletes.
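Concretely, the unload/create part of that dance is just CoreAdmin calls:
UNLOAD the core from the JVM it lives on now and CREATE it under the target
JVM. A rough sketch of those calls (all host, core, collection, and shard
names here are made up, and depending on your Solr version CREATE may also
want instanceDir/config params):

import urllib.parse
import urllib.request

def core_admin(host, params):
    # Hit the CoreAdmin handler and return the raw response body.
    url = "http://%s/solr/admin/cores?%s" % (host, urllib.parse.urlencode(params))
    with urllib.request.urlopen(url) as resp:
        return resp.read()

# Drop the core from the JVM that currently hosts it (index files stay on
# disk unless you also pass deleteIndex=true).
core_admin("server1:8983", {"action": "UNLOAD",
                            "core": "mycollection_shard5_replica1"})

# Re-create it under the new JVM; SolrCloud assigns it to the named shard
# and it should recover its data from the current leader.
core_admin("server2:8984", {"action": "CREATE",
                            "name": "mycollection_shard5_replica2",
                            "collection": "mycollection",
                            "shard": "shard5"})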


On Wed, Mar 6, 2013 at 5:40 PM, Mark Miller <markrmil...@gmail.com> wrote:

> Any chance you can grab the stack trace of a replica as well? (also when
> it's locked up of course).
>
> - Mark
>
> On Mar 6, 2013, at 3:34 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
>
> > If there's anything I can try, let me know. Interestingly, I think I've
> > noticed that if I stop my indexer, do my delete, and then restart the
> > indexer, I'm fine, which goes along with the update thread contention theory.
> >
> >
> > On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> This is what I see:
> >>
> >> We currently limit the number of outstanding update requests at one time
> >> to avoid a crazy number of threads being used.
> >>
> >> It looks like a bunch of update requests are stuck in socket reads and are
> >> taking up the available threads. It looks like the deletes are hanging out
> >> waiting for a free thread.
> >>
> >> It seems the question is: why are the requests stuck in socket reads? I
> >> don't have an answer at the moment.
> >>
> >> We should probably get this into a JIRA issue though.
> >>
> >> - Mark
> >>
> >>
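As a quick sanity check on that theory against any of the dumps above (or new
ones), something like this counts how many threads are parked in a blocking
socket read; it only assumes plain jstack-style output where thread stanzas
are separated by blank lines:

import sys

# Read a thread dump from stdin, e.g.:  jstack <solr-pid> | python count_reads.py
dump = sys.stdin.read()
stanzas = dump.split("\n\n")

# A thread blocked in a socket read has the JDK frame
# java.net.SocketInputStream.socketRead somewhere in its stack.
stuck = [s for s in stanzas if "SocketInputStream.socketRead" in s]
print("threads in a blocking socket read: %d" % len(stuck))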
> >> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> >> wrote:
> >>
> >>> It does not look like a deadlock, though it could be a distributed one. Or
> >>> it could be a livelock, though that's less likely.
> >>>
> >>> Here is what we used to recommend in similar situations for large Java
> >>> systems (BEA WebLogic):
> >>> 1) Do a thread dump of both systems before anything, as simultaneously as
> >>> you can make it.
> >>> 2) Do the first delete, then do a thread dump every 2 minutes on both
> >>> servers (so, say, 3 dumps in that 5-minute wait).
> >>> 3) Do the second delete and do thread dumps every 30 seconds on both
> >>> servers, from just before and then during, preferably all the way until
> >>> the problem shows itself. Every 5 seconds if the problem shows itself
> >>> really quickly.
> >>>
> >>> That gives you a LOT of thread dumps. But it also gives you something that
> >>> allows you to compare thread state before and after the problem starts
> >>> showing itself, and to identify moving (or unnaturally still) threads. I
> >>> even wrote a tool a long time ago that parsed those thread dumps
> >>> automatically and generated pretty deadlock graphs from them.
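A minimal script version of that schedule, in case it saves someone some
typing: it just shells out to jstack on a timer and writes timestamped dump
files. The pid and interval are made up, and since jstack has to run on the
same box as the JVM, you'd run one copy per server:

import datetime
import subprocess
import time

SOLR_PID = 12345   # hypothetical Solr JVM pid on this box
INTERVAL = 30      # seconds; drop to 5 once the problem starts showing itself

while True:
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    # jstack must run as the same user as the target JVM.
    out = subprocess.run(["jstack", str(SOLR_PID)],
                         capture_output=True, text=True)
    with open("solr-%s.tdump" % stamp, "w") as f:
        f.write(out.stdout)
    time.sleep(INTERVAL)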
> >>>
> >>>
> >>> Regards,
> >>>  Alex.
> >>>
> >>> Personal blog: http://blog.outerthoughts.com/
> >>> LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> >>> - Time is the quality of nature that keeps events from happening all at
> >>> once. Lately, it doesn't seem to be working. (Anonymous - via GTD book)
> >>>
> >>>
> >>> On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>
> >>>> Thanks Brett, good stuff (though not a good problem).
> >>>>
> >>>> We definitely need to look into this.
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
> >>>>
> >>>>> Here is a dump after the delete, indexing has been stopped:
> >>>>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
> >>>>>
> >>>>> An interesting hint that I forgot to mention: it doesn't always happen on
> >>>>> the first delete. I manually ran the delete cron, and the server continued
> >>>>> to work. I waited about 5 minutes and ran it again, and it stalled the
> >>>>> indexer (as seen from the indexer process): http://i.imgur.com/1Tt35u0.png
> >>>>>
> >>>>> Another thing I forgot to mention: to bring the cluster back to life, I:
> >>>>>
> >>>>> 1) stop my indexer
> >>>>> 2) stop server1, start server1
> >>>>> 3) stop server2, start server2
> >>>>> 4) manually rebalance half of the shards to be mastered on server2
> >>>>> (unload/create on server1)
> >>>>> 5) restart indexer
> >>>>>
> >>>>> And it works again until a delete eventually kills it.
> >>>>>
> >>>>> To be clear again, select queries continue to work indefinitely.
> >>>>>
> >>>>> Thanks,
> >>>>> Brett
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>>>
> >>>>>> Which version of Solr?
> >>>>>>
> >>>>>> Can you use jconsole, visualvm, or jstack to get some stack traces and
> >>>>>> see where things are halting?
> >>>>>>
> >>>>>> - Mark
> >>>>>>
> >>>>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
> >>>>>>
> >>>>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
> >>>>>>> replication factor of 2) that I've been using for over a month now in
> >>>>>>> production.
> >>>>>>>
> >>>>>>> Suddenly, the hourly cron I run that dispatches a delete-by-query
> >>>>>>> completely halts all indexing. Select queries still run (and quickly),
> >>>>>>> and there is no CPU or disk I/O happening, but my indexer (which runs
> >>>>>>> at ~400 docs/sec steady) pauses, and everything blocks indefinitely.
> >>>>>>>
> >>>>>>> To clarify the schema a bit: this is a moving window of data (imagine
> >>>>>>> messages that don't matter after a 24-hour period) that is regularly
> >>>>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old) to
> >>>>>>> keep the index size reasonable.
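For concreteness, that hourly chop is just a delete-by-query against a
timestamp field. A sketch of the request (the host, collection, and
created_at field name are made up; the XML delete syntax is standard Solr):

import urllib.request

# Delete everything older than 24 hours and commit in the same request.
url = "http://server1:8983/solr/mycollection/update?commit=true"
body = b"<delete><query>created_at:[* TO NOW-24HOURS]</query></delete>"
req = urllib.request.Request(url, data=body,
                             headers={"Content-Type": "text/xml"})
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))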
> >>>>>>>
> >>>>>>> There are no errors (at log level WARN) in the logs, and I'm not sure
> >>>>>>> what to look into. As I've said, this has been running (delete included)
> >>>>>>> for about a month.
> >>>>>>>
> >>>>>>> I'll also note that I have another cluster much like this one where I do
> >>>>>>> the very same thing... it has 4 machines, and indexes 10x the documents
> >>>>>>> per second, with more indexes... and yet I delete on a cron without
> >>>>>>> issue...
> >>>>>>>
> >>>>>>> Any ideas on where to start, or other information I could provide?
> >>>>>>>
> >>>>>>> Thanks much.
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>
> >>
>
>
