If there's anything I can try, let me know. Interestingly, I think I have
noticed that if I stop my indexer, do my delete, and restart the indexer
then I'm fine. Which goes along with the update thread contention theory.


On Wed, Mar 6, 2013 at 5:03 PM, Mark Miller <markrmil...@gmail.com> wrote:

> This is what I see:
>
> We currently limit the number of outstanding update requests at one time
> to avoid a crazy number of threads being used.
>
> It looks like a bunch of update requests are stuck in socket reads and are
> taking up the available threads. It looks like the deletes are hanging out
> waiting for a free thread.
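>
> The pattern is roughly this (just a sketch of the general shape, not the
> actual Solr code): there's a cap on outstanding forwarded update requests,
> so once everything holding a slot is parked in a socket read, the next
> request blocks waiting for a slot it never gets.
>
> import java.util.concurrent.Semaphore;
>
> // Sketch only: a cap on concurrent outstanding update requests.
> public class OutstandingRequestLimiter {
>     // Hypothetical limit; the exact number isn't the point.
>     private final Semaphore slots = new Semaphore(16);
>
>     public void send(Runnable requestToReplica) throws InterruptedException {
>         slots.acquire();             // the delete waits here if all slots
>         try {                        // are held by stuck requests
>             requestToReplica.run();  // may block indefinitely in a socket read
>         } finally {
>             slots.release();
>         }
>     }
> }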
>
> The question seems to be: why are the requests stuck in socket reads? I
> don't have an answer at the moment.
>
> We should probably get this into a JIRA issue though.
>
> - Mark
>
>
> On Mar 6, 2013, at 2:15 PM, Alexandre Rafalovitch <arafa...@gmail.com>
> wrote:
>
> > It does not look like a deadlock, though it could be a distributed one.
> > Or it could be a livelock, though that's less likely.
> >
> > Here is what we used to recommend in similar situations for large Java
> > systems (BEA Weblogic):
> > 1) Do a thread dump of both systems before anything else. As simultaneous
> > as you can make it.
> > 2) Do the first delete. Do a thread dump every 2 minutes on both servers
> > (so, say, 3 dumps in that 5-minute wait).
> > 3) Do the second delete and do thread dumps every 30 seconds on both
> > servers, from just before and then during. Preferably all the way until
> > the problem shows itself. Every 5 seconds if the problem shows itself
> > really quickly.
> >
> > That gives you a LOT of thread dumps. But it also gives you something that
> > lets you compare thread state from before and after the problem started
> > showing itself, and identify moving (or unnaturally still) threads. I
> > even wrote a tool a long time ago that parsed those thread dumps
> > automatically and generated pretty deadlock graphs from them.
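> >
> > If you want to script the capture, running jstack <pid> in a loop from the
> > shell is the simplest option; the same thing can also be done in-process
> > with the standard JMX thread bean. A rough sketch (nothing Solr-specific,
> > just plain JDK APIs):
> >
> > import java.lang.management.ManagementFactory;
> > import java.lang.management.ThreadInfo;
> >
> > public class PeriodicThreadDumper {
> >     public static void main(String[] args) throws InterruptedException {
> >         while (true) {
> >             System.out.println("=== dump at " + System.currentTimeMillis() + " ===");
> >             // true, true = include locked monitors and synchronizers,
> >             // which is what you want when hunting lock contention.
> >             for (ThreadInfo t : ManagementFactory.getThreadMXBean()
> >                     .dumpAllThreads(true, true)) {
> >                 System.out.print(t);  // name, state, and (truncated) stack
> >             }
> >             Thread.sleep(30000);  // the 30-second interval from step 3
> >         }
> >     }
> > }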
> >
> >
> > Regards,
> >   Alex.
> >
> >
> >
> >
> >
> > Personal blog: http://blog.outerthoughts.com/
> > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > - Time is the quality of nature that keeps events from happening all at
> > once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD book)
> >
> >
> > On Wed, Mar 6, 2013 at 5:04 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >
> >> Thanks Brett, good stuff (though not a good problem).
> >>
> >> We def need to look into this.
> >>
> >> - Mark
> >>
> >> On Mar 6, 2013, at 1:53 PM, Brett Hoerner <br...@bretthoerner.com> wrote:
> >>
> >>> Here is a dump after the delete, indexing has been stopped:
> >>> https://gist.github.com/bretthoerner/c7ea3bf3dc9e676a3f0e
> >>>
> >>> An interesting hint that I forgot to mention: it doesn't always happen
> >>> on the first delete. I manually ran the delete cron, and the server
> >>> continued to work. I waited about 5 minutes and ran it again, and it
> >>> stalled the indexer (as seen from the indexer process):
> >>> http://i.imgur.com/1Tt35u0.png
> >>>
> >>> Another thing I forgot to mention. To bring the cluster back to life I:
> >>>
> >>> 1) stop my indexer
> >>> 2) stop server1, start server1
> >>> 3) stop server2, start server2
> >>> 4) manually rebalance half of the shards to be mastered on server2
> >>> (unload/create on server1; roughly the CoreAdmin calls sketched below)
> >>> 5) restart indexer
> >>>
> >>> And it works again until a delete eventually kills it.
> >>>
> >>> To be clear again, select queries continue to work indefinitely.
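> >>>
> >>> (For step 4, the rebalance is just CoreAdmin calls against server1,
> >>> roughly along these lines for each core being moved, with the real
> >>> core/collection/shard names taken from clusterstate.json; host and port
> >>> here are placeholders:
> >>>
> >>> http://server1:8983/solr/admin/cores?action=UNLOAD&core=<core_name>
> >>> http://server1:8983/solr/admin/cores?action=CREATE&name=<core_name>&collection=<collection>&shard=<shard>
> >>>
> >>> Unloading lets the copy on server2 take over as leader, and the CREATE
> >>> brings server1 back as a plain replica.)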
> >>>
> >>> Thanks,
> >>> Brett
> >>>
> >>>
> >>> On Wed, Mar 6, 2013 at 1:50 PM, Mark Miller <markrmil...@gmail.com> wrote:
> >>>
> >>>> Which version of Solr?
> >>>>
> >>>> Can you use jconsole, visualvm, or jstack to get some stack traces and
> >>>> see where things are halting?
> >>>>
> >>>> - Mark
> >>>>
> >>>> On Mar 6, 2013, at 11:45 AM, Brett Hoerner <br...@bretthoerner.com> wrote:
> >>>>
> >>>>> I have a SolrCloud cluster (2 machines, 2 Solr instances, 32 shards,
> >>>>> replication factor of 2) that I've been using for over a month now in
> >>>>> production.
> >>>>>
> >>>>> Suddenly, the hourly cron I run that dispatches a delete by query
> >>>>> completely halts all indexing. Select queries still run (and quickly),
> >>>>> there is no CPU or disk I/O happening, but suddenly my indexer (which
> >>>>> runs at ~400 doc/sec steady) pauses, and everything blocks indefinitely.
> >>>>>
> >>>>> To clarify the schema a bit: this is a moving window of data (imagine
> >>>>> messages that don't matter after a 24-hour period) that is regularly
> >>>>> "chopped" off by my hourly cron (deleting messages over 24 hours old)
> >>>>> to keep the index size reasonable.
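> >>>>>
> >>>>> The delete itself is nothing fancy; it's roughly equivalent to this via
> >>>>> SolrJ (the field name and URL here are stand-ins for the real ones):
> >>>>>
> >>>>> import org.apache.solr.client.solrj.impl.HttpSolrServer;
> >>>>>
> >>>>> public class HourlyTrim {
> >>>>>     public static void main(String[] args) throws Exception {
> >>>>>         HttpSolrServer solr =
> >>>>>                 new HttpSolrServer("http://localhost:8983/solr/collection1");
> >>>>>         // Drop everything older than the 24-hour window; "created_at"
> >>>>>         // stands in for whatever the timestamp field is really called.
> >>>>>         solr.deleteByQuery("created_at:[* TO NOW-24HOURS]");
> >>>>>         solr.commit();
> >>>>>         solr.shutdown();
> >>>>>     }
> >>>>> }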
> >>>>>
> >>>>> There are no errors (log level warn) in the logs. I'm not sure what to
> >>>>> look into. As I've said, this has been running (delete included) for
> >>>>> about a month.
> >>>>>
> >>>>> I'll also note that I have another cluster much like this one where I
> >>>>> do the very same thing... it has 4 machines, and indexes 10x the
> >>>>> documents per second, with more indexes... and yet I delete on a cron
> >>>>> without issue...
> >>>>>
> >>>>> Any ideas on where to start, or other information I could provide?
> >>>>>
> >>>>> Thanks much.
> >>>>
> >>>>
> >>
> >>
>
>
