Very low-tech and manual, but worth mentioning...

If a particularly large core is doing a full recovery and you have
access to the disk itself, you can navigate to that core's directory
and run something like "watch -n 10 ls -lah" or "watch -n 10 du -sh ."
to see how the data transfer is going.
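
If you'd rather script it than eyeball "watch", here is a rough Python
sketch of the same idea. It is only an illustration, not anything Solr
ships: the function names (dir_size_bytes, human, watch) and the example
path in the comment are made up, and it sums apparent file sizes, so the
numbers may differ a little from what "du" reports.

```python
import os
import time


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (recursively)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A file can vanish between listing and stat while the
                # index is being written; skip it and carry on.
                pass
    return total


def human(n):
    """Format a byte count with binary units, one decimal place."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024


def watch(path, interval=10, iterations=None):
    """Print the directory size and growth rate every `interval` seconds.

    Runs until interrupted when iterations is None.
    """
    prev = dir_size_bytes(path)
    count = 0
    while iterations is None or count < iterations:
        time.sleep(interval)
        cur = dir_size_bytes(path)
        rate = (cur - prev) / interval
        print(f"{time.strftime('%H:%M:%S')}  {human(cur)}  ({rate:+.0f} B/s)")
        prev = cur
        count += 1


# Hypothetical usage; point it at the recovering core's directory:
# watch("/path/to/core/directory", interval=10)
```

A growth rate that stays at zero for a long stretch is a hint that the
copy has stalled, but it says nothing about why, so you would still need
the logs for that.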

On Fri, Feb 7, 2020 at 11:16 AM Walter Underwood <wun...@wunderwood.org> wrote:
>
> I wrote some Python that checks CLUSTERSTATUS and reports replica status to 
> Telegraf. Great for charts and alerts, but it only shows status, not progress.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Feb 7, 2020, at 7:58 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > I was wondering about using metrics myself. I confess I didn’t look to see 
> > what was already there either ;)
> >
> > Actually, using metrics might be easiest all told, but I also confess I 
> > have no clue what it takes to build a new metric in. Nor how to use the 
> > same (?) collection process for the 5 situations I outlined, and those just 
> > off the top of my head.
> >
> > It’s particularly frustrating when diagnosing these not knowing whether the 
> > “recovering” state is going to resolve itself sometime or not. I’ve seen 
> > Solr replicas stuck in that state forever….
> >
> > Andrzej could certainly shed some light on that question.
> >
> > All ideas welcome of course!
> >
> >> On Feb 7, 2020, at 10:40 AM, Jan Høydahl <jan....@cominvent.com> wrote:
> >>
> >> Could we expose some high level recovery info as part of metrics api? Then 
> >> people could track number of cores recovering, recovery time, recovery 
> >> phase, number of recoveries failed etc, and also build alerts on top of 
> >> that.
> >>
> >> Jan Høydahl
> >>
> >>> 6. feb. 2020 kl. 19:42 skrev Erick Erickson <erickerick...@gmail.com>:
> >>>
> >>> There’s actually a crying need for this, but there’s nothing there yet; 
> >>> basically you have to look at the log files and try to figure it out.
> >>>
> >>> Actually I think this would be a great thing to work on, but it’d be 
> >>> pretty much all new. If you’d like, you can create a Solr Improvement 
> >>> Proposal here: 
> >>> https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to flesh 
> >>> out what this would look like.
> >>>
> >>> A couple of thoughts off the top of my head:
> >>>
> >>> I really think what would be most useful would be a collections API 
> >>> command, something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS. 
> >>> Currently a replica can be stuck in recovery and never get out. There are 
> >>> several scenarios that’d have to be considered:
> >>>
> >>> 1> normal startup. The replica goes from down->recovering->active, 
> >>> which should be quite brief.
> >>> 1a> Waiting for a leader to be elected before continuing
> >>>
> >>> 2> “peer sync” where another replica is replaying documents from the tlog.
> >>>
> >>> 3> situations where the replica is replaying documents from its own tlog. 
> >>> This can be very, very, very long too.
> >>>
> >>> 4> full sync where it’s copying the entire index from a leader.
> >>>
> >>> 5> knickers in a knot, it’s given up even trying to recover.
> >>>
> >>> In any case, you’d want to report “all ok” if nothing was in recovery, 
> >>> “just the ones having trouble” and “everything because I want to look”.
> >>>
> >>> But like I said, there’s nothing really built into the system to 
> >>> accomplish this now that I know of.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>>> On Feb 6, 2020, at 12:15 PM, dj-manning <derek.mann...@superna.net> 
> >>>> wrote:
> >>>>
> >>>> Erick Erickson wrote
> >>>>> When you say “look”, where are you looking from? Http requests? SolrJ? 
> >>>>> The admin UI?
> >>>>
> >>>> I'm open to looking from anywhere - http request, or the admin UI, or
> >>>> following a log if possible.
> >>>>
> >>>> My objective for this ask would be for a human to interactively
> >>>> follow/watch Solr's recovery progress - if that's even possible.
> >>>>
> >>>> Stretch goal would be to autonomously report on recovery progress.
> >>>>
> >>>> The question stems from seeing recovery in the log or the admin UI, then
> >>>> wondering what the progress is.
> >>>>
> >>>> Appreciation.
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
> >>>
> >
>
