Very low-tech and manual, but worth mentioning: if a particularly large core is doing a full recovery and you have access to the disk itself, you can navigate to that core's directory and run something like "watch -n 10 ls -lah" or "watch -n 10 du -sh ." to see how the data transfer is going.
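If you'd rather see a transfer rate than eyeball successive du outputs, the same idea can be sketched in Python. This is only a rough sketch: the function names (dir_size_bytes, format_progress, watch_dir) are made up for illustration, and it sums apparent file sizes, so the numbers will be close to but not identical to what du reports.

```python
import os
import time


def dir_size_bytes(path):
    """Sum the sizes of all files under path (roughly what `du -s` reports)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # Files can disappear between listing and stat during recovery.
                pass
    return total


def format_progress(prev_bytes, cur_bytes, interval_s):
    """Describe the current size and the growth rate between two samples."""
    rate_mb_s = (cur_bytes - prev_bytes) / interval_s / 1e6
    return f"{cur_bytes / 1e9:.2f} GB ({rate_mb_s:+.1f} MB/s)"


def watch_dir(path, interval_s=10, samples=None):
    """Print size and growth every interval_s seconds; samples=None loops forever."""
    prev = dir_size_bytes(path)
    taken = 0
    while samples is None or taken < samples:
        time.sleep(interval_s)
        cur = dir_size_bytes(path)
        print(format_progress(prev, cur, interval_s))
        prev = cur
        taken += 1
```

Calling watch_dir(".") from the core's directory prints one line every 10 seconds until interrupted. A rate that stays at +0.0 MB/s for a long time suggests the copy has stalled, although that is a guess from the outside and not something Solr itself reports.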
On Fri, Feb 7, 2020 at 11:16 AM Walter Underwood <wun...@wunderwood.org> wrote:
>
> I wrote some Python that checks CLUSTERSTATUS and reports replica status to
> Telegraf. Great for charts and alerts, but it only shows status, not progress.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
> > On Feb 7, 2020, at 7:58 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > I was wondering about using metrics myself. I confess I didn’t look to see
> > what was already there either ;)
> >
> > Actually, using metrics might be easiest all told, but I also confess I
> > have no clue what it takes to build a new metric in. Nor how to use the
> > same (?) collection process for the 5 situations I outlined, and those just
> > off the top of my head.
> >
> > It’s particularly frustrating when diagnosing these not knowing whether the
> > “recovering” state is going to resolve itself sometime or not. I’ve seen
> > Solr replicas stuck in that state forever….
> >
> > Andrzej could certainly shed some light on that question.
> >
> > All ideas welcome of course!
> >
> >> On Feb 7, 2020, at 10:40 AM, Jan Høydahl <jan....@cominvent.com> wrote:
> >>
> >> Could we expose some high level recovery info as part of metrics api? Then
> >> people could track number of cores recovering, recovery time, recovery
> >> phase, number of recoveries failed etc, and also build alerts on top of
> >> that.
> >>
> >> Jan Høydahl
> >>
> >>> On 6 Feb 2020, at 19:42, Erick Erickson <erickerick...@gmail.com> wrote:
> >>>
> >>> There’s actually a crying need for this, but there’s nothing that’s
> >>> there yet, basically you have to look at the log files and try to figure
> >>> it out.
> >>>
> >>> Actually I think this would be a great thing to work on, but it’d be
> >>> pretty much all new. If you’d like, you can create a Solr Improvement
> >>> Proposal here:
> >>> https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to flesh
> >>> out what this would look like.
> >>>
> >>> A couple of thoughts off the top of my head:
> >>>
> >>> I really think what would be most useful would be a collections API
> >>> command, something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS.
> >>> Currently a replica can be stuck in recovery and never get out. There are
> >>> several scenarios that’d have to be considered:
> >>>
> >>> 1> normal startup. The replica briefly goes from down->recovering->active
> >>> which should be quite brief.
> >>> 1a> Waiting for a leader to be elected before continuing
> >>>
> >>> 2> “peer sync” where another replica is replaying documents from the tlog.
> >>>
> >>> 3> situations where the replica is replaying documents from its own tlog.
> >>> This can be very, very, very long too.
> >>>
> >>> 4> full sync where it’s copying the entire index from a leader.
> >>>
> >>> 5> knickers in a knot, it’s given up even trying to recover.
> >>>
> >>> In either case, you’d want to report “all ok” if nothing was in recovery,
> >>> “just the ones having trouble” and “everything because I want to look”.
> >>>
> >>> But like I said, there’s nothing really built into the system to
> >>> accomplish this now that I know of.
> >>>
> >>> Best,
> >>> Erick
> >>>
> >>>> On Feb 6, 2020, at 12:15 PM, dj-manning <derek.mann...@superna.net>
> >>>> wrote:
> >>>>
> >>>> Erick Erickson wrote
> >>>>> When you say “look”, where are you looking from? Http requests? SolrJ?
> >>>>> The admin UI?
> >>>>
> >>>> I'm open to looking from anywhere - http request, or the admin UI, or
> >>>> following a log if possible.
> >>>>
> >>>> My objective for this ask would be to human interactively follow/watch
> >>>> solr's recovery progress - if that's even possible.
> >>>>
> >>>> Stretch goal would be to autonomously report on recovery progress.
> >>>>
> >>>> The question stems from seeing recovery in log or the admin UI, then
> >>>> wondering what progress is.
> >>>>
> >>>> Appreciation.
> >>>>
> >>>> --
> >>>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
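
P.S. On the CLUSTERSTATUS approach mentioned in the thread: a minimal Python sketch of that kind of check might look like the following. The base URL, port and function names are assumptions for illustration, not anything from Walter's actual script, and as he points out this only reports each replica's state, not how far along a recovery is.

```python
import json
from urllib.request import urlopen


def fetch_clusterstatus(base_url="http://localhost:8983/solr"):
    """Fetch CLUSTERSTATUS as a dict. base_url is an assumption; point it at your node."""
    url = f"{base_url}/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url) as resp:
        return json.load(resp)


def replicas_not_active(status):
    """Return (collection, shard, replica, state) for each replica that is not 'active'."""
    found = []
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state", "unknown")
                if state != "active":
                    found.append((coll_name, shard_name, replica_name, state))
    return found


def report(status):
    """Return 'all ok' if every replica is active, else one line per troubled replica."""
    troubled = replicas_not_active(status)
    if not troubled:
        return "all ok"
    return "\n".join(f"{c}/{s}/{r}: {state}" for c, s, r, state in troubled)
```

Running print(report(fetch_clusterstatus())) on a schedule gives the "all ok" versus "just the ones having trouble" views Erick describes. Telling a slow recovery from a stuck one still means looking at the logs or the disk.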