I wrote some Python that checks CLUSTERSTATUS and reports replica status to Telegraf. Great for charts and alerts, but it only shows status, not progress.
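For anyone who wants to try the same idea, here is a minimal sketch of that kind of check. It is not the actual script. It assumes the usual CLUSTERSTATUS JSON layout (cluster -> collections -> shards -> replicas, each replica carrying a "state") and that Telegraf picks up the output through an exec-style input reading InfluxDB line protocol. The function names, the measurement name "solr_replicas" and the base URL are all made up for illustration.

```python
import json
import urllib.request
from collections import Counter


def fetch_clusterstatus(base_url):
    """Fetch CLUSTERSTATUS from the Collections API and return parsed JSON.

    base_url is assumed to look like "http://localhost:8983/solr".
    """
    url = base_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def count_replica_states(status):
    """Return {collection: Counter({state: count})} for every replica."""
    counts = {}
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        states = Counter()
        for shard in coll.get("shards", {}).values():
            for replica in shard.get("replicas", {}).values():
                states[replica.get("state", "unknown")] += 1
        counts[coll_name] = states
    return counts


def to_line_protocol(counts, measurement="solr_replicas"):
    """Format the counts as InfluxDB line protocol, one line per collection."""
    lines = []
    for coll_name in sorted(counts):
        fields = ",".join(
            f"{state}={n}i" for state, n in sorted(counts[coll_name].items())
        )
        if fields:  # skip collections that have no replicas at all
            lines.append(f"{measurement},collection={coll_name} {fields}")
    return lines


if __name__ == "__main__":
    # Hypothetical local node; adjust for the cluster being monitored.
    status = fetch_clusterstatus("http://localhost:8983/solr")
    print("\n".join(to_line_protocol(count_replica_states(status))))
```

Like the original, this only reports the current state of each replica (active, recovering, down, ...). It says nothing about how far along a recovery is.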

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 7, 2020, at 7:58 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> I was wondering about using metrics myself. I confess I didn’t look to see
> what was already there either ;)
>
> Actually, using metrics might be easiest all told, but I also confess I have
> no clue what it takes to build a new metric in. Nor how to use the same (?)
> collection process for the 5 situations I outlined, and those just off the
> top of my head.
>
> It’s particularly frustrating when diagnosing these not knowing whether the
> “recovering” state is going to resolve itself sometime or not. I’ve seen Solr
> replicas stuck in that state forever….
>
> Andrzej could certainly shed some light on that question.
>
> All ideas welcome of course!
>
>> On Feb 7, 2020, at 10:40 AM, Jan Høydahl <jan....@cominvent.com> wrote:
>>
>> Could we expose some high level recovery info as part of metrics api? Then
>> people could track number of cores recovering, recovery time, recovery
>> phase, number of recoveries failed etc, and also build alerts on top of that.
>>
>> Jan Høydahl
>>
>>> On Feb 6, 2020, at 19:42, Erick Erickson <erickerick...@gmail.com> wrote:
>>>
>>> There’s actually a crying need for this, but there’s nothing that’s there
>>> yet, basically you have to look at the log files and try to figure it out.
>>>
>>> Actually I think this would be a great thing to work on, but it’d be pretty
>>> much all new. If you’d like, you can create a Solr Improvement Proposal
>>> here: https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to
>>> flesh out what this would look like.
>>>
>>> A couple of thoughts off the top of my head:
>>>
>>> I really think what would be most useful would be a collections API
>>> command, something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS.
>>> Currently a replica can be stuck in recovery and never get out. There are
>>> several scenarios that’d have to be considered:
>>>
>>> 1> normal startup. The replica briefly goes from down->recovering->active
>>> which should be quite brief.
>>> 1a> Waiting for a leader to be elected before continuing
>>>
>>> 2> “peer sync” where another replica is replaying documents from the tlog.
>>>
>>> 3> situations where the replica is replaying documents from its own tlog.
>>> This can be very, very, very long too.
>>>
>>> 4> full sync where it’s copying the entire index from a leader.
>>>
>>> 5> knickers in a knot, it’s given up even trying to recover.
>>>
>>> In either case, you’d want to report “all ok” if nothing was in recovery,
>>> “just the ones having trouble” and “everything because I want to look”.
>>>
>>> But like I said, there’s nothing really built into the system to accomplish
>>> this now that I know of.
>>>
>>> Best,
>>> Erick
>>>
>>>> On Feb 6, 2020, at 12:15 PM, dj-manning <derek.mann...@superna.net> wrote:
>>>>
>>>> Erick Erickson wrote
>>>>> When you say “look”, where are you looking from? Http requests? SolrJ? The
>>>>> admin UI?
>>>>
>>>> I'm open to looking from anywhere - http request, or the admin UI, or
>>>> following a log if possible.
>>>>
>>>> My objective for this ask would be to human interactively follow/watch
>>>> solr's recovery progress - if that's even possible.
>>>>
>>>> Stretch goal would be to autonomously report on recovery progress.
>>>>
>>>> The question stems from seeing recovery in log or the admin UI, then
>>>> wondering what progress is.
>>>>
>>>> Appreciation.
>>>>
>>>> --
>>>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html