I wrote some Python that checks CLUSTERSTATUS and reports replica status to 
Telegraf. Great for charts and alerts, but it only shows status, not progress.
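
A minimal sketch of this kind of check, not the script referred to above. It 
assumes Solr is reachable at http://localhost:8983/solr and that Telegraf 
runs the script through an exec-style input that reads InfluxDB line protocol 
from stdout. The measurement name "solr_replica_state" and all function names 
are made up for illustration.

```python
import json
import urllib.request
from collections import Counter

# Assumed Solr base URL; adjust for your cluster.
SOLR_URL = "http://localhost:8983/solr"


def fetch_clusterstatus(base_url=SOLR_URL):
    """Call the Collections API CLUSTERSTATUS action and return parsed JSON."""
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def replica_states(status):
    """Yield (collection, shard, replica, state) for every replica."""
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                yield (coll_name, shard_name, replica_name,
                       replica.get("state", "unknown"))


def escape_tag(value):
    """Escape the characters that are special in line-protocol tag values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(status, measurement="solr_replica_state"):
    """Count replicas per (collection, state) and format one line per pair."""
    counts = Counter((coll, state)
                     for coll, _shard, _replica, state in replica_states(status))
    return [
        f"{measurement},collection={escape_tag(coll)},"
        f"state={escape_tag(state)} count={n}i"
        for (coll, state), n in sorted(counts.items())
    ]


def main():
    # Print one line per (collection, state) pair for Telegraf to ingest.
    for line in to_line_protocol(fetch_clusterstatus()):
        print(line)
```

Calling main() at the bottom of the script prints lines such as 
"solr_replica_state,collection=c1,state=recovering count=1i". Like the 
original, this reports only the state of each replica, not how far along a 
recovery is.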

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Feb 7, 2020, at 7:58 AM, Erick Erickson <erickerick...@gmail.com> wrote:
> 
> I was wondering about using metrics myself. I confess I didn’t look to see 
> what was already there either ;)
> 
> Actually, using metrics might be easiest all told, but I also confess I have 
> no clue what it takes to build a new metric in. Nor how to use the same (?) 
> collection process for the 5 situations I outlined, and those just off the 
> top of my head.
> 
> It’s particularly frustrating when diagnosing these not knowing whether the 
> “recovering” state is going to resolve itself sometime or not. I’ve seen Solr 
> replicas stuck in that state forever….
> 
> Andrzej could certainly shed some light on that question.
> 
> All ideas welcome of course!
> 
>> On Feb 7, 2020, at 10:40 AM, Jan Høydahl <jan....@cominvent.com> wrote:
>> 
>> Could we expose some high level recovery info as part of metrics api? Then 
>> people could track number of cores recovering, recovery time, recovery 
>> phase, number of recoveries failed etc, and also build alerts on top of that.
>> 
>> Jan Høydahl
>> 
>>> 6. feb. 2020 kl. 19:42 skrev Erick Erickson <erickerick...@gmail.com>:
>>> 
>>> There’s actually a crying need for this, but nothing is there yet. 
>>> Basically, you have to look at the log files and try to figure it out. 
>>> 
>>> Actually I think this would be a great thing to work on, but it’d be pretty 
>>> much all new. If you’d like, you can create a Solr Improvement Proposal 
>>> here: https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to 
>>> flesh out what this would look like.
>>> 
>>> A couple of thoughts off the top of my head:
>>> 
>>> I really think what would be most useful would be a collections API 
>>> command, something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS. 
>>> Currently a replica can be stuck in recovery and never get out. There are 
>>> several scenarios that’d have to be considered:
>>> 
>>> 1> normal startup. The replica goes from down->recovering->active, 
>>> which should be quite brief. 
>>> 1a> Waiting for a leader to be elected before continuing
>>> 
>>> 2> “peer sync” where another replica is replaying documents from the tlog.
>>> 
>>> 3> situations where the replica is replaying documents from its own tlog. 
>>> This can be very, very, very long too.
>>> 
>>> 4> full sync where it’s copying the entire index from a leader.
>>> 
>>> 5> knickers in a knot, it’s given up even trying to recover.
>>> 
>>> In any case, you’d want to report “all ok” if nothing was in recovery, 
>>> “just the ones having trouble” and “everything because I want to look”.
>>> 
>>> But like I said, there’s nothing really built into the system to accomplish 
>>> this now that I know of.
>>> 
>>> Best,
>>> Erick
>>> 
>>>> On Feb 6, 2020, at 12:15 PM, dj-manning <derek.mann...@superna.net> wrote:
>>>> 
>>>> Erick Erickson wrote
>>>>> When you say “look”, where are you looking from? Http requests? SolrJ? The
>>>>> admin UI?
>>>> 
>>>> I'm open to looking from anywhere - http request, or the admin UI, or
>>>> following a log if possible. 
>>>> 
>>>> My objective for this ask would be for a human to interactively
>>>> follow/watch solr's recovery progress - if that's even possible.
>>>> 
>>>> Stretch goal would be to autonomously report on recovery progress.
>>>> 
>>>> The question stems from seeing recovery in the log or the admin UI, then
>>>> wondering what the progress is.
>>>> 
>>>> Appreciation.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>>> 
> 
