I was wondering about using metrics myself. I confess I didn’t look to see what 
was already there either ;)

Actually, using metrics might be easiest all told, but I also confess I have no 
clue what it takes to build a new metric in, or how to use the same (?) 
collection process for the 5 situations I outlined, and those were just off the 
top of my head.

When diagnosing these, it’s particularly frustrating not to know whether the 
“recovering” state is ever going to resolve itself. I’ve seen Solr replicas 
stuck in that state forever…
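
As a stopgap for the “stuck forever” case, here’s a rough sketch that polls 
the Collections API CLUSTERSTATUS action and flags replicas that have been 
“recovering” longer than some threshold. Assumptions on my part: the Solr base 
URL, the threshold, and the helper names (update_first_seen, stuck) are all 
made up for illustration, and nothing here is built into Solr. It only measures 
how long a replica has reported “recovering”; it can’t tell you which recovery 
phase the replica is in.

```python
import json
import urllib.request

# Assumption: Solr is reachable at this base URL; adjust for your cluster.
SOLR_URL = "http://localhost:8983/solr"


def fetch_cluster_status(base_url=SOLR_URL):
    """Call the Collections API CLUSTERSTATUS action and return parsed JSON."""
    url = base_url + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def replicas_not_active(status):
    """Return (collection, shard, replica, state) for every non-active replica."""
    found = []
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in collections.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                state = replica.get("state")
                if state != "active":
                    found.append((coll_name, shard_name, replica_name, state))
    return found


def update_first_seen(first_seen, not_active, now):
    """Hypothetical helper: remember when each replica was first seen recovering."""
    recovering = {(c, s, r) for c, s, r, state in not_active
                  if state == "recovering"}
    # Forget replicas that are no longer recovering.
    for key in list(first_seen):
        if key not in recovering:
            del first_seen[key]
    # Record the first observation time for newly recovering replicas.
    for key in recovering:
        first_seen.setdefault(key, now)
    return first_seen


def stuck(first_seen, now, threshold_seconds):
    """Hypothetical helper: replicas seen recovering for at least the threshold."""
    return sorted(key for key, since in first_seen.items()
                  if now - since >= threshold_seconds)
```

To use it, you’d call fetch_cluster_status() every so often, feed the result 
through replicas_not_active() and update_first_seen() with time.time(), and 
alert on whatever stuck() returns. An empty list from replicas_not_active() is 
the “all ok” report.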

Andrzej could certainly shed some light on that question.

All ideas welcome of course!

> On Feb 7, 2020, at 10:40 AM, Jan Høydahl <jan....@cominvent.com> wrote:
> 
> Could we expose some high-level recovery info as part of the Metrics API? Then 
> people could track number of cores recovering, recovery time, recovery phase, 
> number of failed recoveries, etc., and also build alerts on top of that.
> 
> Jan Høydahl
> 
>> On Feb 6, 2020, at 19:42, Erick Erickson <erickerick...@gmail.com> wrote:
>> 
>> There’s actually a crying need for this, but nothing exists yet; basically 
>> you have to look at the log files and try to figure it out.
>> 
>> Actually I think this would be a great thing to work on, but it’d be pretty 
>> much all new. If you’d like, you can create a Solr Improvement Proposal 
>> here: https://cwiki.apache.org/confluence/display/SOLR/SIP+Template to flesh 
>> out what this would look like.
>> 
>> A couple of thoughts off the top of my head:
>> 
>> I really think what would be most useful would be a collections API command, 
>> something like “RECOVERYSTATUS”, or maybe extend CLUSTERSTATUS. Currently a 
>> replica can be stuck in recovery and never get out. There are several 
>> scenarios that’d have to be considered:
>> 
>> 1> normal startup. The replica goes from down->recovering->active, which 
>> should be quite brief.
>> 1a> Waiting for a leader to be elected before continuing
>> 
>> 2> “peer sync” where another replica is replaying documents from the tlog.
>> 
>> 3> situations where the replica is replaying documents from its own tlog. 
>> This can be very, very, very long too.
>> 
>> 4> full sync where it’s copying the entire index from a leader.
>> 
>> 5> knickers in a knot, it’s given up even trying to recover.
>> 
>> In any case, you’d want to be able to report “all ok” if nothing was in 
>> recovery, “just the ones having trouble”, and “everything because I want to 
>> look”.
>> 
>> But like I said, there’s nothing really built into the system to accomplish 
>> this now that I know of.
>> 
>> Best,
>> Erick
>> 
>>> On Feb 6, 2020, at 12:15 PM, dj-manning <derek.mann...@superna.net> wrote:
>>> 
>>> Erick Erickson wrote
>>>> When you say “look”, where are you looking from? Http requests? SolrJ? The
>>>> admin UI?
>>> 
>>> I'm open to looking from anywhere - HTTP request, or the admin UI, or
>>> following a log if possible.
>>> 
>>> My objective for this ask would be for a human to interactively follow/watch
>>> Solr's recovery progress - if that's even possible.
>>> 
>>> Stretch goal would be to autonomously report on recovery progress.
>>> 
>>> The question stems from seeing recovery in the log or the admin UI, then
>>> wondering what the progress is.
>>> 
>>> Appreciation.
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>> 