Re: [RFC] devlink: health: add remediation type

Jakub Kicinski Tue, 09 Mar 2021 14:58:44 -0800

On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote:
> >> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local
> >> un-harmful fix attempt.
> >> (e.g. look for lost interrupt in mlx5e_tx_reporter_timeout_recover())
> > 
> > Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall
> > detected, resumed by re-triggering processing, without reset?
> 
> Sounds good.


FWIW I ended up calling it:

+ * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered

> >> The assumption here is that a reporter's recovery function has one
> >> remedy. But it can have a few remedies and escalate between them. Did you
> >> consider a bitmask?
> > 
> > Yes, I tried to explain in the commit message. If we wanted to support
> > escalating remediations we'd also need separate counters etc. I think
> > having a health reporter per remediation should actually work fairly
> > well.  
> 
> That would require reporter's recovery procedure failure to trigger 
> health flow for other reporter.
> So we can find ourselves with 2 RX reporters, sharing the same diagnose 
> and dump callbacks, and each has other recovery flow.
> Seems a bit counterintuitive.

Let's talk about particular cases. Otherwise it's too easy to
misunderstand each other. I can't think of any practical case
where escalation makes sense.

> Maybe, per reporter, exposing a counter per each supported remedy is not 
> that bad?

It's a large change to the uAPI, and it makes vendors more likely 
to lump different problems under a single reporter (although I take
your point that it may cause over-splitting, but if we have to choose
between the two my preference is "too granular").
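
To make the trade-off concrete, here is a rough userspace sketch of the
two bookkeeping models being compared. It is not devlink code: the
struct and function names, and DLH_REMEDY_EXAMPLE_RESET, are made up
for illustration. Only DLH_REMEDY_KICK comes from this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical remedy list; only DLH_REMEDY_KICK is named in the thread. */
enum dlh_remedy {
	DLH_REMEDY_KICK,
	DLH_REMEDY_EXAMPLE_RESET,	/* invented for this sketch */
	DLH_REMEDY_EXAMPLE_MAX,
};

/* Model A: one remedy per reporter, so a single recovery counter suffices. */
struct reporter_single {
	const char *name;
	enum dlh_remedy remedy;
	unsigned long recover_count;
};

/* Model B: one reporter that escalates, so it needs a counter per remedy. */
struct reporter_multi {
	const char *name;
	unsigned long recover_count[DLH_REMEDY_EXAMPLE_MAX];
};

/* Model A: the remedy is implied by the reporter, just bump the counter. */
void single_recover(struct reporter_single *r)
{
	r->recover_count++;
}

/* Model B: the caller has to say which remedy was applied this time. */
void multi_recover(struct reporter_multi *r, enum dlh_remedy remedy)
{
	assert(remedy < DLH_REMEDY_EXAMPLE_MAX);
	r->recover_count[remedy]++;
}

/* Model B: a total is only available by summing the per-remedy counters. */
unsigned long multi_total(const struct reporter_multi *r)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < DLH_REMEDY_EXAMPLE_MAX; i++)
		sum += r->recover_count[i];
	return sum;
}
```

In model A the existing single counter per reporter keeps its meaning,
at the cost of more reporters. In model B every consumer of the counter
has to learn about the per-remedy array, which is the kind of uAPI
growth I'd rather avoid.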
