On Tue, 9 Mar 2021 16:18:58 +0200 Eran Ben Elisha wrote:
> >> DLH_REMEDY_LOCAL_FIX: associated component will undergo a local
> >> un-harmful fix attempt.
> >> (e.g look for lost interrupt in mlx5e_tx_reporter_timeout_recover())
> >
> > Should we make it more specific? Maybe DLH_REMEDY_STALL: device stall
> > detected, resumed by re-trigerring processing, without reset?
>
> Sounds good.

FWIW I ended up calling it:

+ * @DLH_REMEDY_KICK: device stalled, processing will be re-triggered

> >> The assumption here is that a reporter's recovery function has one
> >> remedy. But it can have few remedies and escalate between them. Did you
> >> consider a bitmask?
> >
> > Yes, I tried to explain in the commit message. If we wanted to support
> > escalating remediations we'd also need separate counters etc. I think
> > having a health reporter per remediation should actually work fairly
> > well.
>
> That would require reporter's recovery procedure failure to trigger
> health flow for other reporter.
> So we can find ourselves with 2 RX reporters, sharing the same diagnose
> and dump callbacks, and each has other recovery flow.
> Seems a bit counterintuitive.

Let's talk about particular cases. Otherwise it's too easy to
misunderstand each other. I can't think of any practical case where
escalation makes sense.

> Maybe, per reporter, exposing a counter per each supported remedy is not
> that bad?

It's a large change to the uAPI, and it makes vendors more likely to
lump different problems under a single reporter (although I take your
point that it may cause over-splitting, but if we have to choose
between the two my preference is "too granular").
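
A rough userspace sketch of the two layouts discussed above may help show
why a remedy bitmask pulls in per-remedy counters. This is not code from
the patch. Only DLH_REMEDY_KICK appears in this thread; DLH_REMEDY_NONE,
DLH_REMEDY_RESET, the struct names and the helper functions are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Only DLH_REMEDY_KICK is named in the thread; the rest are hypothetical. */
enum dlh_remedy {
	DLH_REMEDY_NONE = 0,	/* hypothetical placeholder */
	DLH_REMEDY_KICK,	/* device stalled, processing will be re-triggered */
	DLH_REMEDY_RESET,	/* hypothetical heavier remedy */
	DLH_REMEDY_MAX,
};

/* Layout A: one remedy per reporter, so one recovery counter is enough. */
struct reporter_single {
	enum dlh_remedy remedy;
	uint64_t recover_count;
};

/* Layout B: a bitmask of remedies, which needs a counter per remedy. */
struct reporter_multi {
	uint32_t remedy_mask;
	uint64_t recover_count[DLH_REMEDY_MAX];
};

/* Record a recovery for a single-remedy reporter. */
static void single_recovered(struct reporter_single *r)
{
	r->recover_count++;
}

/*
 * Record a recovery for a multi-remedy reporter.
 * Returns 0 on success, -1 if the reporter does not advertise the remedy.
 */
static int multi_recovered(struct reporter_multi *r, enum dlh_remedy used)
{
	assert(used > DLH_REMEDY_NONE && used < DLH_REMEDY_MAX);
	if (!(r->remedy_mask & (1u << used)))
		return -1;
	r->recover_count[used]++;
	return 0;
}
```

In layout A the counter already says which remedy ran, because the
reporter has exactly one. In layout B a single counter would no longer
say that, hence the array, and that array is the part that would have to
be exposed to user space.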