On 9/27/2018 5:34 PM, Jiri Pirko wrote:
Thu, Sep 27, 2018 at 04:02:48PM CEST, era...@mellanox.com wrote:


On 9/27/2018 3:47 PM, Jiri Pirko wrote:
Wed, Sep 26, 2018 at 01:52:58PM CEST, era...@mellanox.com wrote:
The exception spec is targeted for Real Time Alerting, in order to know when
something bad has happened to a PCI device
- Provide alert debug information
- Self healing
- If problem needs vendor support, provide a way to gather all needed debugging
   information.

The exception mechanism contains condition checkers which sense for
malfunction. Upon a condition hit, actions such as logs and correction
can be taken.

The condition checkers are divided into the following groups
- Hardware - a checker which is triggered by the device due to
   malfunction.
- Software - a checker which is triggered by the software due to
   malfunction.

What do you mean by a "software malfunction", a "FW malfunction"?
Also, I don't see these 2 groups in the man page.

Software malfunction can be a Transmit error (caused by bad send request).

Sorry, but I still don't understand what "software malfunction" you are
talking about. Could you be more specific please?

* The driver builds a bad send work request (bug in driver, bug in packet generator, etc.). When it sends it, it gets back an error completion from the HW. This error might leave the HW queue in an error state, so it cannot be used again until it is "recovered".

Condition: Error completion
Action: Queue recover
The entire scenario is due to SW malfunction.

* The driver tries to configure a HW QoS register, but the command is failed by the FW.

Condition: command execution error
Action: Dump of command + Dump of SW internal related DB + Dump of FW related DB
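
To make the condition/action pairing above concrete, here is a minimal sketch in Python. It is only a userspace model of the idea, not the devlink interface: every name in it (Group, Checker, CHECKERS, on_condition_hit, the action helpers, and the "qos_set" command string) is hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Group(Enum):
    """Who triggers the checker (hypothetical tags for this sketch)."""
    HARDWARE = "hw"  # triggered by the device
    SOFTWARE = "sw"  # triggered by the driver or stack


@dataclass
class Checker:
    """A condition checker and the actions to run when it is hit."""
    name: str
    group: Group
    actions: List[Callable[[dict], str]] = field(default_factory=list)


# Hypothetical actions; each returns a description instead of touching HW.
def queue_recover(ctx: dict) -> str:
    return f"recover queue {ctx['queue']}"


def dump_command(ctx: dict) -> str:
    return f"dump command {ctx['cmd']}"


def dump_sw_db(ctx: dict) -> str:
    return "dump SW internal DB"


def dump_fw_db(ctx: dict) -> str:
    return "dump FW DB"


# The two software-triggered examples from this mail.
CHECKERS: Dict[str, Checker] = {
    "error_completion": Checker(
        "error_completion", Group.SOFTWARE, [queue_recover]),
    "cmd_exec_error": Checker(
        "cmd_exec_error", Group.SOFTWARE,
        [dump_command, dump_sw_db, dump_fw_db]),
}


def on_condition_hit(name: str, ctx: dict) -> List[str]:
    """Run every action registered for the condition, in order."""
    checker = CHECKERS[name]
    return [action(ctx) for action in checker.actions]
```

For example, on_condition_hit("error_completion", {"queue": 3}) runs only the queue recovery, while a hit on "cmd_exec_error" runs the three dumps in the order listed.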

* Another existing example is the ndo_tx_timeout routine. (This is done in the networking stack layer, and can be configured today from sysfs.) If a vendor driver has another specific checking routine like this one (which needs to be configured from userspace), then it can be handled via devlink-exception and be tagged as a software condition.
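
A timeout-style software checker of this kind could look roughly like the sketch below. It is a simplified Python model, not the kernel watchdog code: the class name TxTimeoutChecker, its fields, and the callback are all assumptions made for illustration. The only behaviour it models is "the queue is stopped and no transmit has completed within the configured timeout".

```python
import time
from typing import Callable


class TxTimeoutChecker:
    """Hypothetical software condition checker for a stalled TX queue."""

    def __init__(self, timeout_s: float, on_timeout: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s    # configurable from "userspace"
        self.on_timeout = on_timeout  # action to run on a condition hit
        self.clock = clock            # injectable so it can be tested
        self.last_tx_done = clock()
        self.queue_stopped = False

    def tx_completed(self) -> None:
        """Record that the device completed a transmit."""
        self.last_tx_done = self.clock()

    def check(self) -> bool:
        """Return True (and run the action) if the condition is hit."""
        stalled_for = self.clock() - self.last_tx_done
        if self.queue_stopped and stalled_for > self.timeout_s:
            self.on_timeout()
            return True
        return False
```

The clock is passed in so that the timeout can be exercised without actually waiting; a periodic timer would call check() in a real setup.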



FW/HW malfunction can be any catastrophic error report (the ones that should
be exposed to driver).
The comment here was to highlight that we can support different kinds of
condition groups.
If, for a specific condition, we need to highlight whether it is SW or HW,
we can append that to its name.

Eran


