On 9/13/2018 6:12 PM, Andrew Lunn wrote:
        devlink health sensor set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on
            Sets TX_COMP_ERROR sensor parameters for a specific device.

This is what I had in mind:
1. command interface error
2. command interface timeout
3. stuck TX queue (like tx_timeout)
4. stuck TX completion queue (driver did not process packets in a reasonable
time period)
5. stuck RX queue
6. RX completion error
7. TX completion error
8. HW / FW catastrophic error report
9. completion queue overrun

Such issues do exist in production environments, and they need to be handled
even if the root cause is a bug that will be fixed in the latest release. My
feature should help developers and administrators control and recover their
live systems through auto-correction and logging support.
The goals are:
- Provide alert debug information
- Self healing
- If a problem needs vendor support, provide a way to gather all the needed
debugging information.
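
To make the per-sensor actions concrete, here is a rough sketch of the
idea, not the proposed kernel implementation. It assumes each sensor has
independently switchable "dump" and "reset" actions, as in the example
command above. The Sensor class, handle_error() and the handler callbacks
are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Assumption: debug information is collected before any recovery is attempted.
ACTION_ORDER = ("dump", "reset")


@dataclass
class Sensor:
    """Hypothetical model of one health sensor and its action switches."""
    name: str
    actions: Dict[str, bool] = field(
        default_factory=lambda: {"dump": True, "reset": False}
    )


def handle_error(sensor: Sensor,
                 handlers: Dict[str, Callable[[str], None]]) -> List[str]:
    """Run every enabled action for a triggered sensor, in ACTION_ORDER.

    Returns the names of the actions that actually ran.
    """
    ran = []
    for action in ACTION_ORDER:
        if sensor.actions.get(action, False) and action in handlers:
            handlers[action](sensor.name)
            ran.append(action)
    return ran


# Stand-in handlers that only record what would have been done.
log: List[str] = []
handlers = {
    "dump": lambda name: log.append(f"dump:{name}"),
    "reset": lambda name: log.append(f"reset:{name}"),
}

# Mirrors "action reset off action dump on" from the example command.
tx_comp = Sensor("TX_COMP_ERROR", {"reset": False, "dump": True})
ran = handle_error(tx_comp, handlers)
```

With reset off and dump on, only the dump handler runs when the sensor
triggers. Turning reset on as well would run it after the dump.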

So maybe you have the wrong name for this. Health is nice in terms of
Marketing, but we are actually talking about bug recovery.

The way I see it, this feature is responsible for the health of the system from the pci/xxxx perspective. I thought about devlink-recover, for example, but I really wouldn't like to name the feature after just one of its actions. The same goes for devlink-bug, which highlights only part of the range of capabilities (sensor).

My work is currently focused on error reporting and recovery, but I wouldn't like to see the API limited to "bugs" only.

Eran


devlink bug sensor set pci/0000:01:00.0 name command_interface_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name command_interface_timeout action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name transmit_completion_error action reset off action dump on
devlink bug sensor set pci/0000:01:00.0 name completion_queue_overrun action reset off action dump on

seems a lot more understandable than:

devlink health set pci/0000:01:00.0 name TX_COMP_ERROR action reset off action dump on

        Andrew
