On Thu, 2020-11-05 at 13:36 +0000, George Cherian wrote: > Hi Saeed, > > Thanks for the review. > > > -----Original Message----- > > From: Saeed Mahameed <sa...@kernel.org> > > Sent: Thursday, November 5, 2020 10:39 AM > > To: George Cherian <gcher...@marvell.com>; netdev@vger.kernel.org; > > linux-ker...@vger.kernel.org; Jiri Pirko <j...@nvidia.com> > > Cc: k...@kernel.org; da...@davemloft.net; Sunil Kovvuri Goutham > > <sgout...@marvell.com>; Linu Cherian <lcher...@marvell.com>; > > Geethasowjanya Akula <gak...@marvell.com>; masahi...@kernel.org; > > willemdebruijn.ker...@gmail.com > > Subject: Re: [PATCH v2 net-next 3/3] octeontx2-af: Add devlink > > health > > reporters for NIX > > > > On Wed, 2020-11-04 at 17:57 +0530, George Cherian wrote: > > > Add health reporters for RVU NPA block. > > ^^^ NIX ? > > > Yes, it's NIX. > > > Cc: Jiri > > > > Anyway, could you please spare some words on what is NPA and what > > is > > NIX? > > > > Regarding the reporters names, all drivers register well known > > generic names > > such as (fw,hw,rx,tx), I don't know if it is a good idea to use > > vendor specific > > names, if you are reporting for hw/fw units then just use "hw" or > > "fw" as the > > reporter name and append the unit NPA/NIX to the counter/error > > names. > Okay. These are hw units, I will rename them as hw_npa/hw_nix.
What i meant is have one reporter named "hw" and inside it report counters with their unit name appended to the counter name. ./devlink health pci/0002:01:00.0: reporter hw state healthy error 0 recover 0 ./devlink health dump show pci/0002:01:00.0 reporter hw NIX: nix_counter_a: 0 ... NPA: npa_counter_a: 0 ... > > > Only reporter dump is supported. > > > > > > Output: > > > # ./devlink health > > > pci/0002:01:00.0: > > > reporter npa > > > state healthy error 0 recover 0 > > > reporter nix > > > state healthy error 0 recover 0 > > > # ./devlink health dump show pci/0002:01:00.0 reporter nix > > > NIX_AF_GENERAL: > > > Memory Fault on NIX_AQ_INST_S read: 0 > > > Memory Fault on NIX_AQ_RES_S write: 0 > > > AQ Doorbell error: 0 > > > Rx on unmapped PF_FUNC: 0 > > > Rx multicast replication error: 0 > > > Memory fault on NIX_RX_MCE_S read: 0 > > > Memory fault on multicast WQE read: 0 > > > Memory fault on mirror WQE read: 0 > > > Memory fault on mirror pkt write: 0 > > > Memory fault on multicast pkt write: 0 > > > NIX_AF_RAS: > > > Poisoned data on NIX_AQ_INST_S read: 0 > > > Poisoned data on NIX_AQ_RES_S write: 0 > > > Poisoned data on HW context read: 0 > > > Poisoned data on packet read from mirror buffer: 0 > > > Poisoned data on packet read from mcast buffer: 0 > > > Poisoned data on WQE read from mirror buffer: 0 > > > Poisoned data on WQE read from multicast buffer: 0 > > > Poisoned data on NIX_RX_MCE_S read: 0 > > > NIX_AF_RVU: > > > Unmap Slot Error: 0 > > > > > > > Now i am a little bit skeptic here, devlink health reporter > > infrastructure was > > never meant to deal with dump op only, the main purpose is to > > diagnose/dump and recover. > > > > especially in your use case where you only report counters, i don't > > believe > > devlink health dump is a proper interface for this. > These are not counters. These are error interrupts raised by HW > blocks. > The count is provided to understand on how frequently the errors are > seen. > Error recovery for some of the blocks happen internally. That is the > reason, > Currently only dump op is added. So you are counting these events in driver, sounds like a counter to me, i really think this shouldn't belong to devlink, unless you really utilize devlink health ops for actual reporting and recovery. what's wrong with just dumping these counters to ethtool ? > > Many of these counters if not most are data path packet based and > > maybe > > they should belong to ethtool. > > Regards, > -George > > >