On Thu, Aug 27, 2020 at 05:54:43PM +0800, Shung-Hsi Yu wrote:
On Wed, Aug 26, 2020 at 03:52:06PM +0800, Shung-Hsi Yu wrote:
On Sat, Aug 15, 2020 at 12:05:58AM +0800, Coiby Xu wrote:
> This patch set aims to avoid dumping registers, data structures and
> coredump to dmesg and also to reduce the code size of the qlge driver.
>
> As pointed out by Benjamin [1],
>
> > At 2000 lines, qlge_dbg.c alone is larger than some entire ethernet
> > drivers. Most of what it does is dump kernel data structures or pci
> > memory mapped registers to dmesg. There are better facilities for that.
> > My thinking is not simply to delete qlge_dbg.c but to replace it, making
> > sure that most of the same information is still available. For data
> > structures, crash or drgn can be used; possibly with a script for the
> > latter which formats the data. For pci registers, they should be
> > included in the ethtool register dump and a patch added to ethtool to
> > pretty print them. That's what other drivers like e1000e do. For the
> > "coredump", devlink health can be used.
>
> So the debugging features are re-written following Benjamin's advice,
>    - use ethtool to dump registers
>    - dump kernel data structures in drgn
>    - use devlink health to do coredump
>
> The get_regs ethtool_ops has already implemented. What lacks is a patch
> for the userland ethtool to do the pretty-printing. I haven't yet provided
> a patch to the userland ethtool because I'm aware ethtool is moving towards
> the netlink interface [2]. I'm curious if a generalized mechanism of
> pretty-printing will be implemented thus making pretty-printing for a
> specific driver unnecessary. As of this writing, `-d|--register-dump`
> hasn't been implemented for the netlink interface.
>
>
> To dump kernel data structures, the following Python script can be used
> in drgn,
>
>
>     ```python
>     def align(x, a):
>         """the alignment a should be a power of 2
>         """
>         mask = a - 1
>         return (x+ mask) & ~mask
>
>     def struct_size(struct_type):
>         struct_str = "struct {}".format(struct_type)
>         return sizeof(Object(prog, struct_str, address=0x0))
>
>     def netdev_priv(netdevice):
>         NETDEV_ALIGN = 32
>         return netdevice.value_() + align(struct_size("net_device"), 
NETDEV_ALIGN)
>
>     name = 'xxx'
>     qlge_device = None
>     netdevices = prog['init_net'].dev_base_head.address_of_()
>     for netdevice in list_for_each_entry("struct net_device", netdevices, 
"dev_list"):
>         if netdevice.name.string_().decode('ascii') == name:
>             print(netdevice.name)
>
>     ql_adapter = Object(prog, "struct ql_adapter", 
address=netdev_priv(qlge_device))
>     ```
>
> The struct ql_adapter will be printed in drgn as follows,
>     >>> ql_adapter
>     (struct ql_adapter){
>             .ricb = (struct ricb){
>                     .base_cq = (u8)0,
>                     .flags = (u8)120,
>                     .mask = (__le16)26637,
>                     .hash_cq_id = (u8 [1024]){ 172, 142, 255, 255 },
>                     .ipv6_hash_key = (__le32 [10]){},
>                     .ipv4_hash_key = (__le32 [4]){},
>             },
>             .flags = (unsigned long)0,
>             .wol = (u32)0,
>             .nic_stats = (struct nic_stats){
>                     .tx_pkts = (u64)0,
>                     .tx_bytes = (u64)0,
>                     .tx_mcast_pkts = (u64)0,
>                     .tx_bcast_pkts = (u64)0,
>                     .tx_ucast_pkts = (u64)0,
>                     .tx_ctl_pkts = (u64)0,
>                     .tx_pause_pkts = (u64)0,
>                     ...
>             },
>             .active_vlans = (unsigned long [64]){
>                     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 52780853100545, 18446744073709551615,
>                     18446619461681283072, 0, 42949673024, 2147483647,
>             },
>             .rx_ring = (struct rx_ring [17]){
>                     {
>                             .cqicb = (struct cqicb){
>                                     .msix_vect = (u8)0,
>                                     .reserved1 = (u8)0,
>                                     .reserved2 = (u8)0,
>                                     .flags = (u8)0,
>                                     .len = (__le16)0,
>                                     .rid = (__le16)0,
>                                     ...
>                             },
>                             .cq_base = (void *)0x0,
>                             .cq_base_dma = (dma_addr_t)0,
>                     }
>                     ...
>             }
>     }
>
>
> And the coredump obtained via devlink in json format looks like,
>
>     $ devlink health dump show DEVICE reporter coredump -p -j
>     {
>         "Core Registers": {
>             "segment": 1,
>             "values": [ 
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
 ]
>         },
>         "Test Logic Regs": {
>             "segment": 2,
>             "values": [ 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 ]
>         },
>         "RMII Registers": {
>             "segment": 3,
>             "values": [ 
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
 ]
>         },
>         ...
>         "Sem Registers": {
>             "segment": 50,
>             "values": [ 0,0,0,0 ]
>         }
>     }
>
> Since I don't have a QLGE device and neither could I find a software
> simulator, I put some functions into e1000 to get the above result.

I'm not familiar with qlge, but I do happen to have one accessible. You can
add me to CC for future version of this patch.

Testing with the refreshed patch set I got from Benjamin (apply over commit
fc80c51fd4b2) with debugging options KASAN, UBSAN, DEBUG_ATOMIC_SLEEP,
PROVE_LOCKING and DEBUG_KMEMLEAK enabled, I can verify coredump through
devlink works.

Benjamin reminded me to check for errors in the remove callpaths as well,
and some error showed up in the remove callpath indeed.

 $ modprobe qlge
 [ 1314.256249] qlge: module is from the staging directory, the quality is 
unknown, you have been warned.
 [ 1320.184708] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
 [ 1320.185619] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
 [ 1320.196987] qlge 0000:00:04.0 eth0: Link is down.
 [ 1320.197830] qlge 0000:00:04.0 eth0: Clearing MAC address
 [ 1320.199184] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC 
Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1320.200793] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
 [ 1320.276835] qlge 0000:00:04.0 ens4: renamed from eth0
 [ 1326.920224] PCI Interrupt Link [LNKB] enabled at IRQ 10
 [ 1326.943443] qlge 0000:00:05.0 eth0: Link is down.
 [ 1326.944188] qlge 0000:00:05.0 eth0: Clearing MAC address
 [ 1326.944975] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC 
Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1326.946396] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
 [ 1326.997071] qlge 0000:00:05.0 ens5: renamed from eth0
 $ rmmod qlge

Simple load and unload works fine.

 $ modprobe qlge
 [ 1348.642124] qlge: module is from the staging directory, the quality is 
unknown, you have been warned.
 [ 1355.090448] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
 [ 1355.091439] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
 [ 1355.098890] qlge 0000:00:04.0 eth0: Link is down.
 [ 1355.099647] qlge 0000:00:04.0 eth0: Clearing MAC address
 [ 1355.100708] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC 
Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1355.102301] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
 [ 1355.170258] qlge 0000:00:04.0 ens4: renamed from eth0
 [ 1361.809918] qlge 0000:00:05.0 eth0: Link is down.
 [ 1361.810674] qlge 0000:00:05.0 eth0: Clearing MAC address
 [ 1361.811431] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC 
Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1361.812594] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
 [ 1361.874453] qlge 0000:00:05.0 ens5: renamed from eth0
 $ ip link set ens4 up
 [ 1370.994196] qlge 0000:00:04.0 ens4: MSI-X Enabled, got 2 vectors.
 [ 1371.199630] qlge 0000:00:04.0 ens4: Clearing MAC address
 [ 1371.299084] qlge 0000:00:04.0 ens4: Passed Get Port Configuration.
 $ devlink health dump show pci/0000:00:04.0 reporter coredump -p -j >log.jsonho
 $ rmmod qlge
 [ 1413.537710] qlge 0000:00:04.0 ens4: Link is down.
 [ 1413.538570] qlge 0000:00:04.0 ens4: Clearing MAC address
 [ 1413.539819] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.541068] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.649713] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.761705] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.873716] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1413.985746] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1414.097743] qlge 0000:00:04.0 ens4: Command not supported by firmware.

Unloading the module after coredump through devlink shows the above error.
At first glance this look quite similar to the errors I get when trying to
coredump when the interface is down, causing the device to go into an
invalid state as well.

 $ modprobe qlge
 [ 1606.229089] qlge: module is from the staging directory, the quality is 
unknown, you have been warned.
 [ 1612.423342] qlge 0000:00:04.0: QLogic 10 Gigabit PCI-E Ethernet Driver
 [ 1612.424551] qlge 0000:00:04.0: Driver name: qlge, Version: 1.00.00.35.
 [ 1612.432718] qlge 0000:00:04.0 eth0: Link is down.
 [ 1612.433627] qlge 0000:00:04.0 eth0: Clearing MAC address
 [ 1612.434621] qlge 0000:00:04.0 eth0: Function #0, Port 0, NIC Roll 0, NIC 
Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1612.436582] qlge 0000:00:04.0 eth0: MAC address 00:c0:dd:14:54:b8
 [ 1612.494647] qlge 0000:00:04.0 ens4: renamed from eth0
 [ 1619.078325] qlge 0000:00:05.0 eth0: Link is down.
 [ 1619.079119] qlge 0000:00:05.0 eth0: Clearing MAC address
 [ 1619.080165] qlge 0000:00:05.0 eth0: Function #1, Port 1, NIC Roll 0, NIC 
Rev = 1, XG Roll = 0, XG Rev = 1.
 [ 1619.081938] qlge 0000:00:05.0 eth0: MAC address 00:c0:dd:14:54:ba
 [ 1619.154477] qlge 0000:00:05.0 ens5: renamed from eth0
 $ ip link set ens4 up
 [ 1629.717532] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1629.720692] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1629.827878] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1629.939891] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.051855] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.163885] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.275872] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.387893] qlge 0000:00:04.0 ens4: Command not supported by firmware.
 [ 1630.412685] qlge 0000:00:04.0 ens4: MSI-X Enabled, got 2 vectors.
 [ 1630.420868] qlge 0000:00:04.0 ens4: Failed about firmware command
 [ 1630.425110] qlge 0000:00:04.0 ens4: Failed to start port.
 [ 1630.427966] qlge 0000:00:04.0 ens4: Clearing MAC address
 [ 1630.487603] qlge 0000:00:04.0 ens4: Failed Get Port Configuration.

I'll report when I've more findings.


Thank you for the testing! I will assume these errors are caused by
coredump when the interface is down. Please let me know it still occurs
in v1.

<snip>


--
Best regards,
Coiby

Reply via email to