On Tue, Dec 12, 2017 at 5:21 AM, Qing Huang <qing.hu...@oracle.com> wrote:
> Hi,
>
> We found an issue with the bonding driver when testing Mellanox devices.
> The following test commands will stall the whole system sometimes, with
> serial console flooded with log messages from the bond_miimon_inspect()
> function. Setting mtu size to be 1500 seems okay but very rarely it may
> hit the same problem too.
>
> ip address flush dev ens3f0
> ip link set dev ens3f0 down
> ip address flush dev ens3f1
> ip link set dev ens3f1 down
> [root@ca-hcl629 etc]# modprobe bonding mode=0 miimon=250 use_carrier=1 updelay=500 downdelay=500
> [root@ca-hcl629 etc]# ifconfig bond0 up
> [root@ca-hcl629 etc]# ifenslave bond0 ens3f0 ens3f1
> [root@ca-hcl629 etc]# ip link set bond0 mtu 4500 up
>
> Serial console output:
>
> ** 4 printk messages dropped ** [ 3717.743761] bond0: link status down for
> interface ens3f0, disabling it in 500 ms

[..]

> It seems that when setting a large mtu size on an RoCE interface, the RTNL
> mutex may be held too long by the slave interface, causing
> bond_mii_monitor() to be called repeatedly at an interval of 1 tick
> (1K HZ kernel configuration) and kernel to become unresponsive.

Did you try, and manage, to reproduce that with other NIC drivers as well?

Or.
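
To make the 1-tick re-arm described in the report concrete, here is a toy
model of that polling pattern. It is not kernel code: the function name, the
tick accounting and the way the lock is modelled are all assumptions made for
illustration, and it assumes the monitor needs the lock on every run. It only
shows how the number of monitor runs grows while the lock stays busy.

```python
# Toy model of the polling pattern described in the report.
# NOT kernel code: names and the lock model are illustrative assumptions.

HZ = 1000        # ticks per second, the "1K HZ" configuration from the report
MIIMON_MS = 250  # miimon=250 from the modprobe line


def count_monitor_runs(lock_held_ticks, total_ticks,
                       miimon_ticks=MIIMON_MS * HZ // 1000):
    """Count how many times the monitor runs within total_ticks ticks.

    The lock is modelled as busy during ticks [0, lock_held_ticks).
    Assumption: every run needs the lock. If it is busy, the monitor
    retries after 1 tick; otherwise it waits the normal miimon interval.
    """
    runs = 0
    now = 0
    while now < total_ticks:
        runs += 1
        if now < lock_held_ticks:
            now += 1             # lock busy: re-arm after a single tick
        else:
            now += miimon_ticks  # lock free: re-arm after the miimon interval
    return runs


if __name__ == "__main__":
    print(count_monitor_runs(0, 1000))     # lock never busy: 4 runs per second
    print(count_monitor_runs(1000, 1000))  # lock busy for the whole second: 1000 runs
```

In this model, one second with the lock free costs 4 monitor runs, while one
second with the lock held costs 1000, each of which may also log a message.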