Jay, Moni

I ran some tests with 2.6.24-rc1 plus the first bonding patch that Jay sent to netdev last night. Basic operation and failover work fine. However, I see some crashes that seem to be related to destroying the bond when the slaves are IPoIB devices. I don't see similar crashes when enslaving Ethernet devices (Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)). My compressed .config is attached.

The first type of oops occurs when I just run modprobe -r bonding after enslaving the IPoIB devices:

Ethernet Channel Bonding Driver: v3.2.1 (October 15, 2007)
bonding: MII link monitoring set to 100 ms
bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
NET: Registered protocol family 10
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib0.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will 
be blocked as long as ib0 is part of bond bond0
bonding: bond0: enslaving ib0 as a backup interface with a down link.
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib1.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will 
be blocked as long as ib1 is part of bond bond0
bonding: bond0: enslaving ib1 as a backup interface with a down link.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: link status definitely up for interface ib0.
bonding: bond0: link status definitely up for interface ib1.
bonding: bond0: making interface ib0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
eth0: no IPv6 routers present
bond0: no IPv6 routers present
bonding: bond0: released all slaves
Unable to handle kernel paging request at ffffffff880a07ce RIP:
 [<ffffffff880a07ce>]
PGD 203067 PUD 207063 PMD 2060f067 PTE 0
Oops: 0010 [1] SMP
CPU 0
Modules linked in: ib_ipoib ib_cm ib_sa ipv6 sg st sd_mod sr_mod scsi_mod e100 ib_mthca ib_mad ib_core i2c_amd8111 i2c_core
Pid: 14604, comm: bond0 Not tainted 2.6.24-rc1 #1
RIP: 0010:[<ffffffff880a07ce>]  [<ffffffff880a07ce>]
RSP: 0018:ffff810008439e98  EFLAGS: 00010247
RAX: ffff810004da20c0 RBX: ffff810004da20c0 RCX: ffff81000315aa68
RDX: ffff810004da20c8 RSI: ffff810008439ef0 RDI: ffff81000315aa60
RBP: ffffffff880a07ce R08: ffff810008438000 R09: ffff81000152d0d8
R10: ffff810004da20c0 R11: ffff810009574000 R12: 00000000fffffffc
R13: ffffffffffffffff R14: ffffffff8063b820 R15: 0000000000000000
FS:  00002af0c528b0a0(0000) GS:ffffffff805d4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffffffff880a07ce CR3: 000000002852f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bond0 (pid: 14604, threadinfo ffff810008438000, task ffff8100024970c0)
Stack:  ffffffff802445c6 ffff810008439f08 ffff810004da20c0 ffffffff80244652
 ffffffff8024473f 0000000000000000 ffff8100024970c0 ffffffff80248320
 ffff810008439f08 ffff810008439f08 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff802445c6>] run_workqueue+0x83/0x10f
 [<ffffffff80244652>] worker_thread+0x0/0xf7
 [<ffffffff8024473f>] worker_thread+0xed/0xf7
 [<ffffffff80248320>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80248320>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80247fe6>] kthread+0x3d/0x63
 [<ffffffff8020c4a8>] child_rip+0xa/0x12
 [<ffffffff80247fa9>] kthread+0x0/0x63
 [<ffffffff8020c49e>] child_rip+0x0/0x12


Code:  Bad RIP value.
RIP  [<ffffffff880a07ce>]
 RSP <ffff810008439e98>
CR2: ffffffff880a07ce

The second type of oops occurs when I run modprobe -r ib_ipoib after enslavement. I was not able to test this one with Ethernet, as the tg3 driver is built into my kernel:

Nov  7 14:31:56 dill kernel: bonding: bond0: Setting MII monitoring interval to 
100.
Nov  7 14:31:56 dill kernel: bonding: bond0: Adding slave ib0.
Nov  7 14:31:56 dill kernel: bonding: bond0: Warning: enslaved VLAN challenged 
slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0
Nov  7 14:31:56 dill kernel: bonding: bond0: Warning: The first slave device 
specified does not support setting the MAC address. Enabling the fail_over_mac 
option.<6>bonding: bond0: enslaving ib0 as a backup interface with a down link.
Nov  7 14:31:56 dill kernel: bonding: bond0: Adding slave ib1.
Nov  7 14:31:56 dill kernel: bonding: bond0: Warning: enslaved VLAN challenged 
slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0
Nov  7 14:31:56 dill kernel: bonding: bond0: enslaving ib1 as a backup 
interface with a down link.
Nov  7 14:31:56 dill kernel: bonding: bond0: link status definitely up for 
interface ib0.
Nov  7 14:31:56 dill kernel: bonding: bond0: making interface ib0 the new 
active one.
Nov  7 14:31:56 dill kernel: bonding: bond0: first active interface up!
Nov  7 14:31:56 dill kernel: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Nov  7 14:31:56 dill kernel: ib0: multicast join failed for 
0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov  7 14:31:56 dill kernel: bonding: bond0: link status definitely up for 
interface ib1.
Nov  7 14:31:58 dill kernel: ib0: multicast join failed for 
0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov  7 14:32:02 dill kernel: ib0: multicast join failed for 
0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov  7 14:32:07 dill kernel: bond0: no IPv6 routers present
Nov  7 14:32:10 dill kernel: ib0: multicast join failed for 
0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov  7 14:32:12 dill ypbind[14475]: broadcast: RPC: Timed out.
Nov  7 14:32:18 dill kernel: ib0: cm send completion event with wrid 1073741823 
(> 64)
Nov  7 14:32:23 dill kernel: ib0: RX drain timing out
Nov  7 14:32:23 dill kernel: bonding: bond0: Warning: the permanent HWaddr of 
ib0 - 80:06:04:04:fe:80 - is still in use by bond0. Set the HWaddr of ib0 to a 
different address to avoid conflicts.
Nov  7 14:32:23 dill kernel: bonding: bond0: releasing active interface ib0
Nov  7 14:32:23 dill kernel: bonding: bond0: making interface ib1 the new 
active one.
Nov  7 14:32:23 dill kernel: ib1: multicast join failed for 
0001:0000:ffff:0000:0000:0000:0070:5229, status -22
Nov  7 14:32:23 dill kernel: bonding: bond0: releasing active interface ib1
Nov  7 14:32:23 dill kernel: bonding: bond0: destroying bond bond0.
Nov  7 14:32:23 dill kernel: __dev_addr_discard: address leakage! da_users=1
Nov  7 14:32:23 dill kernel: Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
Nov  7 14:32:23 dill kernel:  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
Nov  7 14:32:23 dill kernel: PGD 250a067 PUD 40a4067 PMD 0
Nov  7 14:32:23 dill kernel: Oops: 0000 [1] SMP
Nov  7 14:32:23 dill kernel: CPU 1
Nov  7 14:32:23 dill kernel: Modules linked in: ib_ipoib ib_cm ib_sa bonding e100 ipv6 sg st sd_mod sr_mod scsi_mod ib_mthca ib_mad ib_core i2c_amd756 i2c_amd8111 i2c_core
Nov  7 14:32:23 dill kernel: Pid: 18870, comm: modprobe Not tainted 2.6.24-rc1 
#1
Nov  7 14:32:23 dill kernel: RIP: 0010:[<ffffffff802be76f>]  
[<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
Nov  7 14:32:23 dill kernel: RSP: 0018:ffff8100264e3da8  EFLAGS: 00010246
Nov  7 14:32:23 dill kernel: RAX: 0000000000000000 RBX: ffffffff88111959 RCX: 
000000000000000a
Nov  7 14:32:23 dill kernel: RDX: ffff8100264e3fd8 RSI: ffffffff88111959 RDI: 
0000000000000000
Nov  7 14:32:23 dill kernel: RBP: ffffffff88111959 R08: ffff810020705d70 R09: 
ffff810020012ae8
Nov  7 14:32:23 dill kernel: R10: 0000000000000000 R11: 0000000000000286 R12: 
0000000000000000
Nov  7 14:32:23 dill kernel: R13: ffff810028d9e000 R14: 0000000000000006 R15: 
0000000000515ab0
Nov  7 14:32:23 dill kernel: FS:  00002adaba330720(0000) 
GS:ffff81002053dac0(0000) knlGS:0000000000000000
Nov  7 14:32:23 dill kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  7 14:32:23 dill kernel: CR2: 0000000000000028 CR3: 0000000001cca000 CR4: 
00000000000006e0
Nov  7 14:32:23 dill kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
Nov  7 14:32:23 dill kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 
0000000000000400
Nov  7 14:32:23 dill kernel: Process modprobe (pid: 18870, threadinfo 
ffff8100264e2000, task ffff8100264ed790)
Nov  7 14:32:23 dill kernel: Stack:  0000000000000000 ffffffff88111959 
ffffffff88117680 ffffffff802be87b
Nov  7 14:32:23 dill kernel:  0000000000000006 0000000000000000 
ffff810006578700 ffffffff802bfb83
Nov  7 14:32:23 dill kernel:  ffff810020705d70 ffff810006578000 
0000000000000000 ffffffff88107bd6
Nov  7 14:32:24 dill kernel: Call Trace:
Nov  7 14:32:24 dill kernel:  [<ffffffff802be87b>] sysfs_get_dirent+0x21/0x6c
Nov  7 14:32:24 dill kernel:  [<ffffffff802bfb83>] sysfs_remove_group+0x1b/0x92
Nov  7 14:32:24 dill kernel:  [<ffffffff88107bd6>] 
:bonding:bond_release_and_destroy+0x3d/0x44
Nov  7 14:32:24 dill kernel:  [<ffffffff88107c92>] 
:bonding:bond_netdev_event+0xb5/0xca
Nov  7 14:32:24 dill kernel:  [<ffffffff8046e55e>] notifier_call_chain+0x30/0x54
Nov  7 14:32:24 dill kernel:  [<ffffffff8041845d>] 
unregister_netdevice+0xc3/0x15a
Nov  7 14:32:24 dill kernel:  [<ffffffff80418505>] unregister_netdev+0x11/0x17
Nov  7 14:32:24 dill kernel:  [<ffffffff880f2be4>] 
:ib_ipoib:ipoib_remove_one+0x64/0xa5
Nov  7 14:32:24 dill kernel:  [<ffffffff88015069>] 
:ib_core:ib_unregister_client+0x43/0xfe
Nov  7 14:32:24 dill kernel:  [<ffffffff880fb071>] 
:ib_ipoib:ipoib_cleanup_module+0xd/0x2b
Nov  7 14:32:24 dill kernel:  [<ffffffff802557b1>] sys_delete_module+0x1b1/0x1e2
Nov  7 14:32:24 dill kernel:  [<ffffffff80329b00>] __downgrade_write+0x5f/0xb1
Nov  7 14:32:24 dill kernel:  [<ffffffff8026eb2e>] sys_munmap+0x4a/0x56
Nov  7 14:32:24 dill kernel:  [<ffffffff8020b68e>] system_call+0x7e/0x83
Nov  7 14:32:24 dill kernel:
Nov  7 14:32:24 dill kernel:
Nov  7 14:32:24 dill kernel: Code: 48 8b 5f 28 48 85 db 74 1c 48 8b 7b 18 48 89 ee e8 f6 b6 06
Nov  7 14:32:24 dill kernel: RIP  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
Nov  7 14:32:24 dill kernel:  RSP <ffff8100264e3da8>
Nov  7 14:32:24 dill kernel: CR2: 0000000000000028

The third type of oops occurred after I did some failovers and then removed both slaves from the bond using
echo -$slave > /sys/class/net/bond0/bonding/slaves
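
For anyone trying to reproduce this, the removal step can be sketched as a loop over the slaves. This is only a minimal sketch, assuming the slaves are ib0 and ib1 as in the log below; the BOND_SYSFS and DRY_RUN variables and the release_slaves function are hypothetical names introduced here, not part of the bonding driver. With DRY_RUN left at 1 it only prints the writes it would perform.

```shell
#!/bin/bash
# Sketch: detach each slave from the bond through sysfs, one write per slave.
# BOND_SYSFS and DRY_RUN are hypothetical knobs added for illustration.
BOND_SYSFS=${BOND_SYSFS:-/sys/class/net/bond0/bonding}
DRY_RUN=${DRY_RUN:-1}

release_slaves() {
    local slave
    for slave in "$@"; do
        if [ "$DRY_RUN" = 1 ]; then
            # Print the command instead of touching sysfs.
            echo "echo -$slave > $BOND_SYSFS/slaves"
        else
            # Writing "-<name>" to the slaves file releases that slave.
            echo "-$slave" > "$BOND_SYSFS/slaves"
        fi
    done
}

release_slaves ib0 ib1
```

Running it as root with DRY_RUN=0 performs the actual writes; on the kernel described here, that is the step after which the oops appeared.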

Ethernet Channel Bonding Driver: v3.2.1 (October 15, 2007)
bonding: MII link monitoring set to 100 ms
bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib0.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will 
be blocked as long as ib0 is part of bond bond0
bonding: bond0: enslaving ib0 as a backup interface with a down link.
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib1.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will 
be blocked as long as ib1 is part of bond bond0
bonding: bond0: enslaving ib1 as a backup interface with a down link.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: link status definitely up for interface ib0.
bonding: bond0: link status definitely up for interface ib1.
bonding: bond0: making interface ib0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bond0: no IPv6 routers present
bonding: bond0: link status definitely down for interface ib0, disabling it
bonding: bond0: making interface ib1 the new active one.
bonding: bond0: link status definitely up for interface ib0.
bonding: bond0: link status definitely down for interface ib1, disabling it
bonding: bond0: making interface ib0 the new active one.
bonding: bond0: Removing slave ib0
bonding: bond0: Warning: the permanent HWaddr of ib0 - 80:08:04:04:fe:80 - is 
still in use by bond0. Set the HWaddr of ib0 to a different address to avoid 
conflicts.
bonding: bond0: releasing active interface ib0
bonding: bond0: Removing slave ib1
bonding: bond0: releasing backup interface ib1
bonding: bond0: destroying bond bond0.
Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
 [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
PGD 48a0067 PUD 285f067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: ib_ipoib ib_cm ib_sa bonding ipv6 sg st sd_mod sr_mod scsi_mod e100 ib_mthca ib_mad ib_core i2c_amd756 i2c_amd8111 i2c_core
Pid: 16811, comm: bash Not tainted 2.6.24-rc1 #1
RIP: 0010:[<ffffffff802be76f>]  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
RSP: 0018:ffff8100049a5dd8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff880ae959 RCX: 0000000000000002
RDX: ffff8100049a5fd8 RSI: ffffffff880ae959 RDI: 0000000000000000
RBP: ffffffff880ae959 R08: ffff8100205f5d70 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8100300c7000 R15: ffff8100049a5e69
FS:  00002afe4e9870a0(0000) GS:ffff81002053dac0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 000000000248f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 16811, threadinfo ffff8100049a4000, task ffff810001d91750)
Stack:  0000000000000000 ffffffff880ae959 ffffffff880b4680 ffffffff802be87b
 0000000000000006 0000000000000000 ffff81000e081700 ffffffff802bfb83
 ffff8100205f5d70 ffff81000e081000 0000000000000000 ffffffff880a4bd6
Call Trace:
 [<ffffffff802be87b>] sysfs_get_dirent+0x21/0x6c
 [<ffffffff802bfb83>] sysfs_remove_group+0x1b/0x92
 [<ffffffff880a4bd6>] :bonding:bond_release_and_destroy+0x3d/0x44
 [<ffffffff880aa685>] :bonding:bonding_store_slaves+0x29a/0x352
 [<ffffffff8038a0c7>] dev_attr_store+0x1c/0x1e
 [<ffffffff802be03d>] sysfs_write_file+0xca/0xfc
 [<ffffffff802832fa>] vfs_write+0xae/0x130
 [<ffffffff8028343b>] sys_write+0x45/0x6e
 [<ffffffff8020b68e>] system_call+0x7e/0x83


Code: 48 8b 5f 28 48 85 db 74 1c 48 8b 7b 18 48 89 ee e8 f6 b6 06
RIP  [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
 RSP <ffff8100049a5dd8>
CR2: 0000000000000028


Here's the script I use to set up the bond and do the enslavement:
#!/bin/bash

SLAVE_A=ib0
SLAVE_B=ib1
ADDR=192.168.10.118

#SLAVE_A=eth0
#SLAVE_B=eth1
#ADDR=172.30.10.6

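# Load the bonding driver; this creates bond0
/sbin/modprobe bonding

# Mode 1 is active-backup; check MII link state every 100 ms
echo 1 > /sys/class/net/bond0/bonding/mode
echo 100 > /sys/class/net/bond0/bonding/miimon

# Load IPoIB so that ib0 and ib1 exist before enslaving them
/sbin/modprobe ib_ipoib

# Enslave both devices to bond0
echo +$SLAVE_A > /sys/class/net/bond0/bonding/slaves
echo +$SLAVE_B > /sys/class/net/bond0/bonding/slaves

# Assign the address, which also brings bond0 up
ifconfig bond0 $ADDR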

Or.

Attachment: config-2.6.24-rc1.bz2
Description: Binary data
