Jay, Moni,

I did some tests with 2.6.24-rc1 and the first bonding patch that Jay sent to netdev last night. Basic operation and failover work fine. However, I see some crashes that seem to be related to destroying the bond when the slaves are ipoib devices. I don't see similar crashes when enslaving ethernet devices (Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 03)). My compressed dot config is attached.
The first type of oops is when I just do modprobe -r bonding after enslavement of the ipoib devices:

Ethernet Channel Bonding Driver: v3.2.1 (October 15, 2007)
bonding: MII link monitoring set to 100 ms
bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
NET: Registered protocol family 10
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib0.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0
bonding: bond0: enslaving ib0 as a backup interface with a down link.
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib1.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0
bonding: bond0: enslaving ib1 as a backup interface with a down link.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: link status definitely up for interface ib0.
bonding: bond0: link status definitely up for interface ib1.
bonding: bond0: making interface ib0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
eth0: no IPv6 routers present
bond0: no IPv6 routers present
bonding: bond0: released all slaves
Unable to handle kernel paging request at ffffffff880a07ce RIP:
 [<ffffffff880a07ce>]
PGD 203067 PUD 207063 PMD 2060f067 PTE 0
Oops: 0010 [1] SMP
CPU 0
Modules linked in: ib_ipoib ib_cm ib_sa ipv6 sg st sd_mod sr_mod scsi_mod e100 ib_mthca ib_mad ib_core i2c_amd8111 i2c_core
Pid: 14604, comm: bond0 Not tainted 2.6.24-rc1 #1
RIP: 0010:[<ffffffff880a07ce>] [<ffffffff880a07ce>]
RSP: 0018:ffff810008439e98 EFLAGS: 00010247
RAX: ffff810004da20c0 RBX: ffff810004da20c0 RCX: ffff81000315aa68
RDX: ffff810004da20c8 RSI: ffff810008439ef0 RDI: ffff81000315aa60
RBP: ffffffff880a07ce R08: ffff810008438000 R09: ffff81000152d0d8
R10: ffff810004da20c0 R11: ffff810009574000 R12: 00000000fffffffc
R13: ffffffffffffffff R14: ffffffff8063b820 R15: 0000000000000000
FS: 00002af0c528b0a0(0000) GS:ffffffff805d4000(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: ffffffff880a07ce CR3: 000000002852f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bond0 (pid: 14604, threadinfo ffff810008438000, task ffff8100024970c0)
Stack: ffffffff802445c6 ffff810008439f08 ffff810004da20c0 ffffffff80244652
 ffffffff8024473f 0000000000000000 ffff8100024970c0 ffffffff80248320
 ffff810008439f08 ffff810008439f08 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff802445c6>] run_workqueue+0x83/0x10f
 [<ffffffff80244652>] worker_thread+0x0/0xf7
 [<ffffffff8024473f>] worker_thread+0xed/0xf7
 [<ffffffff80248320>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80248320>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80247fe6>] kthread+0x3d/0x63
 [<ffffffff8020c4a8>] child_rip+0xa/0x12
 [<ffffffff80247fa9>] kthread+0x0/0x63
 [<ffffffff8020c49e>] child_rip+0x0/0x12

Code: Bad RIP value.
RIP [<ffffffff880a07ce>]
 RSP <ffff810008439e98>
CR2: ffffffff880a07ce

The second type of oops is when I modprobe -r ib_ipoib after enslavement. I was not able to test this one with ethernet, as the tg3 code is built into my kernel:

Nov 7 14:31:56 dill kernel: bonding: bond0: Setting MII monitoring interval to 100.
Nov 7 14:31:56 dill kernel: bonding: bond0: Adding slave ib0.
Nov 7 14:31:56 dill kernel: bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0
Nov 7 14:31:56 dill kernel: bonding: bond0: Warning: The first slave device specified does not support setting the MAC address. Enabling the fail_over_mac option.<6>bonding: bond0: enslaving ib0 as a backup interface with a down link.
Nov 7 14:31:56 dill kernel: bonding: bond0: Adding slave ib1.
Nov 7 14:31:56 dill kernel: bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0
Nov 7 14:31:56 dill kernel: bonding: bond0: enslaving ib1 as a backup interface with a down link.
Nov 7 14:31:56 dill kernel: bonding: bond0: link status definitely up for interface ib0.
Nov 7 14:31:56 dill kernel: bonding: bond0: making interface ib0 the new active one.
Nov 7 14:31:56 dill kernel: bonding: bond0: first active interface up!
Nov 7 14:31:56 dill kernel: ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
Nov 7 14:31:56 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov 7 14:31:56 dill kernel: bonding: bond0: link status definitely up for interface ib1.
Nov 7 14:31:58 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov 7 14:32:02 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov 7 14:32:07 dill kernel: bond0: no IPv6 routers present
Nov 7 14:32:10 dill kernel: ib0: multicast join failed for 0001:ffff:ffff:0a0a:0081:ffff:c0bb:0a0a, status -22
Nov 7 14:32:12 dill ypbind[14475]: broadcast: RPC: Timed out.
Nov 7 14:32:18 dill kernel: ib0: cm send completion event with wrid 1073741823 (> 64)
Nov 7 14:32:23 dill kernel: ib0: RX drain timing out
Nov 7 14:32:23 dill kernel: bonding: bond0: Warning: the permanent HWaddr of ib0 - 80:06:04:04:fe:80 - is still in use by bond0. Set the HWaddr of ib0 to a different address to avoid conflicts.
Nov 7 14:32:23 dill kernel: bonding: bond0: releasing active interface ib0
Nov 7 14:32:23 dill kernel: bonding: bond0: making interface ib1 the new active one.
Nov 7 14:32:23 dill kernel: ib1: multicast join failed for 0001:0000:ffff:0000:0000:0000:0070:5229, status -22
Nov 7 14:32:23 dill kernel: bonding: bond0: releasing active interface ib1
Nov 7 14:32:23 dill kernel: bonding: bond0: destroying bond bond0.
Nov 7 14:32:23 dill kernel: __dev_addr_discard: address leakage! da_users=1
Nov 7 14:32:23 dill kernel: Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
Nov 7 14:32:23 dill kernel: [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
Nov 7 14:32:23 dill kernel: PGD 250a067 PUD 40a4067 PMD 0
Nov 7 14:32:23 dill kernel: Oops: 0000 [1] SMP
Nov 7 14:32:23 dill kernel: CPU 1
Nov 7 14:32:23 dill kernel: Modules linked in: ib_ipoib ib_cm ib_sa bonding e100 ipv6 sg st sd_mod sr_mod scsi_mod ib_mthca ib_mad ib_core i2c_amd756 i2c_amd8111 i2c_core
Nov 7 14:32:23 dill kernel: Pid: 18870, comm: modprobe Not tainted 2.6.24-rc1 #1
Nov 7 14:32:23 dill kernel: RIP: 0010:[<ffffffff802be76f>] [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
Nov 7 14:32:23 dill kernel: RSP: 0018:ffff8100264e3da8 EFLAGS: 00010246
Nov 7 14:32:23 dill kernel: RAX: 0000000000000000 RBX: ffffffff88111959 RCX: 000000000000000a
Nov 7 14:32:23 dill kernel: RDX: ffff8100264e3fd8 RSI: ffffffff88111959 RDI: 0000000000000000
Nov 7 14:32:23 dill kernel: RBP: ffffffff88111959 R08: ffff810020705d70 R09: ffff810020012ae8
Nov 7 14:32:23 dill kernel: R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
Nov 7 14:32:23 dill kernel: R13: ffff810028d9e000 R14: 0000000000000006 R15: 0000000000515ab0
Nov 7 14:32:23 dill kernel: FS: 00002adaba330720(0000) GS:ffff81002053dac0(0000) knlGS:0000000000000000
Nov 7 14:32:23 dill kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov 7 14:32:23 dill kernel: CR2: 0000000000000028 CR3: 0000000001cca000 CR4: 00000000000006e0
Nov 7 14:32:23 dill kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Nov 7 14:32:23 dill kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Nov 7 14:32:23 dill kernel: Process modprobe (pid: 18870, threadinfo ffff8100264e2000, task ffff8100264ed790)
Nov 7 14:32:23 dill kernel: Stack: 0000000000000000 ffffffff88111959 ffffffff88117680 ffffffff802be87b
Nov 7 14:32:23 dill kernel: 0000000000000006 0000000000000000 ffff810006578700 ffffffff802bfb83
Nov 7 14:32:23 dill kernel: ffff810020705d70 ffff810006578000 0000000000000000 ffffffff88107bd6
Nov 7 14:32:24 dill kernel: Call Trace:
Nov 7 14:32:24 dill kernel: [<ffffffff802be87b>] sysfs_get_dirent+0x21/0x6c
Nov 7 14:32:24 dill kernel: [<ffffffff802bfb83>] sysfs_remove_group+0x1b/0x92
Nov 7 14:32:24 dill kernel: [<ffffffff88107bd6>] :bonding:bond_release_and_destroy+0x3d/0x44
Nov 7 14:32:24 dill kernel: [<ffffffff88107c92>] :bonding:bond_netdev_event+0xb5/0xca
Nov 7 14:32:24 dill kernel: [<ffffffff8046e55e>] notifier_call_chain+0x30/0x54
Nov 7 14:32:24 dill kernel: [<ffffffff8041845d>] unregister_netdevice+0xc3/0x15a
Nov 7 14:32:24 dill kernel: [<ffffffff80418505>] unregister_netdev+0x11/0x17
Nov 7 14:32:24 dill kernel: [<ffffffff880f2be4>] :ib_ipoib:ipoib_remove_one+0x64/0xa5
Nov 7 14:32:24 dill kernel: [<ffffffff88015069>] :ib_core:ib_unregister_client+0x43/0xfe
Nov 7 14:32:24 dill kernel: [<ffffffff880fb071>] :ib_ipoib:ipoib_cleanup_module+0xd/0x2b
Nov 7 14:32:24 dill kernel: [<ffffffff802557b1>] sys_delete_module+0x1b1/0x1e2
Nov 7 14:32:24 dill kernel: [<ffffffff80329b00>] __downgrade_write+0x5f/0xb1
Nov 7 14:32:24 dill kernel: [<ffffffff8026eb2e>] sys_munmap+0x4a/0x56
Nov 7 14:32:24 dill kernel: [<ffffffff8020b68e>] system_call+0x7e/0x83
Nov 7 14:32:24 dill kernel:
Nov 7 14:32:24 dill kernel:
Nov 7 14:32:24 dill kernel: Code: 48 8b 5f 28 48 85 db 74 1c 48 8b 7b 18 48 89 ee e8 f6 b6 06
Nov 7 14:32:24 dill kernel: RIP [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
Nov 7 14:32:24 dill kernel: RSP <ffff8100264e3da8>
Nov 7 14:32:24 dill kernel: CR2: 0000000000000028

The third type of oops is when I did some failovers and then removed both slaves from the bond using
echo -$slave > /sys/class/net/bond0/bonding/slaves

Ethernet Channel Bonding Driver: v3.2.1 (October 15, 2007)
bonding: MII link monitoring set to 100 ms
bonding: bond0: setting mode to active-backup (1).
bonding: bond0: Setting MII monitoring interval to 100.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib0.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib0. Adding VLANs will be blocked as long as ib0 is part of bond bond0
bonding: bond0: enslaving ib0 as a backup interface with a down link.
bonding: bond0: doing slave updates when interface is down.
bonding: bond0: Adding slave ib1.
bonding bond0: master_dev is not up in bond_enslave
bonding: bond0: Warning: enslaved VLAN challenged slave ib1. Adding VLANs will be blocked as long as ib1 is part of bond bond0
bonding: bond0: enslaving ib1 as a backup interface with a down link.
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: link status definitely up for interface ib0.
bonding: bond0: link status definitely up for interface ib1.
bonding: bond0: making interface ib0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bond0: no IPv6 routers present
bonding: bond0: link status definitely down for interface ib0, disabling it
bonding: bond0: making interface ib1 the new active one.
bonding: bond0: link status definitely up for interface ib0.
bonding: bond0: link status definitely down for interface ib1, disabling it
bonding: bond0: making interface ib0 the new active one.
bonding: bond0: Removing slave ib0
bonding: bond0: Warning: the permanent HWaddr of ib0 - 80:08:04:04:fe:80 - is still in use by bond0. Set the HWaddr of ib0 to a different address to avoid conflicts.
bonding: bond0: releasing active interface ib0
bonding: bond0: Removing slave ib1
bonding: bond0: releasing backup interface ib1
bonding: bond0: destroying bond bond0.
Unable to handle kernel NULL pointer dereference at 0000000000000028 RIP:
 [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
PGD 48a0067 PUD 285f067 PMD 0
Oops: 0000 [1] SMP
CPU 1
Modules linked in: ib_ipoib ib_cm ib_sa bonding ipv6 sg st sd_mod sr_mod scsi_mod e100 ib_mthca ib_mad ib_core i2c_amd756 i2c_amd8111 i2c_core
Pid: 16811, comm: bash Not tainted 2.6.24-rc1 #1
RIP: 0010:[<ffffffff802be76f>] [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
RSP: 0018:ffff8100049a5dd8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffffffff880ae959 RCX: 0000000000000002
RDX: ffff8100049a5fd8 RSI: ffffffff880ae959 RDI: 0000000000000000
RBP: ffffffff880ae959 R08: ffff8100205f5d70 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8100300c7000 R15: ffff8100049a5e69
FS: 00002afe4e9870a0(0000) GS:ffff81002053dac0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000028 CR3: 000000000248f000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process bash (pid: 16811, threadinfo ffff8100049a4000, task ffff810001d91750)
Stack: 0000000000000000 ffffffff880ae959 ffffffff880b4680 ffffffff802be87b
 0000000000000006 0000000000000000 ffff81000e081700 ffffffff802bfb83
 ffff8100205f5d70 ffff81000e081000 0000000000000000 ffffffff880a4bd6
Call Trace:
 [<ffffffff802be87b>] sysfs_get_dirent+0x21/0x6c
 [<ffffffff802bfb83>] sysfs_remove_group+0x1b/0x92
 [<ffffffff880a4bd6>] :bonding:bond_release_and_destroy+0x3d/0x44
 [<ffffffff880aa685>] :bonding:bonding_store_slaves+0x29a/0x352
 [<ffffffff8038a0c7>] dev_attr_store+0x1c/0x1e
 [<ffffffff802be03d>] sysfs_write_file+0xca/0xfc
 [<ffffffff802832fa>] vfs_write+0xae/0x130
 [<ffffffff8028343b>] sys_write+0x45/0x6e
 [<ffffffff8020b68e>] system_call+0x7e/0x83

Code: 48 8b 5f 28 48 85 db 74 1c 48 8b 7b 18 48 89 ee e8 f6 b6 06
RIP [<ffffffff802be76f>] sysfs_find_dirent+0x7/0x36
 RSP <ffff8100049a5dd8>
CR2: 0000000000000028

Here's the script I use to set up the bond and do the enslavement:

#!/bin/bash

SLAVE_A=ib0
SLAVE_B=ib1
ADDR=192.168.10.118

#SLAVE_A=eth0
#SLAVE_B=eth1
#ADDR=172.30.10.6

/sbin/modprobe bonding
echo 1 > /sys/class/net/bond0/bonding/mode
echo 100 > /sys/class/net/bond0/bonding/miimon

/sbin/modprobe ib_ipoib

echo +$SLAVE_A > /sys/class/net/bond0/bonding/slaves
echo +$SLAVE_B > /sys/class/net/bond0/bonding/slaves

ifconfig bond0 $ADDR

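To make the three teardown paths easier to replay after the setup script has run, they could be wrapped in a small helper along these lines. This is only a sketch: the names run and teardown_case and the DRY_RUN switch are made up for illustration and are not part of my actual script. The commands themselves are the ones described above, and the slave names assume ib0 and ib1. By default it only prints what it would run.

```shell
#!/bin/bash
# Hypothetical helper: replays one of the three teardown paths from this report.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 and run as
# root to actually execute them.
DRY_RUN=${DRY_RUN:-1}
BOND=bond0
SLAVES="ib0 ib1"

# Print the command line in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        eval "$*"
    fi
}

# $1 selects the scenario: 1, 2 or 3, in the order the oopses are listed.
teardown_case() {
    case "$1" in
        1)  # first oops: remove the bonding module with slaves still enslaved
            run "/sbin/modprobe -r bonding" ;;
        2)  # second oops: remove the ipoib module underneath the bond
            run "/sbin/modprobe -r ib_ipoib" ;;
        3)  # third oops: release the slaves one by one through sysfs
            for s in $SLAVES; do
                run "echo -$s > /sys/class/net/$BOND/bonding/slaves"
            done ;;
        *)  echo "usage: teardown_case 1|2|3" >&2
            return 1 ;;
    esac
}
```

Sourcing the file and calling teardown_case 3 prints the two sysfs writes; for the third case the failovers still have to be triggered by hand before the slaves are released.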
Or.
config-2.6.24-rc1.bz2