On Thu, Aug 19, 2010 at 01:49:17PM -0400, Peter Bisroev wrote:
> Hi All,
>
> I have tried searching the mailing lists but did not seem to find the
> answer to the issue that I am seeing. I apologize for the long email,
> but more info is better than less :)
>
...
> For the tests performed all switches were unmanaged Netgear JFS516 and
> JGS516. The hosts were wired as follows:
> ------------------------------------------
> test00:em2  ----- 1:sw0a:2 ----- 1:sw0c
> test01:em2  ----- 1:sw0b:2 ----- 2:sw0c
>
> test00:em3  ----- 1:sw1a:2 ----- 1:sw1c
> test01:em3  ----- 1:sw1b:2 ----- 2:sw1c
>
> test00:bge1 ----- 1:sw2a:2 ----- 1:sw2c
> test01:re1  ----- 1:sw2b:2 ----- 2:sw2c
> ------------------------------------------
> For example, the last line shows that re1 on test01 was connected to
> port 1 on switch sw2b and port 2 from sw2b was connected to port 2 on
> switch sw2c. (I would have loved to draw an ASCII diagram but it got
> too complex.) The reason for so many switches is to approximate
> different failure scenarios.
>
> For the first test unplug the cable between sw0a:2 and sw0c:1, and as
> expected the log on test01 shows:
> ------------------------------------------
> test01 /bsd: carp0: state transition: BACKUP -> MASTER
>
> # ifconfig carp
> carp0: flags=28843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,NOINET6> mtu 1500
>         lladdr 00:00:5e:00:01:01
>         priority: 0
>         carp: MASTER carpdev em2 vhid 1 advbase 1 advskew 100
>         groups: carp
>         status: master
>         inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
> carp1: flags=28843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,NOINET6> mtu 1500
>         lladdr 00:00:5e:00:01:02
>         priority: 0
>         carp: BACKUP carpdev em3 vhid 2 advbase 1 advskew 100
>         groups: carp
>         status: backup
>         inet 192.168.2.1 netmask 0xffffff00 broadcast 192.168.2.255
> carp2: flags=28843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,NOINET6> mtu 1500
>         lladdr 00:00:5e:00:01:03
>         priority: 0
>         carp: BACKUP carpdev re1 vhid 3 advbase 1 advskew 100
>         groups: carp
>         status: backup
>         inet 192.168.3.1 netmask 0xffffff00 broadcast 192.168.3.255
> ------------------------------------------
>

Yes, since you disconnected the link between the carp interfaces without
dropping their physical connections, both will become MASTER.
This normally results in havoc, since with bad luck all traffic will flow
to the wrong box. This is the typical problem with too much redundancy
(the result has more error cases and is often less stable).

> For the second test unplug the cable between test00:em2 and sw0a:1.
> Now the results are not what I have expected. The log on test00 shows
> the following:
> ------------------------------------------
> test00 /bsd: carp0: state transition: MASTER -> INIT
> test00 /bsd: carp: carp0 demoted group carp to 1
> test00 /bsd: carp1: state transition: MASTER -> BACKUP
> test00 /bsd: carp2: state transition: MASTER -> BACKUP
>
> # ifconfig carp
> carp0: flags=8803<UP,BROADCAST,SIMPLEX,MULTICAST> mtu 1500
>         lladdr 00:00:5e:00:01:01
>         priority: 0
>         carp: INIT carpdev em2 vhid 1 advbase 1 advskew 0
>         groups: carp
>         inet 192.168.1.1 netmask 0xffffff00 broadcast 192.168.1.255
> carp1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>         lladdr 00:00:5e:00:01:02
>         priority: 0
>         carp: BACKUP carpdev em3 vhid 2 advbase 1 advskew 0
>         groups: carp
>         inet 192.168.2.1 netmask 0xffffff00 broadcast 192.168.2.255
> carp2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
>         lladdr 00:00:5e:00:01:03
>         priority: 0
>         carp: BACKUP carpdev bge1 vhid 3 advbase 1 advskew 0
>         groups: carp
>         inet 192.168.3.1 netmask 0xffffff00 broadcast 192.168.3.255
> # ifconfig em2
> em2: flags=8b43<UP,BROADCAST,RUNNING,PROMISC,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
>         lladdr 00:15:17:b8:db:3e
>         priority: 0
>         media: Ethernet autoselect (none)
>         status: no carrier
> ------------------------------------------
>
> And the log on test01 shows the following:
> ------------------------------------------
> test01 /bsd: carp1: state transition: BACKUP -> MASTER
> test01 /bsd: carp2: state transition: BACKUP -> MASTER
> test01 /bsd: carp0: state transition: BACKUP -> MASTER
> ------------------------------------------
>
> Plugging the cable back in brings the system back to the original
> state as shown in
> the logs below:
> ------------------------------------------
> test00 /bsd: carp0: state transition: INIT -> BACKUP
> test00 /bsd: carp: carp0 demoted group carp to 0
> test00 /bsd: carp0: state transition: BACKUP -> MASTER
> test00 /bsd: carp1: state transition: BACKUP -> MASTER
> test00 /bsd: carp2: state transition: BACKUP -> MASTER
>
>
> test01 /bsd: carp0: state transition: MASTER -> BACKUP
> test01 /bsd: carp1: state transition: MASTER -> BACKUP
> test01 /bsd: carp2: state transition: MASTER -> BACKUP
> ------------------------------------------
>
> The same behavior can be reproduced with any of the other interfaces.
> Swapping the roles of both machines yields the same result. Repeating
> the test on the 4.8-current branch yields the same result as well.
>
> Based on the above examples, what is the reason that the behavior of
> carpdev is different between the two tests? Physically, the only
> difference as seen by the host test00 in the second test is that the
> underlying interface em2 of carp0 changes status from 'active' to 'no
> carrier'. Is this behavior expected? Or should the second test behave
> as the first one? What is the reason for carpdev to demote the entire
> carp group on test00?

Yes, this is expected behaviour. You unplugged the cable to the physical
interface. Because of that the link state goes down and carp(4) notices
it. Since that interface is no longer usable, the carp demotion counter
is raised and so the other system (where all 3 carps are still fine)
will take over (the backup box is considered in better shape and becomes
master).

In your first case no interface failed and so both systems became master
(at least on that interface where the LAN was cut in two segments). In
this case both systems are equally healthy and so the demotion counter
is not raised.

> If this seems like a bug I would be more than happy to assist with testing.
>

It is not a bug, it is the intended behaviour.

-- 
:wq Claudio
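
The difference between the two tests can be illustrated with a toy model.
This is only a sketch of the reasoning above, not the carp(4) kernel code:
the names Host, state and simulate are made up for illustration, the
demotion counter is assumed to be one per interface without carrier, and
preemption is assumed to be enabled so that the whole carp group follows
the counter. The advskew values (0 on test00, 100 on test01) are taken
from the ifconfig output quoted above.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """Hypothetical stand-in for one CARP box (not a real API)."""
    name: str
    advskew: int
    demote: int = 0  # group demotion counter, assumed shared by all carp ifs


def state(me, peer, carrier, hears_peer):
    """Simplified election for one carp interface on host `me`."""
    if not carrier:
        return "INIT"    # physical link is down, interface is unusable
    if not hears_peer:
        return "MASTER"  # no advertisements seen, so nobody to defer to
    # Lower (demote, advskew) is considered the healthier, preferred box.
    mine = (me.demote, me.advskew)
    theirs = (peer.demote, peer.advskew)
    return "MASTER" if mine < theirs else "BACKUP"


def simulate(cut_path=None, unplugged=None):
    """cut_path: index of the interface pair whose inter-switch link is cut.
    unplugged: (host name, interface index) whose cable is pulled."""
    test00 = Host("test00", advskew=0)
    test01 = Host("test01", advskew=100)
    hosts = [test00, test01]
    carrier = {(h.name, i): True for h in hosts for i in range(3)}
    if unplugged is not None:
        carrier[unplugged] = False
    # Assumption: each interface without carrier raises the counter by one.
    for h in hosts:
        h.demote = sum(1 for i in range(3) if not carrier[(h.name, i)])
    result = {}
    for i in range(3):
        up00 = carrier[("test00", i)]
        up01 = carrier[("test01", i)]
        path_ok = i != cut_path and up00 and up01
        result[f"carp{i}"] = (
            state(test00, test01, up00, path_ok),
            state(test01, test00, up01, path_ok),
        )
    return result


if __name__ == "__main__":
    print("test 1:", simulate(cut_path=0))
    print("test 2:", simulate(unplugged=("test00", 0)))
```

Each result is a (test00, test01) pair per carp interface. In the first
call no counter changes and only carp0 ends up with two masters; in the
second call test00's counter goes to 1 and test01 wins on every interface.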