I'm attempting to set up the bonding driver on two gretap interfaces, gretap15 and gretap16, but I'm observing behaviour that is unexpected (to me). The underlying interfaces for those two are, respectively, intra15 (IPv4: 10.88.15.100/24) and intra16 (IPv4: 10.88.16.100/24). Both are e1000 virtual network cards connected through virtual cables, so I would exclude any hardware issues. As a peer, I have another Linux system configured similarly (IPv4: 10.88.15.200 on intra15, 10.88.16.200 on intra16).

The gretap tunnels work as expected. They have the following IPv4 addresses:

              host            peer
  gretap15    10.188.15.100   10.188.15.200
  gretap16    10.188.16.100   10.188.16.200

When the tunnels are not enslaved to the bond interface, I'm able to exchange packets through them using the internal IP addresses.

I then set up the bonding driver as follows:

  # ip link add bond-15-16 type bond
  # ip link set bond-15-16 type bond mode active-backup
  # ip link set gretap15 down
  # ip link set gretap16 down
  # ip link set gretap15 master bond-15-16
  # ip link set gretap16 master bond-15-16
  # ip link set bond-15-16 mtu 1462
  # ip addr add 10.42.42.100/24 dev bond-15-16
  # ip link set bond-15-16 type bond arp_interval 100 arp_ip_target 10.42.42.200
  # ip link set bond-15-16 up

I do the same on the peer system, swapping the interface address and the ARP target IP address.

At this point, IP communication using the addresses on the bond interfaces works as expected. E.g.

  # ping 10.42.42.200

gets responses from the peer. Running tcpdump on the peer shows the GRE packets coming into intra15, and identical ICMP packets coming through gretap15 and bond-15-16.

If I then disconnect the (virtual) network cable of intra15, the bonding driver switches to gretap16, as the GRE tunnel over intra15 can no longer pass packets. However, despite having primary_reselect=0, when I reconnect the network cable of intra15, the driver doesn't switch back to gretap15. In fact, it doesn't even attempt to send any probes through it.

Fiddling with the cables (e.g. reconnecting intra15 and then disconnecting intra16) and/or bringing the bond interface down and up usually results in the driver ping-ponging a bit between gretap15 and gretap16 before settling on gretap16 (but never on gretap15, it seems). Sometimes, instead, the driver marks both slaves down and does nothing further until manual intervention (e.g. manually selecting a new active_slave, or down -> up).
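
For reference, here is a minimal sketch of how the state described above can be inspected and how the manual intervention can be performed. It assumes the standard bonding procfs and sysfs entries (/proc/net/bonding/<bond> and /sys/class/net/<bond>/bonding/active_slave). The BOND variable and the function names bond_summary and bond_force_active are only illustrative helpers, not part of any tool.

```shell
#!/bin/sh
# Sketch only: helper names and the BOND variable are illustrative.
BOND="${BOND:-bond-15-16}"

# Print the lines of the bonding status file that matter for this problem:
# mode, primary, currently active slave, per-slave link state, ARP monitor.
# $1: optional path to a status file (defaults to the procfs entry).
bond_summary() {
    status_file="${1:-/proc/net/bonding/$BOND}"
    grep -E '^(Bonding Mode|Primary Slave|Currently Active Slave|Slave Interface|MII Status|ARP Polling Interval|ARP IP target)' "$status_file"
}

# Manually select a new active slave (requires root).
# $1: slave interface name, e.g. gretap15.
bond_force_active() {
    echo "$1" > "/sys/class/net/$BOND/bonding/active_slave"
}
```

Running bond_summary before and after reconnecting intra15 shows which slave the driver considers active and whether it regards gretap15 as up; bond_force_active gretap15 corresponds to the manual active_slave selection mentioned above.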

Trying to ping the gretap15 address of the peer (10.188.15.200) from the host while gretap16 is the active slave results in ARP traffic being temporarily exchanged on gretap15. I'm not sure whether it originates from the bonding driver, as the generated requests seem to be the Cartesian product of all address pairs on the network segments of gretap15 and bond-15-16 (e.g. who-has 10.188.15.100 tell 10.188.15.100, who-has 10.188.15.100 tell 10.188.15.200, ..., who-has 10.42.42.200 tell 10.42.42.200).

uname -a (same on the peer system):

  Linux fo-gw 4.19.0-9-amd64 #1 SMP Debian 4.19.118-2+deb10u1 (2020-06-07) x86_64 GNU/Linux

Am I misunderstanding how the driver works? Have I made any mistakes in the configuration?

Best regards,
Riccardo P. Bestetti