Now that I have remote IPMI and SOL working, my next step is to try to crash Linux and see whether there are any "pathological crash cases" where I would end up having to go to the server room. So far, whatever I do, I'm pleasantly surprised that "chassis power cycle" always seems to work!
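For readers who want to repeat the experiment, here is a minimal sketch of what such a remote power cycle might look like with ipmitool. It is an illustration under assumptions, not the exact commands used here: the account name `admin` and the wrapper name `bmc` are hypothetical, 10.0.0.26 is the IPMI address from the ping test in this message, and the password is expected in the `IPMI_PASSWORD` environment variable (ipmitool's `-E` option). The `lanplus` interface is assumed because SOL needs IPMI 2.0.

```shell
#!/bin/sh
# Sketch only. BMC_USER is a hypothetical account name; BMC_HOST defaults to
# the IPMI address shown in the ping test. The password is not passed on the
# command line: -E makes ipmitool read it from $IPMI_PASSWORD.
BMC_HOST="${BMC_HOST:-10.0.0.26}"
BMC_USER="${BMC_USER:-admin}"

# Run an ipmitool subcommand against the BMC over the LAN.
# If DRY_RUN is set to a non-empty value, print the command instead of running it.
bmc() {
    ${DRY_RUN:+echo} ipmitool -I lanplus -H "$BMC_HOST" -U "$BMC_USER" -E "$@"
}

# Typical calls (not executed here):
#   bmc chassis power status
#   bmc chassis power cycle
#   bmc sol activate
```

Setting `DRY_RUN=1` first is a cheap way to check the exact command line before pointing it at a production node.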
I tried `echo "c" > /proc/sysrq-trigger` to produce a kernel panic. The node can still be rebooted through its IPMI interface.

What surprised me was that even if I take down my eth interface with an ifdown, the IPMI still works. How does it do that? I am using the shared NIC approach, and I was expecting the IPMI to clam up the moment the OS took a port down.

On Sept 30 Joe Landman said:
>After years of configuring and helping run/manage both, we recommend strongly
>*against* the shared physical connector approach. The extra cost/hassle of
>the extra cheap switch and wires is well worth the money.
>Why do we take this view? Many reasons, but some of the bigger ones are

(I know Joe Landman and others had warned me against this, but I wanted to start by configuring a single shared NIC and then go for two NICs. Just keeping things simple to start with.)

But my single shared NIC results seem good enough already, which is why I was trying to see whether there are any worse kinds of crash that would make it impossible to contact the IPMI.

On Sept 30 Joe Landman said:
>a) when the OS takes the port down, your IPMI no longer responds to arp
>requests. Which means ping, and any other service (IPMI) will fail without a
>continuous updating of the arp tables, or a forced hardwire of those ips to
>those mac addresses.

Another point that surprises me is how the IPMI kept working even after CentOS took the port down. I definitely see Joe Landman's argument about why it shouldn't be responding to ARP requests any more (unless I did something special). That's why I am a bit surprised that my IPMI IP continues to respond to pings even after the primary IP is dead.

#Ping primary IP address
ping 10.0.0.25
[no response]

#Ping IPMI IP address
ping 10.0.0.26
PING 10.0.0.26 (10.0.0.26) 56(84) bytes of data.
64 bytes from 10.0.0.26: icmp_seq=1 ttl=64 time=0.574 ms
64 bytes from 10.0.0.26: icmp_seq=2 ttl=64 time=0.485 ms

Interestingly, arp shows the primary IP as incomplete, but the secondary (IPMI) IP resolves to the correct MAC address. This means that the BMC continues to respond on the second MAC even after the OS took the eth port down. How exactly does this "magic" happen? I'm just curious.

node25                   (incomplete)                  bond0
10.0.0.26        ether   00:24:E8:63:D6:9E   C         bond0

Another mysterious observation: whenever I took eth down via the OS, there was a latent period during which the IPMI stopped responding, but then it somehow resurrected itself and started working again.

Just making sure this isn't a fluke case. Any comments or more disaster scenario simulations are welcome!

--
Rahul
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit http://www.beowulf.org/mailman/listinfo/beowulf