I've got a few-hundred-node cluster here that I've had humming along for several years. All the nodes are set to PXE boot. The default entry in the PXE menu boots off the local hard drive, and we drop in a kickstart if need be (new nodes, node refreshes, I just feel like it, etc). I'm currently moving the cluster from CentOS-6 to CentOS-7. At the same time, I have ~200 nodes with onboard 10GBase-T NICs (X540-AT2 based) that had been plugged into 1Gbps switches (from Brocade), and I'm moving them over to 10Gbps switches (Cisco Nexus C93120TX). The nodes I'm currently working with have fairly short cable runs (<7 ft) and are using Cat 6a cables.
I'm running into a major issue where a large percentage (well over 50%) of attempted PXE kickstarts fail. The failures occur in multiple places, but all seem to be related to slow initialization of the network interface. I've seen:

1) dracut-initqueue timeouts leading to "/dev/root does not exist"
2) the node loads the kickstart file but then fails while trying to read the repo metadata
3) the kickstart actually succeeds, but during reboot a bunch of network services (NFS mounts, SGE, etc) attempt to start and fail because the network isn't fully up yet

To fix things, I've tried:

1) adding "inst.waitfornet=120 rd.net.timeout.carrier=120 rd.net.timeout.iflink=100 rd.net.timeout.ifup=120 rd.net.dhcp.retry=5" to the kernel parameters in the PXE menu *and* the default grub parameters
2) adding "LINKDELAY=120" to the ifcfg-$INTERFACE scripts (still using the network service here, not NetworkManager)
3) turning on PortFast on the network ports, i.e. "spanning-tree port type edge"

Nothing has really made a huge difference. PortFast seemed to help at first, but larger scale tests still have rather high failure rates.

Has anyone seen anything like this? And, more importantly, has anyone fixed it?

Thanks!

-- 
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing