From what Cisco's website says about PortFast, I wonder how it did you any good at all in the middle of a transition. Any misconfiguration could block all ports, with some configurations flagged as "type-inconsistent." I love these puzzles and will watch this thread carefully. Sorry I cannot be of more help.

Jonathan Engwall

On Thu, Feb 28, 2019, 2:54 PM Joshua Baker-LePain <joshua.bakerlep...@gmail.com> wrote:

> I've got a few-hundred node cluster here that I've had humming along
> for several years. All the nodes are set to PXE boot. The default
> entry in the PXE menu is to boot off the local hard drive, and we drop
> in a kickstart if need be (new nodes, node refreshes, I just feel like
> it, etc). I'm currently moving the cluster from CentOS-6 to CentOS-7.
> At the same time, I have ~200 nodes with onboard 10GBase-T NICs
> (X540-AT2 based) that had been plugged into 1Gbps switches (from
> Brocade) that I'm moving over to 10Gbps switches (Cisco Nexus
> C93120TX). The ones I'm currently working with have fairly short
> cable runs (<7ft), and are using Cat 6a cables.
>
> I'm running into a major issue where a large percentage (well over 50%)
> of attempted PXE kickstarts fails. The failures occur in multiple
> places, but all seem to be related to slow initialization of the
> network interface. I've seen:
>
> 1) dracut-initqueue timeouts leading to "/dev/root does not exist"
>
> 2) the node loads the kickstart file but then fails while trying to
> read the repo metadata.
>
> 3) the kickstart actually succeeds, but during reboot a bunch of
> network services (NFS mounts, SGE, etc) attempt to start but fail
> because the network isn't fully up yet.
>
> To fix things, I've tried:
>
> 1) adding "inst.waitfornet=120 rd.net.timeout.carrier=120
> rd.net.timeout.iflink=100 rd.net.timeout.ifup=120 rd.net.dhcp.retry=5"
> to the kernel parameters in the PXE menu *and* the default grub
> parameters
>
> 2) adding "LINKDELAY=120" to the ifcfg-$INTERFACE scripts (still using
> the network service here, not NetworkManager)
>
> 3) turning on PortFast on the network ports, i.e. "spanning-tree port
> type edge".
>
> Nothing has really made a huge difference. PortFast seemed to at
> first, but larger scale tests still have rather high failure rates.
> Has anyone seen anything like this? And, more importantly, has anyone
> fixed it? Thanks!
>
> --
> Joshua Baker-LePain
> QB3 Shared Cluster Sysadmin
> UCSF
> _______________________________________________
> Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
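
Since every failure mode described above points at the link coming up slowly, it may help to measure how long the carrier actually takes to appear on an affected node before tuning the timeouts further. The following is only a minimal sketch, not something from the thread: the function name wait_for_carrier, its parameters, and the interface name in the usage note are hypothetical, and it assumes a Linux node where the kernel exposes link state in /sys/class/net/<iface>/carrier.

```python
import time
from pathlib import Path


def wait_for_carrier(iface, timeout=120.0, interval=0.5,
                     sysfs_root="/sys/class/net"):
    """Poll <sysfs_root>/<iface>/carrier until it reads "1".

    Returns the number of seconds waited, or None if the timeout
    expires first. The name and parameters are illustrative only.
    """
    carrier = Path(sysfs_root) / iface / "carrier"
    start = time.monotonic()
    while True:
        try:
            if carrier.read_text().strip() == "1":
                return time.monotonic() - start
        except OSError:
            # The read fails while the interface is administratively
            # down, or if the interface does not exist yet; keep polling.
            pass
        if time.monotonic() - start >= timeout:
            return None
        time.sleep(interval)
```

Calling wait_for_carrier("eth0") right after bringing the interface up (substitute the real interface name) and logging the result across a batch of nodes would show whether the measured delay comes anywhere near the 120-second values already tried. Note that carrier only reflects physical link; a port still held in a spanning-tree listening or learning state would show carrier while not yet forwarding traffic, so a DHCP or ping check is still needed on top of this.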