Art Poon wrote:
Dear colleagues,
[...]
What's got me and the IT guys stumped is that while the compute nodes boot via PXE from the head node without trouble on the NetGear, they barf with the SMC. To be specific, after the initial boot with a minimal Linux kernel, there is a "fatal error" with "timeout waiting for getfile" when the compute node attempts to download the provisioning image from head. However, when they were running Rocks before I arrived, the cluster worked fine with the SMC switch.
Is it the switch or the dhcp/bootp/tftp setup that's the problem? Are you sure the tftp daemon is up, and that bootp is configured correctly?
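One rough way to check whether the tftp daemon answers at all is to send it a read request by hand from a machine on the same network segment. The sketch below is only an illustration: it builds a TFTP read request as laid out in RFC 1350 and reports what comes back. The function names, host address, and file name are placeholders invented for this example, not anything taken from the Scyld or Rocks configuration.

```python
import socket
import struct


def build_rrq(filename: str, mode: str = "octet") -> bytes:
    """Build a TFTP read request: opcode 1, filename, NUL, mode, NUL."""
    return (
        struct.pack("!H", 1)
        + filename.encode("ascii") + b"\x00"
        + mode.encode("ascii") + b"\x00"
    )


def describe_reply(packet: bytes) -> str:
    """Summarise the first packet the server sends back."""
    if len(packet) < 4:
        return "malformed"
    opcode, value = struct.unpack("!HH", packet[:4])
    if opcode == 3:  # DATA: the server is up and the file is readable
        return f"data block {value}"
    if opcode == 5:  # ERROR: the server is up but refused the request
        message = packet[4:].split(b"\x00", 1)[0].decode("ascii", "replace")
        return f"error {value}: {message}"
    return f"unexpected opcode {opcode}"


def probe_tftp(host: str, filename: str, port: int = 69, timeout: float = 5.0) -> str:
    """Send one read request and describe the reply, or report a timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_rrq(filename), (host, port))
        try:
            # The reply comes from a new server-side port, so accept any source.
            packet, _ = sock.recvfrom(2048)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"socket error: {exc}"
    return describe_reply(packet)
```

Calling something like `probe_tftp("192.168.1.1", "pxelinux.0")` (both values hypothetical) through each switch in turn would show whether the request gets an answer in one case and times out in the other. The probe never acknowledges the data block, so the server will retransmit a few times and then give up; that is harmless for a one-off test.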
Switches sometimes have broadcast storm suppression turned on, or worse, sometimes they have spanning tree turned on. You want the switch to be as dumb as you can possibly make it for most Linux clusters. Fast, but dumb.
I've tried resetting the SMC switch to factory defaults (with auto-negotiate on). I've checked the /etc/beowulf/modprobe.conf and it doesn't seem to be demanding anything exotic. We've tried swapping out to another SMC switch but that didn't change anything.
This sounds more like a problem in the server software stack than in the switch. Could you describe the stack? Are you using Scyld or Rocks for provisioning?
Rocks is quite sensitive to configuration issues, and really doesn't like altered configurations (it is possible to do, though non-trivial).
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics Inc.
email: [email protected]
web  : http://scalableinformatics.com
       http://scalableinformatics.com/jackrabbit
phone: +1 734 786 8423 x121
fax  : +1 866 888 3112
cell : +1 734 612 4615
