[Beowulf] Re: cluster fails to boot with managed switch, but 5-port switch works OK

Art Poon Wed, 02 Dec 2009 10:22:07 -0800

Dear colleagues,

I am in charge of managing a cluster at our research centre and am stuck with a 
vexing (to me) problem!


(Disclaimer: I am a biologist by training and a mostly self-taught programmer.  
I am still learning about networking and cluster management, so please bear 
with me!)

This is an asymmetric Intel Xeon cluster running 4 compute nodes on CentOS 5.4 
and Scyld ClusterWare 5.  We managed to get it up and running using a dinky 
little NetGear 5-port 10/100/1000 switch.  Now that I'm looking to expand the 
cluster, I need to get the managed switch working (an SMC 8824M, though we have 
several other switches available).

What's got me and the IT guys stumped is that while the compute nodes boot via 
PXE from the head node without trouble on the NetGear, they barf with the SMC.  
To be specific, after the initial boot with a minimal Linux kernel, there is a 
"fatal error" with "timeout waiting for getfile" when the compute node attempts 
to download the provisioning image from the head node.  However, when the 
cluster was running Rocks, before I arrived, it worked fine with the SMC switch.

I've tried resetting the SMC switch to factory defaults (with auto-negotiation 
on).  I've checked /etc/beowulf/modprobe.conf and it doesn't seem to be 
demanding anything exotic.  We've also tried swapping in another SMC switch, 
but that didn't change anything.
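
Since auto-negotiation is the one switch setting mentioned above, one way to 
compare the two switches would be to record what the head node's NIC actually 
negotiated on each of them.  The following is only a rough sketch, not 
something taken from the Scyld tools: it assumes Python 3.7+ and the standard 
`ethtool` utility are available on the machine being checked, and the names 
`parse_ethtool` and `link_report`, as well as the interface name `eth0`, are 
hypothetical.

```python
import subprocess

# Fields of interest in the "Key: value" lines printed by `ethtool <iface>`.
WANTED = ("Speed", "Duplex", "Auto-negotiation", "Link detected")


def parse_ethtool(text):
    """Pull the negotiated link settings out of `ethtool <iface>` output."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only; lines without a colon are skipped.
        key, sep, value = line.strip().partition(":")
        if sep and key in WANTED:
            fields[key] = value.strip()
    return fields


def link_report(iface):
    """Run `ethtool` on one interface (usually needs root) and parse it.

    The interface name is an assumption; substitute the NIC that faces
    the cluster's private network.
    """
    result = subprocess.run(
        ["ethtool", iface],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_ethtool(result.stdout)


# Example use (hypothetical interface name):
#     print(link_report("eth0"))
```

Running this once with the NetGear attached and once with the SMC attached 
would show whether speed or duplex differs between the two setups.  It does 
not by itself explain the getfile timeout.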

I would be grateful if you could weigh in with your expertise.

Thank you,
- Art.


_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
