So, one of our Tyan S2466 nodes finally gave up the ghost.  PS is ok
(tried a known good spare too), replaced battery on Mobo, the fans spin,
the ethernet flashes, but it won't so much as beep and there's no BIOS
video, let alone disk activity.  Probably a blown CPU or motherboard.
Anyway, the failed hardware is another story.

The odd thing was that I only found this out when a submitted job blew up
because it couldn't connect by PVM to the dead node.  Couldn't ping it
either.  On logging into another node, gstat still showed the dead one,
looking just like the others.  Here the first one is dead and the second
live:

monkey02.cluster    1 (    0/   54) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0, 100.0,   0.0] ON
monkey03.cluster    1 (    0/   50) [  0.00,  0.00,  0.00] [   0.0,   0.0,   0.0, 100.0,   0.0] ON

also

 Dead Hosts: 0
Gexec Hosts: 20

Now normally when I shut down ganglia, or shut down a node, the values
in gstat are correct, yet here, they were not.  The dead node probably 
rolled over and died none too gracefully, so it never TOLD ganglia it
was going away.  Odd though that ganglia seems not to have figured it out
for itself.  The ganglia version is ganglia-core-3.0.4-1mdv2007.1.
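
One way to cross-check what gstat reports is to read gmond's XML dump
directly and look at how long ago each host was last heard from.  The
following is only a rough sketch, not how gstat itself decides.  It
assumes gmond accepts TCP connections on port 8649 (the usual default),
that each HOST element carries TN (seconds since the last report) and
TMAX attributes, and that a host silent for more than 4 x TMAX should be
treated as dead.  The factor of 4 and the function names are my own
assumptions for illustration.

```python
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=5.0):
    """Read the XML dump gmond sends to a TCP client before closing.

    Host and port are assumptions; adjust them to match gmond.conf.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:  # gmond closes the connection after the dump
                break
            chunks.append(data)
    return b"".join(chunks)


def stale_hosts(xml_bytes, factor=4):
    """Return (name, tn, tmax) for each HOST with TN > factor * TMAX.

    The threshold factor is an assumed rule of thumb, not a value
    taken from the ganglia sources.
    """
    root = ET.fromstring(xml_bytes)
    stale = []
    for host in root.iter("HOST"):
        tn = int(host.get("TN", "0"))
        tmax = int(host.get("TMAX", "0"))
        if tmax > 0 and tn > factor * tmax:
            stale.append((host.get("NAME"), tn, tmax))
    return stale


if __name__ == "__main__":
    for name, tn, tmax in stale_hosts(fetch_gmond_xml()):
        print(f"{name}: last heard {tn}s ago (TMAX {tmax}s)")
```

If the dead node shows a large and growing TN here while gstat still
lists it as ON, that would point at how gstat interprets the data rather
than at gmond failing to notice the silence.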

Then I ran "service gmond restart" on that one node, and it came up
showing itself as a gexec host, but none of the others.  It was necessary
to restart gmond on all nodes to pick up the expected 19 gexec hosts.

Seems like that one node exiting abnormally did a number on ganglia.

Anybody else seen this before?

Regards,

David Mathog
[EMAIL PROTECTED]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
