Thanks for the feedback.
After copying /boot and /bin from another machine and fighting with
grub for far too long, the system is back online. (I had to edit
grub.conf to change the virtual disk names: the CentOS rescue disk saw
the boot disk as hd1, but when grub actually started it saw it as hd0.)
The logs don't show a root command line that specifically took out those
directories. They do show a bunch of scripts being run. My best guess
is that one of them did something like this:
AVAR=`command that failed and returned an empty string`
rm -rf ${AVAR}/b*
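
For what it's worth, here is a minimal sketch of how a script could guard
against that failure mode. It assumes a POSIX shell; guarded_rm is a
made-up name for illustration, not something taken from the scripts in
the logs.

```shell
#!/bin/sh
# guarded_rm PREFIX: remove PREFIX/b* only when PREFIX is non-empty.
# (Hypothetical helper, not from the scripts that actually ran.)
guarded_rm() {
    # ${1:?message} makes the subshell exit with an error when $1 is
    # unset or empty, so rm is never started with a path like "/b*".
    ( rm -rf -- "${1:?empty prefix, refusing to run rm}"/b* )
}
```

With an empty argument, guarded_rm "" returns a non-zero status and
deletes nothing. Note that "set -u" alone would not have helped here,
since a variable holding an empty string is still set.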
It seems unlikely that a low-level controller failure would have removed
just those files and directories without leaving a file system that
fsck sees as corrupt.
That said, there is something hardware related going on, since
/var/log/messages has a lot of these:
Mar 16 12:37:27 mandolin kernel: sd 7:0:0:0: [sdb] Sense Key : Recovered Error [current] [descriptor]
Mar 16 12:37:27 mandolin kernel: Descriptor sense data with sense descriptors (in hex):
Mar 16 12:37:27 mandolin kernel: 72 01 04 1d 00 00 00 0e 09 0c 00 00 00 00 00 00
Mar 16 12:37:27 mandolin kernel: 00 4f 00 c2 40 50
Mar 16 12:37:27 mandolin kernel: sd 7:0:0:0: [sdb] ASC=0x4 ASCQ=0x1d
That group has several other similar Dell servers, and this is the only
one logging these. sdb1 holds /boot, and sdb2 is where LVM keeps its
information.
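
To make the hex dump above easier to read, here is a rough sketch that
pulls out a few fields. It assumes the layout I understand from the SCSI
specs: descriptor-format sense data (response code 0x72) with the sense
key, ASC and ASCQ in bytes 1-3, followed at byte 8 by an ATA Status
Return descriptor (type 0x09). decode_sense is a hypothetical name, and
the offsets should be checked against SPC/SAT before relying on them.

```shell
#!/bin/sh
# decode_sense "72 01 04 ...": print selected fields from descriptor-format
# sense data that carries an ATA Status Return descriptor.
# (Hypothetical helper; byte offsets are an assumption based on SPC/SAT.)
decode_sense() {
    # Split the argument into positional parameters, one per hex byte,
    # so that sense byte N is available as $(N+1).
    set -- $1
    if [ "$#" -lt 22 ] || [ "$1" != "72" ] || [ "$9" != "09" ]; then
        echo "unsupported sense data" >&2
        return 1
    fi
    # Sense byte 1 holds the sense key in its low nibble; bytes 2 and 3
    # are the ASC and ASCQ.
    key=$(( 0x$2 & 0x0f ))
    echo "sense_key=$key asc=0x$3 ascq=0x$4"
    # The descriptor starts at sense byte 8. The low bytes of LBA mid and
    # LBA high, then device and status, sit at descriptor offsets
    # 9, 11, 12 and 13 (sense bytes 17, 19, 20 and 21).
    echo "lba_mid=0x${18} lba_high=0x${20} device=0x${21} status=0x${22}"
}

decode_sense "72 01 04 1d 00 00 00 0e 09 0c 00 00 00 00 00 00 00 4f 00 c2 40 50"
```

For the logged bytes, LBA mid/high come out as 4f/c2, which as far as I
know is the pattern an ATA SMART RETURN STATUS command gives back when no
threshold has been exceeded. If so, these entries may come from a SMART
health poll being passed through the controller rather than from a
failing disk, but that is a guess; something like "smartctl -H /dev/sdb"
would be one way to check.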
Regards,
David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing