On Tue, 13 Feb 2001, Lingel, Jason wrote:

> We just brought up 4 dual gigahertz processor Dell 1400 RedHat Linux 7.0
> machines with memory ranging from 256MB to 1GB.  I can't remember the kernel
> number offhand but I used the kernel that Red Hat sent on CD.  Installed as
> a server from the GUI, added NFS server services and didn't include web,
> news or nis.  These are all networked computers.  I don't run X on any of
> these.  They are used to run numerical models that generally tend to pound
> the processors, but that shouldn't be a problem.  I look at them today and 2
> of them are dead in the water -- they're on but you can't ping, rsh or
> telnet.  They are looking at dns servers, but not NIS maps.  No samba.
>
> If anybody has any ideas as to why, I would like to hear them.  If anybody
> has any methodology for troubleshooting these kinds of things, I would
> appreciate that as well.

Before setting up a machine for real, it is good to verify that the
hardware is working properly. I usually do this by making a basic install
with a compiler and the kernel source, then compiling the kernel several
times. If that works, add a '-j' parameter to the MAKE variable in the
kernel's Makefile (see Documentation/smp.txt in the kernel source).
Recompile the kernel several more times and watch top and similar tools to
make sure the load is high and ALL RAM is in use for something. For
machines of your size, you may need several kernel compiles running at the
same time, depending on whether all RAM ends up being used for caching.
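
The repeated-compile loop is easy to script. Below is a minimal Python
sketch, not a finished tool: the function names are made up for
illustration, and the build command shown in the comment (make target, -j
value, source path) is only an assumption that you would replace with
whatever you normally use to build your kernel.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs, cwd=None):
    """Run the same command `runs` times.

    Returns one (returncode, last_output_line) tuple per run, so the
    outcomes of identical builds can be compared afterwards.
    """
    results = []
    for _ in range(runs):
        proc = subprocess.run(
            cmd,
            cwd=cwd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
            universal_newlines=True,
        )
        lines = proc.stdout.strip().splitlines()
        results.append((proc.returncode, lines[-1] if lines else ""))
    return results


def all_passed(results):
    """True only if every run exited with status 0."""
    return all(returncode == 0 for returncode, _ in results)


# Hypothetical usage; command and path are assumptions, adjust to taste:
#   results = run_repeatedly(["make", "-j4", "bzImage"], runs=5,
#                            cwd="/usr/src/linux")
#   print(all_passed(results), file=sys.stderr)
```

On healthy hardware every run of an identical build should end the same
way, so a single False from all_passed() is already worth a closer look.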

If the kernel compile fails with signal 11 (segmentation fault) errors, or
simply fails at different points from run to run, you are looking at
several potential problems, all of which are hardware related:

1) Bad RAM. This leads to corrupted files from file caching. Run
something like memtest86 to verify the RAM. You typically won't see kernel
error messages for this; the machine just locks up.

2) I/O subsystem problems. Something is corrupting files on their way to
disk. Usually you'll see error messages if this is the case.

3) Heat. Drives, CPUs or power supply. Anything that starts to wheeze
when the machine is busy could affect the stability of the machine.
Overheating CPUs typically won't produce error messages, and certainly not
in any traceable pattern. The same goes for a flaky power supply: once the
voltages start fluctuating, nothing will work quite right. Overheating
drives, again, will usually produce kernel messages.
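
The "signal 11, or failing at different points" rule of thumb can be
written down as a small check over the results of several identical build
runs. This is only a sketch: the function name is hypothetical, the
message patterns are assumptions (the exact wording depends on the
compiler version), and treating "some runs pass, some fail" as suspicious
is my own extension of the same idea.

```python
import re

# Assumed message fragments; real compiler output varies between versions.
SIG11_PATTERN = re.compile(r"signal 11|segmentation fault", re.IGNORECASE)


def suspect_hardware(failure_lines):
    """Guess whether repeated build failures point at hardware.

    failure_lines holds one entry per run of the SAME build: None if the
    run succeeded, otherwise the last line of output from the failed run.
    """
    failures = [line for line in failure_lines if line is not None]
    if not failures:
        return False  # every run succeeded, nothing to suspect
    if any(SIG11_PATTERN.search(line) for line in failures):
        return True  # compiler crashed with signal 11
    # Identical input should fail identically; different failure points,
    # or a mix of passing and failing runs, suggests flaky hardware.
    return len(set(failures)) > 1 or len(failures) < len(failure_lines)
```

A build that dies at the same file with the same message every time is
more likely a genuine source or configuration problem than bad hardware.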

That covers 99% of the cases. After that come driver issues, though
well-written drivers produce error messages that make sense.

HTH,

Bill Carlson
-- 
Systems Programmer    [EMAIL PROTECTED]    |  Opinions are mine,
Virtual Hospital      http://www.vh.org/        |  not my employer's.
University of Iowa Hospitals and Clinics        |


