On Tue, 27 Mar 2001, Ted Hilts wrote:

> Context: I have a custom-assembled database server with a system
> drive (boot partition and Linux OS) - 10 gig, a drive for temporary
> work - 15 gig, and an IDE-RAID configuration of 4 x 20 gig drives,
> all software RAID run from the kernel, plus a heavy-duty power supply.
> The system also has an HP CD-writer, a DVD-ROM, a tape drive, and a
> floppy.
>
> What happened:  the computer had been running for about 3 days
> (3 x 24 hrs) when it suddenly crashed.  The error was a kernel panic
> and an inability to access some list.  This happened again while
> booting up, right in the middle of fsck.  I tried a floppy-based
> restore (kernel and system files in RAM, with no hard disk used, or
> at least not at my request) and the system failed again.  So I
> swapped the two 256 meg memory cards and tried again with the regular
> Linux boot (thus bringing up the full Linux system), as I could no
> longer get the rescue system to fail.  It crashed again, same error!
> No big deal, that was just the first thing I looked at.  So I went
> back to the rescue system and tried to create a heavy-load condition,
> but could not get it to crash.  So back to full system mode (but with
> the problem of a contaminated disk, which I finally sorted out after
> the system tried and gave up - I won't get into that).  I put the
> system in maintenance mode and left it running for a very long time
> with no failure.  This was after the software repair episode with the
> 4-disk RAID array.
>
[snip]
>
> But through all of this I feel that I have failed.  I have gone from
> suspecting the physical memory, to the CPU, to fans that are not
> doing their job (one is definitely in trouble, which could put extra
> fan load on the power supply, cause a spike, and affect CPU board
> operation), to grounding conditions because it is part of a network
> (but that's not the problem).  Two possibilities are left in my mind:
> first, the CPU is not what I ordered and is overclocked, causing it
> to miss a beat somewhere and get confused (or it has degraded to this
> because of the initial heat problems); or second, there is a disk
> surface error that over time causes a file resident on the RAID array
> to degrade (due to a surface bit change), resulting in a CRC
> compression error when I do "tar -ztf ...".  So now, based on this
> horrid experience, I have a few questions.
>
> Will a CPU running Linux automatically recover from a "panic"?  If
> so, how long does it take and what is the outcome?  I just assumed
> that no system response means a dead machine, especially when no disk
> lights were flickering in the usual manner.
>
No, a kernel panic means that the system cannot recover, and is
stopping.
>
> For Linux, how does one do a disk surface check?  I understand that
> if there are damaged sections on a disk surface, it is not unusual to
> set up the drive to recognize and bypass those areas.  Is there a way
> I can check this?
>
It depends on the drive, and the controller.  badblocks will search a
file system for bad blocks.  SCSI controllers that have an onboard BIOS
usually include utilities that will find bad blocks and map them out.

Most modern drives will also map out bad blocks automatically.
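
For example, a read-only surface scan of one of the array members might
look something like this (I'm assuming the drive shows up as /dev/hde on
the Promise card; substitute your actual device, and unmount it or run
the scan from the rescue system):

    badblocks -sv /dev/hde

The -s flag shows progress and -v reports each bad block it finds.  The
read-only scan is safe on a drive with data on it; the write test (-w)
is destructive, so don't use it on the array.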

> Third question.  There are numerous mail error messages indicating
> that the CPU load went one point over the 5-point maximum load figure.
> This is a new one to me (for Linux).  Can this not be controlled, as
> on other systems, by limiting the amount of CPU resources that
> offending processes can use?  What will a CPU overload of this sort
> do?  I thought it would just slow down the machine.
>
> My fourth question is kind of dumb, but let me try anyway.  What
> would any of you suppose could cause a fatal condition that occurred
> frequently for 2 days to diminish in frequency?  I can now boot up
> without a problem, run without a problem, etc., as long as I don't
> ask the CPU to do the really heavy task I explained above.  And the
> CRC errors I mentioned don't cause a crash; maybe they just indicate
> some other serious matter.
>
Heavy load can cause memory to fail, or it can cause a heat problem.
It can also cause some interesting power supply voltage changes.  This
is why I like motherboards with sensors that report power supply
voltages, as well as motherboard and CPU temperatures.
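
If the EPoX board has a supported sensor chip, the lm_sensors package
will read those voltages and temperatures from the command line.  A
rough sketch (sensors-detect will tell you which modules your
particular chip needs, so treat this as a starting point):

    sensors-detect      # identify the sensor chip and needed modules
    sensors             # report voltages, fan speeds, and temperatures

Watching the CPU temperature and the +5V/+12V lines while the box is
under that heavy load would show quickly whether either one is sagging.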
>
> Also, I think the problem has something to do with the RAID array,
> since the array runs the UDMA 66 protocol and does not appear to have
> the special cables.  But I don't know how to tell the cables apart
> (the special 66 type versus regular IDE ribbon cables).  I think
> maybe the software IDE-RAID operation is driving things at the 66
> rate and there is an impedance mismatch because of the cables; they
> look like normal IDE cables.  BTW, there are two Promise IDE
> controllers in addition to what is on the CPU board, which is an EPoX
> mainboard.  If there is a cable impedance problem and the drives
> don't sense it (they are supposed to sense this and drop down to the
> 33 rate), then that might explain the problem when a lot of very fast
> data exchanges are occurring on those drives.  But if this were the
> case, one would have expected the problem to show up when I used
> those routines before the problem started.  In other words, I used to
> do these operations before without any problem.  Anybody got any
> ideas?  I'm worried that the tech who gets this problem will do a
> replace-and-try-again process, or not find any problem, and I'll pay
> a fat fee, take it home, go to do the heavy-duty routines, and away
> it crashes again.  See my problem?  If I was just a bit smarter.
>
Are you forcing the UDMA 66 mode in the BIOS, or with hdparm?
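
hdparm will show what mode a drive is actually using, and can force it
down for a test.  A sketch, assuming the first drive on the first
Promise channel is /dev/hde (check dmesg for the real device names):

    hdparm -i /dev/hde         # identification data, including DMA modes
    hdparm -d1 -X66 /dev/hde   # enable DMA, force UDMA mode 2 (33 MB/s)

If the crashes stop after dropping the drives back to the 33 rate, that
would point pretty strongly at the cables.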
>
> Any ideas would be welcome.  I've got about 24 hours before I send it
> in.
>
> Bye-thanks _TED
>
>
I would check the settings for the CPU, to make sure it is not
overclocked, and that the core voltage settings are correct.  Unless
the CPU/heatsink is a factory unit, I would check the surfaces between
the heatsink and the CPU.  Make sure there isn't anything besides
thermal tape or thermal compound between the heatsink and the CPU.  (I
have seen strange things, like part number tags on the CPU!)  If the
heat doesn't get from the CPU to the heatsink, you will have trouble
when you put a load on the system.
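
One quick cross-check from inside Linux: /proc/cpuinfo shows the clock
speed the kernel measured at boot, so you can compare it against the
speed of the part you ordered (field names vary a little between kernel
versions):

    grep -i mhz /proc/cpuinfo

If that figure is noticeably higher than the rated speed of the CPU,
the board is overclocking it.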

Test the memory with memtest86.  This is a good way to find bad memory.
Another good way to test memory is to compile the kernel.  If you can
run 4 or 5 cycles of "make dep clean bzImage" without errors, that is a
good sign.  Now, if you want to give the CPU and power supply a good
workout, running the distributed.net client will generate a fair amount
of heat in the CPU, as well as giving the CPU supply regulators a
workout.  I have used this to test CPU cooling many times.  If you have
a CPU temperature monitor, you can watch the temperature rise.  On some
CPU/MB combos, it will also do strange things with the -5V supply...
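
A throwaway shell loop makes it easy to run those compile cycles
unattended and stop at the first failure (this assumes a configured
kernel tree in /usr/src/linux):

    cd /usr/src/linux
    for i in 1 2 3 4 5; do
        make dep clean bzImage || break
    done

Random segfaults, internal compiler errors, or corrupted object files
partway through are classic signs of flaky memory or an overheating
CPU; five clean passes is a pretty good sign that both are OK under
load.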


I hope my random thoughts help...
Mikkel
-- 

    Do not meddle in the affairs of dragons,
 for you are crunchy and taste good with ketchup.



_______________________________________________
Redhat-list mailing list
[EMAIL PROTECTED]
https://listman.redhat.com/mailman/listinfo/redhat-list
