Hi,
You may have seen a few posts from me about this box. I keep trying to
isolate the problem as much as possible, and it is now narrowed down to
a more specific setup in the current kernel, but this box still WILL NOT
be stable whatsoever if used with the amd64.mp kernel.
I am running out of ideas to narrow it down further, so if anyone has
suggestions as to what I could look for, I would appreciate it.
I am starting to think that it might be running out of buffer space when
writing to the drives, though I am not sure whether that is logical.
However, I have found a few ways to make it more stable, but not crash
free. All of it may be 100% related to the writing speed to the SAS
drives.
To prove this point, or to discard it, I would like to find a way to
really control the writing speed and at the same time be able to monitor
the system variables, buffers, or whatever makes sense to isolate it
more. I say this because if I transfer slowly, by using slow old servers
over the network, I can transfer big files to that Sun box, but as soon
as I increase the writing speed to the drives, I reach the point where I
crash it.
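To make the idea of a controlled write speed concrete, here is a minimal
sketch of a rate-limited writer. It is only an illustration under
assumptions: Python is not part of the base system and would have to be
installed, and the function name throttled_write, the chunk size and the
choice to fsync after every chunk are mine, not anything taken from the
kernel or the driver.

```python
import os
import time


def throttled_write(path, total_bytes, rate_bytes_per_sec,
                    chunk_size=64 * 1024,
                    sleep=time.sleep, clock=time.monotonic):
    """Write total_bytes of zeros to path at no more than rate_bytes_per_sec.

    Hypothetical helper for testing only. Returns the bytes written.
    """
    chunk = b"\0" * chunk_size
    written = 0
    start = clock()
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_size, total_bytes - written)
            f.write(chunk[:n])
            f.flush()
            # Force the data out to the drive now, so the rate we set is
            # the rate the disk really sees (assumption: this is wanted).
            os.fsync(f.fileno())
            written += n
            # Earliest time at which `written` bytes may have been sent.
            target = start + written / rate_bytes_per_sec
            delay = target - clock()
            if delay > 0:
                sleep(delay)
    return written
```

For example, throttled_write("/var/test", 10 * 1024**3, 4 * 1024**2)
would write a 10GB file at about 4MB/sec. Raising the rate step by step
between runs should show at what speed the box falls over, if the theory
holds at all.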
It is ALWAYS writing only that will crash it. Reading as hard as you
want, so far anyway, just doesn't trigger the problem.
I proved that by mounting the partitions RO and noatime in fstab. Yes, I
need both, as noatime is important in my tests anyway.
I can create a 10GB file and then do cp /var/test /dev/null and it will
transfer that file at 32MB/sec and not crash. I can do multiple reads,
etc.
However, as soon as ANY write is done, as small as it might be, when the
drive is very busy (or, I guess, when the driver buffer, or controller,
or whatever it is that I am trying to isolate is full or loaded), then a
simple small write to the drive like echo 'test' >/tmp/test will crash
that box every time. Even a simple ssh login to that box will make it go
south when it tries to update /var/log/authlog.
I was also able to increase the writing speed to the drive before it
crashes, or the size I can transfer before it crashes, if I also disable
the USB virtual support for the SCSI cdrom that is provided by the BIOS
on that box.
So, I am looking for ideas on where I could look to come up with more
details and possibly fix that box. It's much better than it was with the
4.2 release by far, as many issues were fixed, including auto
negotiation of the network card, etc. on that box. The short of it is
that the box is not usable at all when you load the amd64.mp kernel on
it, but is now finally stable, or at least more resistant anyway, when
using the single processor amd64 kernel, and so far I haven't been able
to crash it in more than two months of testing if I use the i386 single
or mp kernel.
But as far as I can see, there are still bug(s) to be found in the
amd64.mp kernel and I am looking for ideas on how to narrow it down
more. I am running out of tricks so far and need more ideas.
Maybe some sysctl variable can be tried to test my theory, maybe not.
But based on my tests, it looks like it might be some kind of buffer
that runs out, given that slower writing doesn't crash it, and I want to
prove or disprove that, but do not know how, other than doing it the way
I did, using slow computers or setting the port speed to 10Mb as an
example, etc. That may also be a very stupid idea, but it looks
possible.
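For the monitoring side, something like the sketch below is what I have
in mind: run a status command at a fixed interval and print each result
with a timestamp, so the last samples before a crash are still visible.
This is only an illustration. The helper name poll_command is made up,
and which command to poll is exactly what I do not know yet; it could be
vmstat, or sysctl with whatever variable name turns out to be relevant.
Output goes to stdout on purpose rather than to a file on the box, since
a write under load is what triggers the crash.

```python
import subprocess
import sys
import time


def poll_command(argv, interval_sec, count, out=sys.stdout):
    """Run argv `count` times, interval_sec apart, printing each result.

    Hypothetical helper. Returns a list of (timestamp, output) tuples.
    """
    samples = []
    for i in range(count):
        result = subprocess.run(argv, capture_output=True, text=True)
        stamp = time.time()
        text = result.stdout.strip()
        samples.append((stamp, text))
        # One line per sample so it is easy to read back after a crash.
        out.write("%.3f %s\n" % (stamp, text.replace("\n", " | ")))
        out.flush()
        if i + 1 < count:
            time.sleep(interval_sec)
    return samples
```

Running it in an ssh session that was opened before the load starts,
while the write test runs in another one, would at least show which
values move just before the box goes down.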
Most likely there is something in the driver that crashes it; however,
my understanding is that the driver here is the same regardless of
whether it is used in the single processor or the mp kernel.
I tried to see what might be different in the kernel between the two
that might affect this, but I have to admit that at this point I am in
over my head trying to find which part is the most logical one to look
at.
So, any suggestions for testing that anyone might have would be welcome.
I have totally given up on using the amd64.mp kernel on these boxes and
I am happily using the i386.mp, but I still would love to find the final
answers as almost all bugs for that box in the last 6 months have been
resolved. It's much better than it was, but not home free yet.
Best,
Daniel