Re: [Beowulf] NFS Read Errors

Joe Landman Mon, 03 Dec 2007 17:38:18 -0800

Hi Michael:

Michael H. Frese wrote:

We were having trouble restarting from our homegrown parallel magnetohydrodynamic code's checkpoint files. The files could be read, but funny things happened in the run afterward. Eventually we figured out that the restarted parallel run differed from the serial restarted run from the same checkpoint.
After much gnashing of teeth and rending of apparel, we found that the checkpoint files were being read incorrectly across NFS. That let us simplify our search for the problem. We first found that the local md5 digest [openssl dgst -md5 (file...)] on an NFS cp'ed version of the file


        md5sum filename

does the same thing with a slightly simpler syntax. There is mounting evidence that you should use sha1sum rather than md5sum.

was different from that produced on the original file. What was interesting was that the copy either took forEVER -- like 10 minutes or 20 minutes for a 1 GB file -- when the final result was bad or it took about a minute when the file was perfect. I'm guessing that whatever error checking that gets done on the packets was rejecting so many it finally got a bad packet it couldn't tell was bad.

Sounds a great deal like a bad disk/disk system or something mucking with your connection to the data. A 1 GB file, even at 1 MB/s, takes 1000 seconds, or about 17 minutes. If you have a disk which keeps timing out, or has bad blocks, and keeps retrying, well, stuff like this can happen, especially on old kernels (and old hardware).


Could also be a RAM error.
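
The arithmetic above can be checked, and run in reverse, to see what sustained rate the reported copy times imply. This is only a sketch: it assumes 1 GB means 10^9 bytes and ignores all protocol overhead, and the function names are made up for illustration.

```python
GB = 10**9  # assumption: decimal gigabyte
MB = 10**6  # assumption: decimal megabyte


def transfer_seconds(size_bytes, rate_bytes_per_s):
    """Seconds needed to move size_bytes at a sustained rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


def implied_rate(size_bytes, seconds):
    """Sustained rate in bytes/s implied by an observed copy time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return size_bytes / seconds


if __name__ == "__main__":
    secs = transfer_seconds(1 * GB, 1 * MB)
    print(f"1 GB at 1 MB/s: {secs:.0f} s ({secs / 60:.1f} min)")
    # Copy times reported in the quoted message: ~1 min (good), 10-20 min (bad)
    for minutes in (1, 10, 20):
        rate = implied_rate(1 * GB, minutes * 60) / MB
        print(f"1 GB in {minutes} min implies about {rate:.1f} MB/s")
```

By this estimate the one-minute "good" copies ran at roughly 17 MB/s, while the 10 to 20 minute "bad" copies averaged under 2 MB/s.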

When we found that doing the md5 digest on a remote file produced a different result than doing it on the processor on which the disk was mounted, our tests got simpler. And shorter, still, after we found that we could get fairly frequent failures with 10 MB files or smaller. Clearly we had an NFS failure, probably associated with hardware.

Yes. I would venture a guess that you are seeing *lots* of errors in your /var/log/syslog or /var/log/messages files.
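
For anyone wanting to script the digest comparison described in the quoted text, here is a minimal sketch using Python's standard hashlib. The function names and the idea of passing two paths on the command line are assumptions for illustration; the reference digest still has to come from a copy you trust, ideally one read on the machine that owns the disk.

```python
import hashlib
import sys


def file_digest(path, algorithm="sha1", chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoint files need not fit in RAM."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def compare_copies(reference_path, suspect_path, algorithm="sha1"):
    """Return (match, reference_digest, suspect_digest) for two paths."""
    ref = file_digest(reference_path, algorithm)
    sus = file_digest(suspect_path, algorithm)
    return ref == sus, ref, sus


# Hypothetical usage: python check.py <trusted copy> <copy read over NFS>
if __name__ == "__main__" and len(sys.argv) == 3:
    ok, ref, sus = compare_copies(sys.argv[1], sys.argv[2])
    print("reference:", ref)
    print("suspect:  ", sus)
    print("MATCH" if ok else "MISMATCH")
```

Keep in mind that a repeated read on the same client may be served from its page cache rather than from the wire, so a second run is not necessarily a fresh test of the network path.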

This was all between two specific nodes of our small cluster. [Old hardware generally: AMD Athlon 32-bit single (MSI KT4V) and dual (Tyan...) chip motherboards both running Redhat 9 one with the 2.4.20-8 kernels, though one is the smp version; NetGear GA311 NICs; and a


Owie...

NetGear GS108 8 port Copper 1 GB/s switch. The single processor motherboards have 32-bit PCI slots so their network speeds are limited to 300 kbps as shown by netpipe. All of the LEDs at the ends of the cables show 1000Mb connections.]

300 kbps? That's usually read as 300 kilobits per second, or about 37.5 kB/s, which is about the average speed of various DSL lines. (Abbreviations are *very* important to get right: kB/s is not the same as kb/s.)


I hope you mean 30 MB/s (or 240 Mb/s).
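
The bit/byte conversions behind those numbers, as a small sketch (function names are made up for illustration; lower-case b is bits, upper-case B is bytes):

```python
BITS_PER_BYTE = 8


def bits_to_bytes(rate_in_bits):
    """Convert a rate in bits/s (any prefix) to bytes/s (same prefix)."""
    return rate_in_bits / BITS_PER_BYTE


def bytes_to_bits(rate_in_bytes):
    """Convert a rate in bytes/s (any prefix) to bits/s (same prefix)."""
    return rate_in_bytes * BITS_PER_BYTE


if __name__ == "__main__":
    print(f"300 kb/s = {bits_to_bytes(300)} kB/s")
    print(f"30 MB/s  = {bytes_to_bits(30)} Mb/s")
```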

Then we started checking other pairs. Some were fine. Some were bad in the same way. So we replaced the switch, changing to a 16 port NetGear GS216. That seemed to cure most of the problem. But we continued to


We have seen bad switches a few times.

have problems copying a file on one particular single processor machine from the others.
That's where we are now. The md5 digest run on that machine consistently shows the same result, whereas the digest for that file produced on a remote machine will be almost stochastic. In some cases it will eventually settle in to the right answer, and then the speed goes WAY up. I suppose that happens because the file request can be served from the local machine's cache. But why doesn't it happen after it received bad blocks?

I am guessing you are using TCP NFS mounts as well? TCP forces retries in the event of bad packets. UDP doesn't force this, but the NFS protocol will try. Possible culprits include RAM errors, bad cables, burnt switches, and machines with interrupt problems (old machines often shared interrupts without being able to do a very good job of it).
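
A digest only says that two copies differ, not where. If a known-good copy and a bad copy can both be kept on disk, a block-by-block comparison shows how much of the file is wrong and at which offsets, which may help narrow down the list of suspects. The following is a sketch only; the function name and the 4096-byte block size are arbitrary choices, not anything from this thread.

```python
def differing_blocks(path_a, path_b, block_size=4096):
    """Return the indices of block_size-byte blocks that differ.

    If one file is shorter, the blocks past its end count as differing.
    """
    bad = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if not a and not b:  # both files exhausted
                break
            if a != b:
                bad.append(index)
            index += 1
    return bad
```

Multiplying a returned index by block_size gives the byte offset of the first byte in that block.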

Most, if not all of the original network cards in those machines went bad and have been replaced in the last few years, so I decided to try a brand new GA311. No joy there. It still gives out the wrong info. I guess the motherboard PCI bus controller is hinky, but I'm far from sure.

Did you try a new cable? I have had a few cables go bad; usually they are marginal to begin with.

We are in the process of upgrading and thus replacing all the machines we have of that configuration due to space limitations and their age, but I'm still curious what the problem could be.

There are quite a few possibilities unfortunately. Unless you plan to use these existing machines for quite a while longer, it might be less painful to shut off the malfunctioning node.


Suggestions?  Comments?


2.4.20?  Athlons?  I would say a serious hardware/OS refresh is in order :)



--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
       http://jackrabbit.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 866 888 3112
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
