We were having trouble restarting from our homegrown parallel magnetohydrodynamic code's checkpoint files. The files could be read, but odd things happened in the runs afterward. Eventually we figured out that a parallel run restarted from a checkpoint differed from a serial run restarted from the same checkpoint.

After much gnashing of teeth and rending of apparel, we found that the checkpoint files were being read incorrectly across NFS. That let us simplify our search for the problem. We first found that the local md5 digest [openssl dgst -md5 <file>] of an NFS-copied version of a file differed from the digest of the original. What was interesting was that the copy either took forever -- like 10 or 20 minutes for a 1 GB file -- when the final result was bad, or about a minute when the file came through perfectly. I'm guessing that whatever error checking gets done on the packets was rejecting so many of them that eventually a bad one slipped through that it couldn't tell was bad.
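For anyone who wants to reproduce the comparison, here is a minimal sketch of the check we were doing. The paths and mount point (/data, /mnt/node02, checkpoint.dat) are made up for illustration:

    # On the node that owns the disk: digest the original file.
    openssl dgst -md5 /data/checkpoint.dat

    # On a remote node: copy the file across the NFS mount, timing it,
    # since the slow copies were the ones that came out corrupted.
    time cp /mnt/node02/data/checkpoint.dat /tmp/checkpoint.copy
    openssl dgst -md5 /tmp/checkpoint.copy

When the two digests disagree, the copy is bad.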

When we found that running the md5 digest on a remote file produced a different result than running it on the node where the disk is locally mounted, our tests got simpler. And shorter still, once we found we could get fairly frequent failures with files of 10 MB or smaller. Clearly we had an NFS failure, probably associated with hardware.
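The short version of that test, with made-up paths and a hypothetical server called node02:

    # On node02, which owns the disk: create a small test file and digest it.
    dd if=/dev/urandom of=/data/test10MB bs=1M count=10
    openssl dgst -md5 /data/test10MB

    # On a client node: digest the same file read through the NFS mount.
    openssl dgst -md5 /mnt/node02/data/test10MB

On a healthy pair of nodes the two digests always match; on the bad pairs the client-side digest frequently differs.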

This was all between two specific nodes of our small cluster. [Old hardware generally: AMD Athlon 32-bit single-processor (MSI KT4V) and dual-processor (Tyan...) motherboards, both running Red Hat 9 with the 2.4.20-8 kernel, though one is the SMP version; NetGear GA311 NICs; and a NetGear GS108 8-port copper 1 Gb/s switch. The single-processor motherboards have 32-bit PCI slots, so their network speeds are limited to about 300 Mb/s as shown by netpipe. All of the LEDs at the ends of the cables show 1000 Mb/s connections.]

Then we started checking other pairs. Some were fine. Some were bad in the same way. So we replaced the switch, changing to a 16-port NetGear GS216. That seemed to cure most of the problem. But we continued to have problems copying files that live on one particular single-processor machine from the other nodes.

That's where we are now. The md5 digest run on that machine consistently shows the same result, whereas the digest for that file produced on a remote machine is almost stochastic. In some cases it will eventually settle into the right answer, and then the speed goes WAY up. I suppose that happens because the file request can then be served from the client's local cache. But why doesn't that happen after the client has received bad blocks?
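The pattern shows up clearly if you just digest the remote file in a loop and watch both the checksum and the timing. This is only a sketch; /mnt/badnode/testfile is a made-up path:

    # Run on an NFS client of the flaky machine.
    # Early passes re-read the file over the wire and can return differing
    # digests; once the answer stabilizes, later passes come back from the
    # client's cache and the timing drops sharply.
    for i in 1 2 3 4 5; do
        time openssl dgst -md5 /mnt/badnode/testfile
    done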

Most, if not all, of the original network cards in those machines have gone bad and been replaced over the last few years, so I decided to try a brand new GA311. No joy there: it still serves up the wrong data. I guess the motherboard's PCI bus controller is hinky, but I'm far from sure.

Because of space limitations and their age, we are in the process of upgrading and replacing all the machines we have of that configuration, but I'm still curious what the problem could be.

Suggestions?  Comments?


Mike

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
