I found a neat ... feature ... of Linux while getting g03 (Gaussian 03) running in SMP mode on cluster nodes. Long story, but the folks I am doing this for don't have, or don't want to use, Linda, so they asked us to help them get g03 running SMP-parallel. That part wasn't painful, and we now have it integrated into SGE and our SICE interface as well.

The basic problem is that we get a kernel exception in the VFS layer, but only when running on 2 or more CPUs on an SMP node, and only on the SuSE 9.3 nodes. The other nodes are RHEL 3 based (2.4 kernel, but hey, it's really stable).

I don't want to post a nasty-looking trap here.

The problem occurs with both xfs and jfs. I haven't had the chance to try ext3 yet, though if the issue is in the VFS layer, I can't see how changing the underlying filesystem is going to alter the layer (VFS) above it.

The net effect is that g03 runs great on the 2.4 based machines, but gets SIGKILLs when running on the 2.6 based SuSE 9.3 machines. It looks like the app is tickling an OS bug. I can trigger this trap repeatably, though it seems to occur at "random" places (well, not really). The way Gaussian runs, it has "links", which are binary modules that each execute a particular portion of the calculation (it's pretty neat, really). Each link is read in from the disk. The VFS bug gets triggered whether the filesystem is local or remote.
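
For anyone who wants to check their own nodes for the same thing, here is a minimal sketch of a log scan. It is not what we ran here. It assumes the trap shows up in the kernel log with one of the usual markers ("Oops", "kernel BUG at", "Unable to handle kernel"); the marker list and the function name find_kernel_traps are my own illustrative choices, not anything from Gaussian or SuSE.

```python
import re

# Markers that kernels commonly print on a trap.
# Assumed list for illustration; it is not exhaustive.
TRAP_PATTERN = re.compile(r"Oops|kernel BUG at|Unable to handle kernel")


def find_kernel_traps(log_text):
    """Return (line_number, line) pairs for log lines that look like a kernel trap.

    log_text is the full text of a kernel log, e.g. the output of dmesg
    or the contents of a syslog file. Line numbers are 1-based.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if TRAP_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits
```

Feed it the text of dmesg or the syslog file from a node right after a g03 job gets killed. Comparing the timestamps of any hits against the Gaussian output should show which link was being loaded at the time.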

Have any Gaussian users out there seen this? Does a kernel upgrade fix it? Inquiring minds want to know ...

--

Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics LLC,
email: [EMAIL PROTECTED]
web  : http://www.scalableinformatics.com
phone: +1 734 786 8423
fax  : +1 734 786 8452 or +1 866 888 3112
cell : +1 734 612 4615

_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
