On 12/07/2010 04:35 PM, David Mathog wrote:
>> True, but this is a multi-user system, so I don't know which user's code
>> is triggering the errors, nor do I know what usage pattern causes the
>> errors, so I'm looking for something more consistent. Well, I hope it
>> will be more consistent.
>
> Try setting up a script to take snapshots of the system every 15 seconds
> or so. Something like:
>
>   while [ 1 ]
>   do
>     ( date; top -b -n 1 | head -10 ) >> "$LOGFILE"
>     sleep 15
>   done
>
> Then, using the memory error time stamps, go back through those logs to
> find the most likely culprits.
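
For reference, the "go back through those logs" step could be sketched roughly as below. This is only an illustration, not something from David's message: the function name snapshots_near is hypothetical, and it assumes each snapshot in the log begins with a marker line of the form "### <epoch-seconds>" (for example, written by echo "### $(date +%s)" in place of the plain date call in the loop above).

```shell
#!/bin/sh
# Hypothetical helper: print the logged snapshots taken within WINDOW
# seconds of a memory error.
# Assumption: every snapshot starts with a marker line "### <epoch-seconds>".
snapshots_near() {
    logfile=$1        # snapshot log written by the polling loop
    error_ts=$2       # time of the memory error, in seconds since the epoch
    window=${3:-15}   # how far from the error to look, in seconds

    awk -v t="$error_ts" -v w="$window" '
        # On each marker line, decide whether this snapshot is close enough.
        /^### [0-9]+$/ {
            d = $2 - t
            if (d < 0) d = -d
            show = (d <= w)
        }
        # Print the marker and every following line of a selected snapshot.
        show { print }
    ' "$logfile"
}
```

With GNU date, the error time could be converted and passed in as something like snapshots_near "$LOGFILE" "$(date -d '2010-12-07 16:20:00' +%s)" 30, where the timestamp shown is only a placeholder.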

That will identify the program, but not the problem size or data set
that triggers the error. Using a stress test that I control removes
this detective work. I've decided to go with mprime from the GIMPS
project, which has a stress test feature:

http://www.mersenne.org/

--
Prentice