On Mon, 2007-04-09 at 11:30 -0600, Matt Funk wrote:
> The reason i want to run on 32 processor though, is that it takes (on
> 32 procs) several hours till my program crashes. Also, i would like to
> be able to keep the conditions under which it crashes intact as much
> as possible (i.e. run on 32 procs rather than 1).
>
> Does anyone have any advice? I am open to try out other things as well
> if possible. I am just starting to learn debugger techniques for a
> parallel program.

What you are trying to do isn't uncommon; some of us do it most days.
Having a job which exhibits the problem with only 32 procs and several
hours isn't a bad reproducer, I've certainly seen much worse. Debugging
at this scale isn't exactly interactive but it's small enough to be able
to make timely progress.

My advice would be first and foremost to look at the core file. I assume
your program is receiving a SEGV and exiting? Core files can be
problematic, partly because they aren't always enabled and partly
because to extract anything useful from them you need to run the
debugger in the same environment the application ran in, which isn't
always as easy as it sounds if you are using modules or something like
that. Often looking at the stack trace at the point of the crash will
give you a good clue as to where to look, and most of the time further
debugging is a thought process so no more tools are needed.

If you are having problems getting a stack trace in the normal way there
are two techniques that can be used. Firstly, you can write a wrapper
script to control the execution of your program; this can check the exit
status and, if a core file was generated, extract a stack trace from it
automatically and save it to disk somewhere. This is useful because it
saves time and is also guaranteed to have the same environment as the
application, so it avoids the problem I mentioned above.
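
As a rough illustration of the wrapper approach, here is a minimal
Python sketch, not a finished tool. It assumes gdb is on the PATH of the
compute node and that cores land in the working directory as "core" or
"core.<pid>" (check /proc/sys/kernel/core_pattern on your system). The
function names and the backtrace.<host>.<pid>.txt output name are made
up for the example.

```python
#!/usr/bin/env python3
"""Hypothetical per-rank wrapper: run the real program and, if it dies from
a signal and leaves a core file, save a gdb backtrace to disk."""
import os
import resource
import signal
import subprocess
import sys


def fatal_signal(returncode):
    """Return the signal name if the child was killed by a signal, else None.

    subprocess reports death-by-signal as a negative return code."""
    if returncode >= 0:
        return None
    return signal.Signals(-returncode).name


def gdb_backtrace_command(program, core):
    """Build a non-interactive gdb invocation that prints every thread's stack."""
    return ["gdb", "--batch", "-ex", "thread apply all bt", program, core]


def find_core(pid, directory="."):
    """Assumption: the core is named 'core.<pid>' or 'core' in `directory`."""
    for name in (f"core.{pid}", "core"):
        path = os.path.join(directory, name)
        if os.path.exists(path):
            return path
    return None


def enable_core_dumps():
    """Raise the soft core-size limit to the hard limit; children inherit it."""
    _, hard = resource.getrlimit(resource.RLIMIT_CORE)
    resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))


def main(argv):
    enable_core_dumps()
    child = subprocess.Popen(argv)  # inherits the wrapper's environment
    returncode = child.wait()
    name = fatal_signal(returncode)
    if name is None:
        return returncode  # normal exit: pass the status through
    print(f"{argv[0]} (pid {child.pid}) killed by {name}", file=sys.stderr)
    core = find_core(child.pid)
    if core is not None:
        out = f"backtrace.{os.uname().nodename}.{child.pid}.txt"
        with open(out, "w") as fh:
            # gdb runs here, in the same environment the application had
            subprocess.run(gdb_backtrace_command(argv[0], core),
                           stdout=fh, stderr=subprocess.STDOUT, check=False)
    return 128 - returncode  # shell convention: 128 + signal number


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

You would launch this in place of the real binary, for example
"mpirun -np 32 ./wrapper.py ./my_app <args>" (wrapper.py and my_app are
placeholder names), so that every rank gets its own wrapper and any rank
that crashes leaves its own backtrace file.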

The other option is to catch SEGV in the application and have the signal
handler print a message and spin, allowing you to log in and attach a
debugger by hand. This is often best for complex problems where you want
more state than can be automated, but it does require you to be present
at the time of the crash, which isn't great for reproducers which take
several days to run :(

printf() isn't nearly as useful in parallel applications as in serial
ones, as it's hard to strike the right balance between printing the
information you want and being drowned in information. Multi-gigabyte
log files are far too easy to generate using this method, although as
you close in on a bug printf does become more useful. All too often,
however, simply adding printf changes the timing so much that bugs are
no longer reproducible.

As someone else mentioned, Valgrind is a very useful tool. It should run
on most clusters (assuming you are on i686 or x86_64), and if it doesn't,
send a mail to the valgrind-users list and I'm sure someone, quite
possibly me, will help you to get it running. The downside is that it
will make your code quite a bit slower and increase memory usage, so
this may not be an option for you, but you should definitely try it if a
simple core dump isn't giving you enough information.

Other advice would be to set MALLOC_CHECK_=2 to enable integrity
checking in the libc malloc implementation and, if using ia64, to
download and compile gdb from source, otherwise you might find it's not
all that accurate at times.

TotalView and DDT are both great if you have a licence for either of
them, although I must confess to not having used them for the situation
you describe.

Ashley,
_______________________________________________
Beowulf mailing list, [EMAIL PROTECTED]
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf