Matt Funk <[EMAIL PROTECTED]> wrote:
> The reason i want to run on 32 processor though, is that it takes (on 32
> procs) several hours till my program crashes. Also, i would like to be able
> to keep the conditions under which it crashes intact as much as possible
> (i.e. run on 32 procs rather than 1).
>
> Does anyone have any advice?

This is pretty general...

My advice is to be sure the code is absolutely as clean and standards-compliant as possible before you touch a debugger. That means adding -Wall -pedantic -std=c99 (for gcc, or the equivalent for your compiler) and not stopping until every bit of it compiles without a single warning. Then run it through valgrind or the equivalent and fix every memory problem it finds. Then, and only then, try your long run again. If you're lucky this will fix the problem and you won't have to debug anything.

Also a comment: if your program crashes, then pretty much by definition it is not doing sufficient error checking. Rather than "kaboom!", a well-written program will emit a "could not allocate memory" or "invalid pointer" message and then exit gracefully. Yes, I know that level of error checking is often left out of inner loops for speed reasons.

Assuming that your code has a fairly fast cycle, so that several hours represents many, many cycles, you're almost certainly looking for an invalid memory access, a memory leak, or a loop counter running past the end of its loop (for instance, via an unhandled condition). Valgrind can help you find some of these. If the program does any file IO you might also be using up all the file descriptors. (Saw that once in a version of NCBI BLAST, where it kept opening a gi file and not closing it before the next open.)

If all of that fails, and you have easy access to another cluster with a completely different architecture, try building and running there. Often a subtle problem on one CPU type stands out like a sore thumb on another.

If all that fails, then Joe is probably right: start with the dumps and work backwards to at least find out where in the code the crash is taking place. Or run each instance under strace, but be sure to log the output for each compute node to a local file on that node. Then you can put print statements in the relevant locations and run again.
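For what it's worth, here is a minimal sketch of the error checking and print statements described above, assuming C99. The names checked_alloc and TRACE are made up for illustration; they are not from Matt's code or from any library, and a real program would probably want the MPI rank or hostname in the messages as well.

```c
/* Sketch only: checked_alloc and TRACE are hypothetical names. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Print file, line and a message to stderr, then flush so the text
   survives even if the program crashes on the next statement. */
#define TRACE(msg)                                                  \
    do {                                                            \
        fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__, (msg));  \
        fflush(stderr);                                             \
    } while (0)

/* Allocate n elements of elem_size bytes each. On overflow or failure,
   report what was being allocated and exit with a failure status. */
void *checked_alloc(size_t n, size_t elem_size, const char *what)
{
    void *p = NULL;

    /* Zero-sized or unnamed requests are treated as programmer errors. */
    assert(n > 0 && elem_size > 0 && what != NULL);

    if (n <= SIZE_MAX / elem_size)   /* n * elem_size does not overflow */
        p = malloc(n * elem_size);

    if (p == NULL) {
        fprintf(stderr, "could not allocate memory for %s (%zu x %zu bytes)\n",
                what, n, elem_size);
        exit(EXIT_FAILURE);
    }
    return p;
}
```

A call such as checked_alloc(n, sizeof(double), "grid") would then stand in for a bare malloc, and TRACE("entering solver") would mark progress; "grid" and the message text are placeholders.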
Just don't be surprised if, when the code is optimized, those print statements themselves "cure" the problem.

Regards,

David Mathog
[EMAIL PROTECTED]
Manager, Sequence Analysis Facility, Biology Division, Caltech