On 10/09/18 11:16, Joe Landman wrote:
> If you have dumps from the crash, you could load them up in the debugger. Would be the most accurate route to determine why that was triggered.
Thanks Joe, after a bit of experimentation we've now successfully got a crash dump. It seems to confirm what I thought was the case, in that the process is off in kernel space dealing with an APIC interrupt (a timer in this case) when a SIMD exception gets raised.

crash> bt
PID: 138341  TASK: ffff9fd7eb3c6eb0  CPU: 27  COMMAND: "shuangTwoPhaseE"
 #0 [ffff9ff02ee6bc38] machine_kexec at ffffffff938629da
 #1 [ffff9ff02ee6bc98] __crash_kexec at ffffffff93916692
 #2 [ffff9ff02ee6bd68] crash_kexec at ffffffff93916780
 #3 [ffff9ff02ee6bd80] oops_end at ffffffff93f1d738
 #4 [ffff9ff02ee6bda8] die at ffffffff9382f96b
 #5 [ffff9ff02ee6bdd8] math_error at ffffffff9382cca8
 #6 [ffff9ff02ee6be98] do_simd_coprocessor_error at ffffffff9382cec8
 #7 [ffff9ff02ee6bec0] simd_coprocessor_error at ffffffff93f28c9e
 #8 [ffff9ff02ee6bf48] apic_timer_interrupt at ffffffff93f26791
    RIP: 00002b1b5d406828  RSP: 00007fff1f596148  RFLAGS: 00000293
    RAX: 00000000000005c8  RBX: 0000000000002bce  RCX: 0000000002c979e0
    RDX: 00000000000005cb  RSI: 0000000002dcedf0  RDI: 00000000000000b9
    RBP: 00007fff1f5a25d8   R8: 0000000000002d00   R9: 00000000000000b4
    R10: 0000000000000000  R11: 00000000026bcb48  R12: ffff9ff05c1461e8
    R13: 0000000000000000  R14: ffff9ff05c146200  R15: 0000000000010082
    ORIG_RAX: ffffffffffffff10  CS: 0033  SS: 002b

The kernel code for it is pretty short; in the RHEL7 kernel it basically comes down to:

  Are we in user space? No? Oh dear.
  Is there a fixup registered for this address? No? OK, goodbye cruel world...

I've reached out to the maintainers of the arch/x86/ part of the tree in case they had any general ideas on whether this was all the kernel could be expected to do. The only feedback so far is that yes, this is odd, plus a query to another developer about whether some additional checks that are done when the process is in user space might also be applicable if that process has called into the kernel at that point.

My suspicion is that the process is off doing some AVX stuff when the timer occurs and an exception is either generated or just happens to be delivered from the AVX unit at a bad time.
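
For anyone following along, the two-question check described above can be modelled roughly as below. This is only an illustrative Python sketch of the decision flow, not the kernel's actual code: the function names, the return strings, the example kernel selector 0x10 and the address 0x1000 are all assumptions made for the example. The one value taken from the dump is CS: 0033, whose low two bits give the privilege level of the frame shown in the register dump. That SIGFPE is what gets delivered in the user-space case is also my assumption, not something stated above.

```python
# Illustrative model only: names and return values are hypothetical,
# they are not identifiers from the RHEL7 kernel source.

def cpl_from_cs(cs):
    """Return the privilege level held in the low two bits of an x86 CS selector."""
    return cs & 0x3


def handle_math_error(cs, fault_ip, fixups):
    """Model the decision flow: user space? fixup registered? otherwise die.

    cs       -- saved code segment selector of the interrupted context
    fault_ip -- instruction address at which the exception was raised
    fixups   -- set of addresses that have a registered fixup (assumed shape)
    """
    if cpl_from_cs(cs) == 3:
        # User space: the task gets a signal (assumed here to be SIGFPE).
        return "signal"
    if fault_ip in fixups:
        # Kernel space, but a fixup is registered for this address.
        return "fixup"
    # Kernel space and no fixup: the oops path seen in the backtrace.
    return "die"


# CS from the register dump above; 0x33 has both low bits set.
print(cpl_from_cs(0x33))

# Hypothetical kernel-mode selector and address, with no fixups registered.
print(handle_math_error(0x10, 0x1000, set()))
```

The second call takes the same path as the backtrace: not in user space, no fixup found, so the only remaining branch is die().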
Going to see if I can persuade EasyBuild to compile OpenFOAM without AVX-512 optimisations first and then, if that doesn't fix it, try turning off different things until the problem goes away.

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Melbourne, VIC