On 10/09/18 11:16, Joe Landman wrote:

> If you have dumps from the crash, you could load them up in the
> debugger.  Would be the most accurate route to determine why that was
> triggered.

Thanks Joe, after a bit of experimentation we've now successfully got a
crash dump.  It seems to confirm what I thought was the case: the
process is off in kernel space dealing with an APIC interrupt (a timer
in this case) when a SIMD exception gets raised.

crash> bt
PID: 138341  TASK: ffff9fd7eb3c6eb0  CPU: 27  COMMAND: "shuangTwoPhaseE"
 #0 [ffff9ff02ee6bc38] machine_kexec at ffffffff938629da
 #1 [ffff9ff02ee6bc98] __crash_kexec at ffffffff93916692
 #2 [ffff9ff02ee6bd68] crash_kexec at ffffffff93916780
 #3 [ffff9ff02ee6bd80] oops_end at ffffffff93f1d738
 #4 [ffff9ff02ee6bda8] die at ffffffff9382f96b
 #5 [ffff9ff02ee6bdd8] math_error at ffffffff9382cca8
 #6 [ffff9ff02ee6be98] do_simd_coprocessor_error at ffffffff9382cec8
 #7 [ffff9ff02ee6bec0] simd_coprocessor_error at ffffffff93f28c9e
 #8 [ffff9ff02ee6bf48] apic_timer_interrupt at ffffffff93f26791
    RIP: 00002b1b5d406828  RSP: 00007fff1f596148  RFLAGS: 00000293
    RAX: 00000000000005c8  RBX: 0000000000002bce  RCX: 0000000002c979e0
    RDX: 00000000000005cb  RSI: 0000000002dcedf0  RDI: 00000000000000b9
    RBP: 00007fff1f5a25d8   R8: 0000000000002d00   R9: 00000000000000b4
    R10: 0000000000000000  R11: 00000000026bcb48  R12: ffff9ff05c1461e8
    R13: 0000000000000000  R14: ffff9ff05c146200  R15: 0000000000010082
    ORIG_RAX: ffffffffffffff10  CS: 0033  SS: 002b
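
As a sanity check on that register dump, here is a rough sketch of how
to tell which context the saved frame belongs to.  The low two bits of
CS give the privilege level, and on x86-64 Linux user addresses live in
the lower half of the address space.  The helper names are made up for
illustration, and the address limit assumes 48-bit (4-level paging)
virtual addresses.  Note that this only describes what the timer
interrupt interrupted, not where the SIMD exception itself was raised.

```python
# Hypothetical helpers for reading the saved frame in the backtrace above.
# Assumption: 48-bit virtual addresses, so the user half ends below
# 0x0000800000000000.
USER_HALF_LIMIT = 0x0000_8000_0000_0000


def privilege_level(cs: int) -> int:
    """Return the privilege level held in the low two bits of CS."""
    return cs & 0x3


def in_user_half(addr: int) -> bool:
    """True if addr lies in the lower (user) half of the address space."""
    return addr < USER_HALF_LIMIT


# Values copied from the register dump.
cs = 0x0033
rip = 0x00002B1B5D406828
rsp = 0x00007FFF1F596148

print("privilege level:", privilege_level(cs))
print("RIP in user half:", in_user_half(rip))
print("RSP in user half:", in_user_half(rsp))
```

With CS at 0033 and both RIP and RSP in the lower half, the frame saved
under apic_timer_interrupt looks like user-mode state, which fits the
picture of the timer firing while the application was running and the
SIMD exception turning up only once we were already in the kernel.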

The kernel code that handles this is pretty short; in the RHEL7 kernel
it basically comes down to:

Are we in user space?
No?  Oh dear.
Is there a fixup registered for this address?
No?  OK, goodbye cruel world...
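
To make that decision flow concrete, here is a toy model of it in
Python.  It is only a sketch of the logic as described above, not the
kernel source: the function name, the fixup table and the return values
are all invented for illustration, and the user-space branch assumes
the usual outcome of the process being sent SIGFPE.

```python
# Toy model of the decision flow described above (not kernel code).
# All names here are hypothetical.

def handle_math_error(user_mode: bool, ip: int, fixups: dict) -> str:
    """Decide what happens when a SIMD/FPU exception is delivered.

    user_mode -- True if the exception hit while in user space
    ip        -- instruction pointer at the time of the exception
    fixups    -- table mapping faulting addresses to recovery addresses
    """
    if user_mode:
        # Normal case: the process gets a signal (assumed SIGFPE).
        return "signal"
    if ip in fixups:
        # Kernel-mode fault at a known spot: resume at the fixup address.
        return "fixup -> 0x%x" % fixups[ip]
    # Kernel mode and nothing registered for this address: oops.
    return "die"


fixups = {0x1000: 0x2000}  # made-up addresses

print(handle_math_error(True, 0x400000, fixups))
print(handle_math_error(False, 0x1000, fixups))
print(handle_math_error(False, 0x3000, fixups))
```

The crash above corresponds to the last case: the exception arrives
while in kernel mode at an address with no fixup, so the only path left
is die().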

I've reached out to the maintainers of the arch/x86/ part of the tree
in case they have any general ideas on whether this is all the kernel
could be expected to do.  The only feedback so far is that yes, this is
odd, plus a query to another developer about whether some additional
checks that are done when the process is in user space might also be
applicable if that process has called into the kernel at that point.

My suspicion is that the process is off doing some AVX stuff when
the timer occurs and an exception is either generated or just happens
to be delivered from the AVX unit at a bad time.

Going to see if I can persuade Easybuild to compile OpenFOAM without
AVX-512 optimisations first and, if that doesn't fix it, try turning
off different things until the problem goes away.

All the best,
Chris
--
 Chris Samuel  :  http://www.csamuel.org/  :  Melbourne, VIC
_______________________________________________
Beowulf mailing list, Beowulf@beowulf.org sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
